Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

RigidSSL is a novel rigidity-aware self-supervised learning framework that improves AI-driven protein design through two-phase geometric pretraining. The method achieves up to 43% better designability in unconditional protein generation and 5.8% higher success in zero-shot motif scaffolding by learning from 432,000 AlphaFold structures and 1,300 molecular dynamics trajectories. This approach enables more realistic modeling of protein conformational ensembles, particularly for drug targets like GPCRs.

RigidSSL: A New AI Framework Bridges Geometric Learning Gap in Protein Design

A novel geometric pretraining framework called RigidSSL (Rigidity-Aware Self-Supervised Learning) has been introduced to overcome critical limitations in AI-driven de novo protein design. The method front-loads the learning of protein geometry before generative fine-tuning, significantly improving designability, novelty, and the modeling of realistic protein dynamics, according to a new paper (arXiv:2603.02406v1). This approach directly addresses the inability of current models to jointly learn geometry and design, their reliance on limited local representations, and their failure to capture rich conformational dynamics.

The Core Innovation: A Two-Phase Geometric Pretraining Strategy

The RigidSSL framework operates in two distinct, complementary phases to build a comprehensive understanding of protein structure. Phase I (RigidSSL-Perturb) learns foundational geometric priors from a massive dataset of 432,000 predicted structures from the AlphaFold Protein Structure Database, using simulated perturbations to teach the model about structural robustness. Phase II (RigidSSL-MD) then refines these representations on 1,300 molecular dynamics (MD) trajectories, enabling the AI to capture physically realistic transitions and conformational ensembles that are critical for function.
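To make the idea of "simulated perturbations" concrete, here is a minimal sketch of how a perturbed view of a structure could be generated from per-residue rigid frames. It is an illustration under stated assumptions, not the paper's procedure: the frame representation (a translation plus a unit quaternion in w, x, y, z order), the function names, and the noise parameters `sigma_trans` and `max_angle` are all hypothetical.

```python
import math
import random


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def random_small_rotation(max_angle, rng):
    """Unit quaternion for a random axis and an angle in [0, max_angle] radians."""
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in axis)) or 1.0
    angle = rng.uniform(0.0, max_angle)
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


def perturb_frames(frames, sigma_trans=0.5, max_angle=0.2, seed=0):
    """Return a noisy copy of a list of (translation, quaternion) frames.

    sigma_trans and max_angle are hypothetical noise scales chosen for
    illustration; they are not values taken from the paper.
    """
    rng = random.Random(seed)
    perturbed = []
    for x, q in frames:
        # Gaussian jitter on the translation component.
        x_new = tuple(c + rng.gauss(0.0, sigma_trans) for c in x)
        # Compose a small random rotation with the original orientation.
        q_new = quat_mul(random_small_rotation(max_angle, rng), q)
        perturbed.append((x_new, q_new))
    return perturbed


# Example: two residue frames, both starting at the identity orientation.
frames = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)),
          ((3.8, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))]
noisy = perturb_frames(frames)
```

Pairing each original structure with a perturbed copy like `noisy` gives a self-supervised model two views of the same protein to relate, which is one plausible reading of how perturbation-based pretraining teaches structural robustness.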

Underpinning both phases is a novel, bi-directional rigidity-aware flow matching objective. Unlike methods that treat atomic movements independently, this objective jointly optimizes the translational and rotational dynamics of protein regions, maximizing mutual information between different conformations. This allows the model to understand proteins as cohesive, semi-rigid bodies—a key to accurate generation and design.
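The joint treatment of translation and rotation can be illustrated with the interpolation step commonly used in rigid-body flow matching: translations move along a straight line while rotations move along the shortest arc between orientations. The sketch below is a generic example under that assumption and is not the paper's objective; the quaternion frame representation (w, x, y, z order) and the function names are hypothetical.

```python
import math


def lerp(x0, x1, t):
    """Straight-line interpolation between two 3D translations."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(x0, x1))


def slerp(q0, q1, t):
    """Shortest-arc interpolation between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to take the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(1.0, dot)
    theta = math.acos(dot)
    if theta < 1e-8:
        return q0  # orientations are numerically identical
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def interpolate_frame(frame0, frame1, t):
    """Interpolate one rigid frame and return the translational target.

    Each frame is (translation, quaternion). The returned velocity x1 - x0
    is the constant regression target of a straight-line translation path.
    """
    (x0, q0), (x1, q1) = frame0, frame1
    x_t = lerp(x0, x1, t)
    q_t = slerp(q0, q1, t)
    v_trans = tuple(b - a for a, b in zip(x0, x1))
    return x_t, q_t, v_trans


# Example: move from the origin to (2, 4, 6) while turning 90 degrees about z.
start = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
end = ((2.0, 4.0, 6.0), (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)))
x_t, q_t, v_trans = interpolate_frame(start, end, 0.5)
```

At `t = 0.5` the frame sits halfway along the translation path and has turned 45 degrees about the z axis. A model trained on such paths predicts the motion of each frame as a whole rather than the motion of individual atoms, and swapping `start` and `end` traverses the same path in the opposite direction.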

Empirical Results Show Significant Performance Gains

The empirical validation of RigidSSL demonstrates substantial improvements across multiple benchmarks. In unconditional protein generation, variants of the framework improved the designability of created proteins by up to 43% while also enhancing the novelty and diversity of the outputs. For targeted design tasks, RigidSSL-Perturb improved the success rate in zero-shot motif scaffolding—where a model must build a functional protein around a given structural motif—by 5.8%.

Perhaps most notably for drug discovery, RigidSSL-MD proved highly effective at modeling complex, dynamic proteins. When applied to G protein-coupled receptors (GPCRs)—a crucial family of drug targets—the framework captured more biophysically realistic conformational ensembles than previous approaches, which is vital for understanding how these proteins interact with potential therapeutics.

Why This Matters for Computational Biology

RigidSSL marks a notable step for AI-driven protein engineering: it front-loads a physically grounded understanding of 3D structure and motion rather than leaving geometry to be learned during generative fine-tuning.

  • Solves a Core Modeling Gap: It directly addresses the three stated limitations of current generative models by providing a dedicated pretraining stage for geometry, using global rigid-body representations, and explicitly learning dynamic transitions.
  • Enables More Realistic Design: By learning from molecular dynamics trajectories, the model picks up physically simulated motion, which should make generated proteins more likely to be stable and functional.
  • Accelerates Therapeutic Discovery: Improved modeling of dynamic targets like GPCRs can significantly streamline the early-stage drug discovery pipeline, reducing time and cost.
  • Open-Source Access: The code is publicly available, allowing researchers and developers to build upon this foundational work for a wide range of computational biology and generative AI applications.

The code is hosted at: https://github.com/ZhanghanNi/RigidSSL.git.
