This article was prepared by Alex Gurbych, Ph.D. in AI, CEO at Blackthorn.ai, and a Life Sciences expert. Alex has over 15 years of experience in AI, machine learning, and software engineering. He led AI teams in healthcare & drug discovery, with deep expertise in computer vision, NLP, and data science.
There is a vast number of natural proteins, but potentially many more could be created to address current challenges in biotechnology. The aim of protein engineering is to conceive and produce proteins with defined properties and functions.
Traditional methods of protein engineering:
- Directed evolution: creating a library of randomly mutated proteins and selecting those with desired traits (labor- and time-intensive);
- Rational design: modifications guided by structural and functional data (limited by the quality of and access to structural information).
AI makes this process more efficient as it shifts from trial-and-error to a predictive, data-driven strategy.
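The shift AI brings to directed evolution can be illustrated with a minimal mutate-score-select loop, where a learned fitness predictor replaces wet-lab screening. Everything below is a toy sketch: `surrogate_fitness` (a crude hydrophobicity proxy) stands in for a real ML model, and the loop is a greedy simplification of actual campaigns.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq: str, n_mut: int = 1) -> str:
    """Introduce n_mut random point mutations."""
    s = list(seq)
    for pos in random.sample(range(len(s)), n_mut):
        s[pos] = random.choice(AMINO_ACIDS)
    return "".join(s)

def surrogate_fitness(seq: str) -> float:
    """Placeholder for an ML fitness predictor (here: hydrophobic fraction)."""
    return sum(aa in "AILMFWVY" for aa in seq) / len(seq)

def directed_evolution(seed: str, rounds: int = 5, library: int = 50) -> str:
    """Greedy mutate-score-select loop; the surrogate replaces experimental screening."""
    best = seed
    for _ in range(rounds):
        # keep the current best in the pool so fitness never decreases
        variants = [mutate(best) for _ in range(library)] + [best]
        best = max(variants, key=surrogate_fitness)
    return best
```

In a real campaign the surrogate would be retrained on each round's assay results, which is exactly the experimental burden AI aims to shrink.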

In 2024, this industry had a market size valued at USD 4.35 billion and is predicted to increase from USD 5.09 billion in 2025 to approximately USD 20.86 billion by 2034 (CAGR of 16.97%) (Precedence Research).
In one standout case – AI-Powered Drug Dosage Optimization – we applied advanced protein and biomolecular modeling techniques to fine-tune dosage algorithms. See how we translated AI insights into real pharmacological impact.
Discover the Case
What Is Protein Engineering with AI?
Artificial intelligence can be useful in both protein engineering strategies:
- In directed evolution: AI can propose mutations and predict function from sequence, substantially reducing the number of experimental cycles.
- In rational design: AI can predict structure from sequence at near-experimental accuracy and enables de novo protein design – building proteins from biophysical and biochemical principles alone, without a pre-existing template or natural protein as a reference (Koh et al., 2025).

AI-driven protein design roadmap (Koh et al., 2025)
Let’s look closer at how AI tools can overcome key limitations and achieve better results.
Accuracy in the prediction of protein structure:
- Extract coevolutionary patterns from homologous sequences > improved residue–residue contact maps and backbone geometry;
- Refine structural hypotheses iteratively and combine MSAs, pairwise distances, and 3D coordinates data > atomic-level prediction with experimental quality;
- Protein language models (PLMs) remove the requirement for MSAs > faster, single-sequence structure inference (particularly valuable for orphan genes, rapidly evolving genes, or synthetic sequences);
- Combine neural network outputs with physics-based simulations > improved handling of multi-domain proteins and better generalization to complex topologies;
- Integrate coevolution, structural embeddings, and cross-modal constraints > accurate prediction of large protein complexes, assemblies, and biomolecular interactions with binding affinity estimation;
- Capture alternative conformations and flexible states > realistic representations of proteins in solution (Zhang et al., 2025).
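Several of the advances above build on residue–residue contact maps. As an illustrative sketch (the 8 Å cutoff and the Cα-only simplification are common conventions, not any specific tool's API), a contact map can be derived from predicted coordinates like this:

```python
import numpy as np

def contact_map(coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Residue-residue contact map from C-alpha coordinates (shape N x 3).

    Two residues are 'in contact' if their C-alpha atoms lie within
    `cutoff` angstroms; 8 A is a common convention in contact prediction.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement
    dist = np.sqrt((diff ** 2).sum(-1))              # pairwise distances
    return dist < cutoff                             # boolean contact matrix
```

Coevolution-based methods predict exactly this kind of matrix from sequence alignments; structure predictors then turn it into backbone geometry.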
Expand the diversity of generated proteins:
- Overcome the fixed-topology limitations > identification of novel and diverse fold architectures (including non-natural), and potential functional innovations beyond known PDB entries;
- Can design a set of new protein scaffolds for a family of proteins > avoid brute-force experimental library screening;
- Keeps the core protein fold stable while optimizing loop regions > functionally meaningful diversity without random destabilization (Koh et al., 2025; Zhang et al., 2025).
Accelerate design cycles:
- Use iterative loops: generative models propose novel sequences/folds > predictive tools evaluate foldability and binding precision > experimental results feed back into models;
- GPU-accelerated, memory-efficient inference > fast and cheap predictions at scale;
- Data-driven mutation suggestions and prioritization of candidates with higher chances of success;
- Scoring catalytic efficiency, binding affinity, stability, solubility, and immune response properties > fewer experimental assays (Zhang et al., 2025).
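The prioritization step above can be sketched as a weighted multi-property ranking. The property names, scales, and weights here are illustrative assumptions, not a published scoring scheme:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    seq: str
    stability: float   # assumed scale: higher = more stable
    affinity: float    # assumed scale: higher = tighter binding
    solubility: float  # assumed scale: 0..1 predicted soluble fraction

def priority(c: Candidate, w=(0.4, 0.4, 0.2)) -> float:
    """Weighted in-silico score used to rank candidates before any assay."""
    return w[0] * c.stability + w[1] * c.affinity + w[2] * c.solubility

def shortlist(candidates, k=2):
    """Keep only the top-k candidates for experimental validation."""
    return sorted(candidates, key=priority, reverse=True)[:k]
```

Only the shortlist reaches the bench, which is where the reduction in experimental assays comes from.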
AI-powered protein engineering already has remarkable value in synthetic biology and biotechnology:
- Enzymes that catalyze non-natural reactions: high efficiency and stereoselectivity, stability in concentrated organic solvents (up to 70% ethanol), and thermal resistance above 90 °C;
- In target/epitope specification: compact proteins and peptides with high affinity, stability, and efficacy (researchers engineered miniproteins that neutralize snake venom toxins – 100% survival in affected mice, thermal stability > 95 °C);
- De novo design of enzymes: Kemp eliminase – the top candidate is 60x more active than the initial design; promising shares of generated candidates with desired activity (serine hydrolase – 20%, carbonic anhydrase – 35%, lactate dehydrogenase – 70%);
- Better understanding of complex cell processes: engineered intracellular Ras–GTP activity sensors and proximity-labeling modules, enabled analysis of resistance mechanisms to Ras-G12C inhibitors (Zhang et al., 2025).
AlphaFold3 Overview
Millions of researchers globally have used AlphaFold 2, and its scientific impact has been recognized through many prizes.
In 2024, Google DeepMind and Isomorphic Labs released an improved version – AlphaFold 3. They also launched AlphaFold Server to provide open access to AlphaFold, including a free database of 200 million protein structures (Google).

Architecture of AF3 (Malhotra et al., 2025)
Main advancements:
- GDT up to 90.1;
- A diffusion-based architecture predicts raw atomic coordinates, denoising random noise and capturing detailed structural features (local and global) > better prediction of complex structures;
- 50% more precision compared to leading traditional methods (on the PoseBusters benchmark);
Outperforms:
- physics-based tools in predicting biomolecular structures (even without template structures),
- traditional docking techniques in predicting protein–ligand interactions,
- nucleic-acid-specific predictors in protein–nucleic acid interaction accuracy,
- AF2, Rosetta, I-TASSER, and Phyre2 across key parameters (RMSD, TM-score, pLDDT confidence, and computational time);
- Forecast chemical modifications > deeper understanding of cellular processes and disease connections;
- Only 4 MSA blocks with pair-weighted averaging > saves time;
- Doesn't require excessive specialization for different molecule types;
- Approximates complex molecular interactions with angstrom accuracy;
- Reveals functional insights (predictions correlated strongly with experimental data on protein stability and ligand-binding affinities affected by disease-associated mutations (r = 0.89, p < 0.001));
- Combined with GANs, allows de novo protein design with specific functional properties (a series of artificial enzymes with desired catalytic activity has already been generated) (Abramson et al., 2024; Malhotra et al., 2025).
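To build intuition for the diffusion-based module, here is a toy analogue of diffusion sampling: coordinates start as pure noise and are refined step by step. The oracle `denoise_step` merely interpolates toward a known target; in AF3 that role is played by a learned network conditioned on the model trunk, so this is an analogy, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "true" structure: three C-alpha atoms along an axis
target = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])

def denoise_step(x: np.ndarray) -> np.ndarray:
    """Oracle denoiser: nudges coordinates halfway toward the target.
    AF3 replaces this with a learned, trunk-conditioned denoising network."""
    return x + (target - x) * 0.5

# start from pure noise and iteratively refine, as in diffusion sampling
x = rng.normal(scale=10.0, size=target.shape)
for _ in range(20):
    x = denoise_step(x)

# root-mean-square deviation from the target after refinement
rmsd = np.sqrt(((x - target) ** 2).sum(axis=1).mean())
```

The key property illustrated is that the sampler operates directly on raw atomic coordinates, without a fixed frame representation per molecule type.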

The predicted structure coloured by pLDDT (an estimate of prediction confidence): orange, 0–50; yellow, 50–70; cyan, 70–90; blue, 90–100 (Abramson et al., 2024)
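The colour scheme in the figure maps directly to pLDDT confidence bands; a small helper reproducing it (band boundaries taken from the caption above):

```python
def plddt_band(plddt: float) -> str:
    """Map a pLDDT confidence value to the colour bands used in AF figures."""
    if plddt < 50:
        return "orange"   # very low confidence
    if plddt < 70:
        return "yellow"   # low confidence
    if plddt < 90:
        return "cyan"     # confident
    return "blue"         # very high confidence
```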
Remaining challenges:
- Struggles with predicting how proteins behave dynamically;
- Inaccurate structures in disordered regions, orphan proteins, highly dynamic proteins, and those with significant conformational changes upon ligand binding (for example, some enzymes adopt a closed conformation only when ligand-bound, but AF3 predicts such a conformation for the ligand-free state too);
- Sometimes produces inaccurate chirality, even when it’s correct in provided reference structures (4.4% violation rate);
- Produces overlapping atoms (often in protein–nucleic acid complexes with more than 100 nucleotides / 2,000 residues in total);
- Generating a large number of predictions and ranking them to improve accuracy > increased computational costs;
- Predicts a single structure for a particular sequence (variability can be increased by modifying the MSA and using multiple seeds, but this may not help) (Abramson et al., 2024; Malhotra et al., 2025).
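The "generate many predictions and rank them" strategy mentioned above can be sketched as a seeded sample-and-rank loop. `predict` is a hypothetical stand-in for one seeded inference run, ranked here by a pLDDT-like confidence score:

```python
import random

def predict(seq: str, seed: int) -> dict:
    """Hypothetical stand-in for one seeded structure-prediction run."""
    random.seed(seed)  # each seed yields a different (deterministic) sample
    return {"seed": seed, "plddt": random.uniform(60, 95)}

def best_of_n(seq: str, n: int = 5) -> dict:
    """Sample-and-rank: run several seeds, keep the most confident model."""
    return max((predict(seq, s) for s in range(n)), key=lambda m: m["plddt"])
```

This is also why the strategy is costly: compute scales linearly with the number of seeds, with no guarantee the ensemble covers alternative conformations.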
Get a personalized consultation from Alex Gurbych, AI Solutions Architect and CEO at Blackthorn AI, to explore how next-generation models like OpenFold3, Boltz 2, and AlphaFold 3 can accelerate your protein discovery and design processes.
Book a Consultation
Boltz 2 Overview
Accurately modeling biomolecular interactions is critical, but no existing tool had enabled it properly.
The Boltz team presented Boltz-2, which exhibits strong performance for both structure and affinity prediction. It’s freely available with open access to model weights, inference pipeline, and training code.

Boltz-2 architecture (Passaro et al., 2025)
Main innovations in architecture:
- Mixed-precision (bfloat16) and the trifast-4 kernel for triangle attention reduce runtime and memory use, enabling training with crop sizes up to 768 tokens.
- Boltz-2x has Boltz-steering — an inference-time method that applies physics-based potentials, improves physical plausibility (overcomes steric clashes and incorrect stereochemistry).
- Broader user controllability by integrating:
  - structure prediction method conditioning,
  - template conditioning and steering (integrates related complex structures or multimeric templates without retraining),
  - contact and pocket conditioning (allows specific distance or pocket constraints).
- Specialized PairFormer refinement of protein–ligand contacts with dual-head prediction (one for binding likelihood and the other for continuous affinity) trained on heterogeneous affinity labels (Passaro et al., 2025).
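The dual-head affinity readout can be illustrated with a toy numpy sketch: a shared pair representation is pooled, then one linear head outputs a binding likelihood and another a continuous affinity value. Shapes, pooling, and weights here are assumptions for illustration, not Boltz-2's actual module.

```python
import numpy as np

rng = np.random.default_rng(1)

def dual_head(pair_repr: np.ndarray, w_bin: np.ndarray, w_aff: np.ndarray):
    """Two heads on a shared pair representation (the dual-head idea):
    one predicts binding likelihood, the other a continuous affinity."""
    pooled = pair_repr.mean(axis=(0, 1))            # pool over token pairs
    p_bind = 1.0 / (1.0 + np.exp(-pooled @ w_bin))  # binder / non-binder head
    affinity = pooled @ w_aff                       # continuous (e.g. log IC50) head
    return p_bind, affinity

pair = rng.normal(size=(16, 16, 8))  # toy pair features: 16 tokens, 8 channels
p, a = dual_head(pair, rng.normal(size=8), rng.normal(size=8))
```

Training the two heads jointly on heterogeneous labels (binary hit/no-hit plus continuous affinities) is what lets the model use far more data than either label type alone provides.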
Performance
Boltz-2 outperformed Haiping, GAT, VincDeep, and other methods in binding affinity prediction across 140 complexes. In hit-discovery, it achieves double the average precision of ML and docking baselines, and it has better RMSF and lDDT scores compared to Boltz-1, BioEmu, and AlphaFlow in capturing local protein dynamics (Passaro et al., 2025).

Boltz-2’s performance (Boltz team)
In addition to the above, there are other strengths:
- First AI model to approach the performance of FEP methods in estimating small molecule–protein binding affinity (Pearson of 0.62, comparable to OpenFE), while being 1000x more computationally efficient;
- Data curation and representation learning > overcomes performance/compute time trade-off;
- Training data include experimental and molecular dynamics ensembles; expanded distillation datasets across diverse modalities; enhanced user control > improved binding affinity prediction (Passaro et al., 2025).
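The reported Pearson of 0.62 against FEP-level baselines is easy to reproduce conceptually; a minimal implementation of the Pearson correlation between predicted and experimental affinities:

```python
import numpy as np

def pearson(pred, expt) -> float:
    """Pearson correlation between predicted and experimental affinities."""
    pred = np.asarray(pred, dtype=float)
    expt = np.asarray(expt, dtype=float)
    pc = pred - pred.mean()   # center both series
    ec = expt - expt.mean()
    return float((pc * ec).sum() / np.sqrt((pc ** 2).sum() * (ec ** 2).sum()))
```

Note that Pearson measures linear correlation only; rank metrics (Spearman, or average precision for hit discovery, as the Boltz team also reports) answer a different question.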
Limitations:
- Inefficient molecular dynamics (small dataset, minor architecture tweaks, limited multi-conformation handling);
- Trained on similar data as predecessors;
- Struggles with large complexes and cofactors (ions, water, or multimeric partners); may misplace parts of the ligand or generate chemically unrealistic conformations > requires additional help (a template of the alternate conformation or a refinement step);
- A limited affinity crop may truncate long-range interactions or miss relevant pockets (orthosteric/allosteric);
- A comparatively new tool: performance varies across assays for unknown reasons (possibly structure errors, poor generalization to new protein families, or low robustness to out-of-distribution molecules) > needs further testing (deepmirror; Passaro et al., 2025).
OpenFold3
OpenFold3 (OF3) is an open-source reproduction of AF3 that aims to match its performance across all molecular modalities. The system was trained on more than 300,000 experimental molecular structures and a synthetic database of more than 40 million structures (Naddaf, 2025).
Like other AF3 reproductions, OF3 modifies the reference architecture to achieve stable training. The main changes include revised distance-bin definitions, improved normalization (an extra normalization step, with biases removed throughout the diffusion module), and removal of a redundant MSA block.
Training follows the same multi-stage design as AF3. Since the exact number of training steps wasn't disclosed, the team chose them empirically. They also skipped nucleic-acid self-distillation (no pretrained model was available) and used updated sequence databases for MSA construction.
The currently available preview (the complete model hasn't been published yet) lacks full training documentation and datasets.
The OF3 preview doesn't match AF3 performance across all modalities and has difficulties with ranking accuracy.
The OF3 team outlined the next steps for model improvement:
- training on newer PDB data;
- improving performance on weaker modalities (lDDT < 0.8);
- speeding up inference;
- expanding capabilities beyond structure prediction;
- enhancing training and inference tools (The OpenFold3 Team).
Another important development is the Federated OpenFold3 Initiative. Only about 2% of the protein structures in the public databases that AF3 and OF3 were trained on are paired with drugs. Five pharmaceutical companies will separately train OF3 on the roughly 4,000 to 8,000 protein-drug pairs in their own libraries, and Apheris will then aggregate the results. Even so, the OF3 team warns not to expect dramatic changes in drug discovery, as this is only a starting point (Saey, 2025).
OpenFold3 vs Boltz 2 vs AlphaFold3: Head-to-Head Comparison
| | AlphaFold3 | Boltz-2 | OpenFold3 |
|---|---|---|---|
| Prediction capabilities | – Structure of proteins and protein complexes (with another protein, nucleic acid, ligand/ion); – Post-translational modifications; – High accuracy (especially for large complexes and multimeric assemblies). | – Structure of proteins and protein complexes (with protein, nucleic acid, ligand); – Binding affinity for protein–ligand interactions; – Approaches FEP (while 1000x faster); – High physical validity. | Same as AF3 |
| Training dataset | The exact dataset is unavailable. Reported sources: PDB (all structures up to 2021), Rfam (RNA), JASPAR/SELEX (protein–DNA), VDJdb (TCR–pMHC), IEDB (MHC–peptide epitopes), AlphaFold DB (a monomer distillation of about 5M proteins). | Experimental: all PDB entries released before 2023 (excluding complexes over 7MB or with more than 5,000 residues). Binding affinity: MISATO (11,235 systems), ATLAS (1,284 proteins, 100 frames each), mdCATH (5,270 systems). Distillation data: RNA, protein–DNA/ligand, TCR–pMHC (exact amounts not provided); RNA–ligand (2,500 filtered examples); MHC (up to 100 sequences per allele for class I and up to 200 per allele pair for class II); AlphaFold DB (same as AF3). | 300k experimental molecular structures; >40M synthetic structures; ∼20k protein–drug pairs to be added |
| Memory efficiency (maximum per GPU) | > 5,000 residues | ∼2,400 residues | Not described yet |
| Framework | JAX: – sophisticated automatic differentiation; – beneficial for large-scale tensor operations; – difficult for inexperienced users, less flexible. | PyTorch: – high accessibility, community support, and pre-built modules (can accelerate research); – easier to debug and modify models; – scalability across multiple GPUs; – integration into existing workflows and complex multi-component systems. | |
| Licensing | Creative Commons license: – commercial use prohibited without a license; – some functionality (ligand binding, certain kinds of modifications) isn't fully available in the public version. | Permissive license: – model modification allowed; – open commercial/academic usage; all functions fully available. | |
| Weights and the full training pipeline/data | Under restricted access | Available | Will be available |
Comparison of AF3, Boltz-2, and OpenFold3 (Google DeepMind (GitHub); OpenFold Consortium: OpenFold; Falk Hoffmann: Boltz-2 revolutionises drug discovery; Genophore: OpenFold vs AlphaFold2; NVIDIA: OpenFold/OpenFold2; Abramson et al., 2024; Passaro et al., 2025; The OpenFold3 Team; Naddaf, 2025; Saey, 2025)
Experimental Comparison of the Models
We evaluated the most relevant and available models for multimeric protein structure prediction – Boltz-2, OF3, and AlphaFold Multimer (AFM).
To compare them, 10 structures unseen by these models during training were used. Predictions were performed for oligomers and then evaluated on global structural accuracy (ipTM, TM-score), interface geometry (DockQ, inter-chain iDDT, iRMSD), and inter-chain contacts (Jaccard index, precision, recall).

Aligned AFM (cyan), Boltz-2 (blue), and OF3 (green) predictions against the native structure (grey)
Metrics that were used:
- DockQ – overall docking accuracy (< 0.23 – incorrect; 0.23–0.49 – acceptable; 0.49–0.80 – medium; ≥ 0.80 – high quality) (Basu et al., 2016).
- iRMSD – interface atom deviation (> 4 Å – low quality; 2–4 Å – acceptable; < 2 Å – high quality) (Armougom et al., 2006).
- ipTM – predicted measure of interface similarity (< 0.6 – likely to fail; 0.6–0.8 – could be correct or wrong; > 0.8 – confidence in a high-quality prediction) (EMBL-EBI).
- TM-score – global structure similarity (< 0.2 – random proteins; 0.2–0.5 – likely not the same fold; > 0.5 – roughly the same fold) (Zhang et al., 2005).
- Inter-chain iDDT – per-residue inter-chain local distance difference (< 0.25 – incorrect interface; 0.25–0.5 – low quality; > 0.5 – roughly correct) (Mariani et al., 2013).
- Jaccard index – overlap between predicted and native inter-chain contacts (0 – no shared contacts; 1 – perfect contact overlap).
- Precision – proportion of predicted contacts that are correct (0–1; higher means fewer false-positive contacts).
- Recall – proportion of true native contacts recovered (0–1; higher means more true contacts captured).
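The three contact-overlap metrics reduce to set arithmetic over predicted vs. native inter-chain contact pairs; a minimal sketch:

```python
def contact_metrics(predicted: set, native: set):
    """Jaccard index, precision, and recall over inter-chain contact pairs.

    Each contact is a hashable pair, e.g. (residue_in_chain_A, residue_in_chain_B).
    """
    tp = len(predicted & native)          # contacts both predicted and native
    union = len(predicted | native)
    jaccard = tp / union if union else 1.0
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(native) if native else 0.0
    return jaccard, precision, recall
```

A model can miss many true contacts (low recall) while also hallucinating extra ones (low precision), which is exactly the failure pattern discussed below.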
AlphaFold Multimer


Boltz-2


OpenFold3


All models perform at a high level. In general, however, AFM and OF3 scores were better and less variable, indicating consistency. Boltz-2's variability might be explained by the structure sensitivity reported in the available evaluations of this model.
| Score (mean ± SD) | AFM | Boltz-2 | OF3 | Better model |
|---|---|---|---|---|
| DockQ | 0.57 ± 0.32 | 0.58 ± 0.37 | 0.61 ± 0.35 | Equal |
| iRMSD | 3.54 ± 4.06 | 4.98 ± 7.14 | 3.86 ± 5.13 | AFM |
| ipTM | 0.73 ± 0.20 | 0.67 ± 0.23 | 0.69 ± 0.13 | AFM |
| TM-score | 0.89 ± 0.07 | 0.82 ± 0.17 | 0.90 ± 0.07 | AFM/OF3 |
| Inter-chain iDDT | 0.57 ± 0.32 | 0.54 ± 0.36 | 0.60 ± 0.38 | OF3 |
| Jaccard index | 0.46 ± 0.28 | 0.41 ± 0.34 | 0.51 ± 0.32 | AFM/OF3 |
| Precision | 0.55 ± 0.30 | 0.48 ± 0.34 | 0.57 ± 0.34 | AFM/OF3 |
| Recall | 0.60 ± 0.35 | 0.55 ± 0.38 | 0.64 ± 0.39 | OF3 |
The models have shown high structural accuracy while struggling with interface geometry and recovery of native inter-chain contacts. They not only miss a large number of true contacts but also predict many false positives (especially Boltz-2).
On the other hand, Boltz-2 has significantly fewer clashes (in line with its authors' claim of better physical validity): only 10% of its predictions contained a single clash. Surprisingly, OF3's predictions had none, while 50% of AFM's predictions had 1–13 clashes.
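Clash counting of the kind reported here can be approximated by flagging atom pairs closer than a minimum distance. The flat 2 Å threshold is a simplification for illustration; real validation pipelines use element-specific van der Waals radii.

```python
import numpy as np

def count_clashes(coords: np.ndarray, min_dist: float = 2.0) -> int:
    """Count atom pairs closer than `min_dist` angstroms (a simple clash proxy)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(coords), k=1)   # unique pairs only, no self-pairs
    return int((dist[iu] < min_dist).sum())
```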
Proportion of high-scored predictions
| Score (high-accuracy threshold) | AFM | Boltz-2 | OF3 | Better model |
|---|---|---|---|---|
| DockQ (≥ 0.8) | 30% | 50% | 60% | OF3 |
| iRMSD ( < 2 Å ) | 70% | 70% | 70% | Equal |
| ipTM (≥ 0.8) | 60% | 50% | 20% | AFM |
| TM-score (≥ 0.85) | 90% | 60% | 80% | AFM |
| Inter-chain iDDT (≥ 0.7) | 60% | 40% | 70% | OF3 |
| Jaccard index (≥ 0.5) | 70% | 50% | 70% | AFM/OF3 |
| Precision (≥ 0.5) | 70% | 50% | 70% | AFM/OF3 |
| Recall (≥ 0.5) | 70% | 60% | 70% | AFM/OF3 |
This distribution further confirms that AFM and OF3 are less variable. AFM's share of high DockQ scores might be surprising: its best scores did not reach Boltz-2's level, but Boltz-2 also produced more samples with low DockQ scores, so this metric follows the general pattern. The same applies to OF3's ipTM scores.
To summarise, AFM and OF3 show more stable and better performance across all measured scores, though AFM is inferior to Boltz-2 and OF3 in handling steric clashes. Boltz-2 sometimes achieves higher accuracy than the other models, but this is offset by a larger share of low-quality predictions.
Conclusion
AI has already become an essential part of modern protein engineering. Without such tools for protein structure prediction, the field remains constrained by trial-and-error, inefficiency, and limited precision.
AlphaFold models have become a standard, focusing on accuracy, generalization, and consistency across diverse biomolecules, though they come with more restrictive licensing and less flexible functionality.
Boltz-2 prioritizes speed, users’ control, accessibility, and physical accuracy. It enables prediction of structures and binding affinities, modeling local MD, and achieves FEP-level accuracy.
OF3 is a promising and accessible approach that might reach AF3-level performance and, through the cooperation of pharmaceutical companies, could significantly influence drug discovery. Still, it's too early to evaluate this model, as it hasn't been fully released.
Thus, there is no single answer to which model is best. Users need to decide which one is more useful for their particular task.
Nevertheless, there remain plenty of limitations in performance and adoption (particularly within industry), and future progress depends not only on improving individual models but also on integrating their complementary strengths into more accessible, versatile platforms for broader use.