Boltz-2 Does Not Just Predict Protein Structures. It Predicts Whether Your Drug Will Work.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 21, 2025

AlphaFold solved protein folding. Every lab celebrated. Then drug discovery teams realized the real blocker was never the structure; it was the binding affinity. Knowing where a molecule docks tells you almost nothing about whether it actually works as a drug. Boltz-2 attacks that second problem head-on, and it does so roughly 1,000x faster than the physics-based methods that were previously the only ones accurate enough to rely on.

This newsletter dissects the Boltz family (Boltz-1 and Boltz-2), focusing on the architectural decisions that make affinity prediction possible, the data engineering behind millions of noisy biochemical assay measurements, and why the open MIT license on both models plus training code matters as much as the benchmarks.

What this covers: architecture of Boltz-1 and Boltz-2, the affinity module design, training pipeline decisions, benchmark results on FEP+, CASP16, and MF-PCBA. What this excludes: AlphaFold3 internals beyond comparison, wet-lab validation protocols, and the SynFlowNet generative model details beyond what's architecturally relevant.

What It Actually Does

Boltz-1 (released November 2024, bioRxiv) was the first fully open-source model to match AlphaFold3 accuracy on arbitrary biomolecular complex structure prediction. At the time of writing, the repository has about 4k GitHub stars and 828 forks, and the MIT license covers both the weights and the training code.

Boltz-2 (released June 2025, bioRxiv) goes further. It adds a dedicated affinity module on top of the co-folding trunk, trained on a curated hybrid dataset of over 1.2M continuous binding measurements (Ki, Kd, IC50) from ChEMBL, BindingDB, and PubChem, plus 2M+ binary binder/decoy labels from high-throughput screening (HTS) assays.

The headline result: on the FEP+ 4-target benchmark (CDK2, TYK2, JNK1, P38), Boltz-2 achieves Pearson R = 0.66, approaching state-of-the-art free-energy perturbation (FEP) methods (R = 0.78) while running over 1,000x faster. On the CASP16 affinity challenge (a blind benchmark, 140 protein-ligand pairs), Boltz-2 outperforms every submitted competition entry out of the box, without fine-tuning.

Structure prediction accuracy is also improved. Boltz-2 matches or beats Boltz-1 across modalities, with notable gains on RNA and DNA-protein complexes from expanded distillation datasets, and hits 84.8% success rate (less than 2Å RMSD) on the Polaris-ASAP ligand pose competition without any fine-tuning, beating all top-10 finishers.

The Architecture, Unpacked

Boltz-2 is a four-module system: trunk, denoising, confidence, and affinity. Every module has a specific job and feeds the next. The key design insight is that binding affinity prediction is not a separate problem from structure prediction. It is a downstream readout of the latent representations the structure model learns during co-folding.

The critical path is Trunk → Denoising → Affinity. The trunk runs once; all downstream modules consume its cached representations. The affinity module operates only on protein-ligand and intra-ligand pairs, discarding 80%+ of pair representations to reduce memory by 5x.
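To see where a figure like "80%+" can come from, here is a minimal sketch that counts the pair entries kept by a protein-ligand plus intra-ligand mask. The token counts (300 protein, 30 ligand), the protein-first token ordering, and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def interface_pair_mask(n_protein: int, n_ligand: int) -> np.ndarray:
    """Boolean (N, N) mask keeping protein-ligand and intra-ligand pairs.

    Assumes tokens are ordered protein first, ligand last (illustrative only).
    """
    n = n_protein + n_ligand
    is_ligand = np.zeros(n, dtype=bool)
    is_ligand[n_protein:] = True
    # A pair is kept if at least one of its two tokens belongs to the ligand,
    # which covers protein-ligand pairs (both directions) and intra-ligand pairs.
    return is_ligand[:, None] | is_ligand[None, :]

mask = interface_pair_mask(n_protein=300, n_ligand=30)
kept = mask.mean()
print(f"kept {kept:.1%} of pairs, discarded {1 - kept:.1%}")
```

With these toy sizes the intra-protein block dominates the N x N matrix, so most entries are dropped. The exact fraction depends on the protein-to-ligand token ratio of the complex.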

Key Architecture Decisions

Trunk depth increase: Boltz-2 increases PairFormer layers from 48 (Boltz-1) to 64, trains in bfloat16 with custom trifast kernels for triangular attention, and scales crop size from 512 to 768 tokens. This is the same crop size as AlphaFold3 and is the prerequisite for structural accuracy at larger complexes.

Affinity module isolation: The affinity module's gradients are detached from the trunk. Boltz-2's affinity training does not backpropagate into the structural backbone. This is deliberate: the trunk is trained first on structure, and the affinity module learns to read the latent representations without corrupting them. This also means you can swap affinity heads without retraining the billion-parameter backbone.

Confidence model simplification: Boltz-1 used a full trunk-sized confidence model (48 PairFormer layers, initialized from structure trunk weights) which was expensive. Boltz-2 uses only 8 PairFormer layers, closer to AlphaFold3's 4-layer design, accepting a small accuracy tradeoff for significant speedup. The confidence model splits PAE and PDE logit heads by same-chain vs. cross-chain pairs, which the authors report as beneficial.

Ensemble as reward stabilizer: Two affinity models are trained with different hyperparameters (λ_focal = 0.8 vs. 0.6, 8 vs. 4 PairFormer layers, different training durations). The second model exists specifically to prevent reward hacking when the system is used with SynFlowNet molecular generation. An ensemble that disagrees is a safety net against the generator exploiting a single model's blind spots.

The Code, Annotated

Snippet One: The Affinity Module Forward Pass

# Reconstructed from Algorithm 1 (Appendix B.5) of the Boltz-2 paper as a runnable sketch.
# PairFormerStack is a stand-in so the module executes; it is NOT the real PairFormer.
# num_bins=64 and the tensor shapes noted in forward() are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairFormerStack(nn.Module):
    """Placeholder: per-pair residual layers instead of triangular attention."""
    def __init__(self, pair_dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(pair_dim, pair_dim) for _ in range(num_layers)]
        )

    def forward(self, z, pair_mask):
        for layer in self.layers:
            z = z + torch.relu(layer(z)) * pair_mask[..., None]
        return z

class AffinityModule(nn.Module):
    def __init__(self, pair_dim=128, num_bins=64, num_pairformer_layers=8, hidden=256):
        super().__init__()
        self.num_bins = num_bins
        # Operates ONLY on z_trunk pair representations, not the full trunk
        # This is the core design choice: affinity is a readout of structural latents
        self.norm = nn.LayerNorm(pair_dim)
        self.input_proj = nn.Linear(pair_dim, pair_dim, bias=False)
        self.input_proj_s = nn.Linear(pair_dim, pair_dim, bias=False)
        self.pairwise_conditioner = nn.Linear(num_bins, pair_dim, bias=False)
        self.pairformer = PairFormerStack(pair_dim, num_layers=num_pairformer_layers)

        # Projection applied to the pooled interface vector before the two heads
        self.pool_proj = nn.Linear(pair_dim, pair_dim)
        self.binding_head = nn.Sequential(
            nn.Linear(pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2)   # binary logits
        )
        self.affinity_head = nn.Sequential(
            nn.Linear(pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1)   # continuous log10(IC50) in µM
        )

    def forward(self, z_trunk, s_inputs, distogram, protein_ligand_mask, intra_ligand_mask):
        # Assumed shapes: z_trunk (B, N, N, C); s_inputs (B, N, C);
        # distogram (B, N, N) int64 bin indices; both masks (B, N, N), 1 = keep.
        # z_trunk: pair representations from 5 recycling steps (critical: 5 not 3)
        z = self.input_proj(self.norm(z_trunk))

        # Broadcast sum of projected single representations gives pairwise context
        s = self.input_proj_s(s_inputs)
        z = z + s[:, :, None, :] + s[:, None, :, :]

        # Distogram (discretized pairwise distance from predicted coordinates)
        # feeds structural information into the affinity computation
        z = z + self.pairwise_conditioner(F.one_hot(distogram, self.num_bins).float())

        # ← THIS is the trick: mask to protein-ligand + intra-ligand pairs only.
        # The paper reports that this discards >80% of z_trunk and cuts memory >5x;
        # this sketch only applies the mask and does not realize those savings.
        combined_mask = ((protein_ligand_mask + intra_ligand_mask) > 0).to(z.dtype)
        z = z + self.pairformer(z, pair_mask=combined_mask)

        # Masked mean pool over off-diagonal interface pairs: one vector per batch item.
        # No attention-based pooling. Simple mean is sufficient here.
        identity = torch.eye(z.shape[1], device=z.device, dtype=z.dtype)
        pool_mask = (combined_mask * (1 - identity))[..., None]
        g = (z * pool_mask).sum(dim=(1, 2)) / pool_mask.sum(dim=(1, 2)).clamp(min=1)
        g = torch.relu(self.pool_proj(g))

        binding_likelihood = torch.softmax(self.binding_head(g), dim=-1)
        affinity_value = self.affinity_head(g)  # Output: log10(IC50) in µM

        return binding_likelihood, affinity_value

The affinity module's masked PairFormer is the architectural bet of Boltz-2. By attending only to protein-ligand and intra-ligand pairs, it focuses the entire computation on what matters for binding and achieves 5x memory reduction with no accuracy cost.

Snippet Two: Boltz-1 Kabsch Diffusion Fix

This is the most important and least-discussed algorithmic change in Boltz-1 versus AlphaFold3. AlphaFold3 uses rigid-aligned MSE loss during training (aligning predicted to ground truth before computing loss). Boltz-1's team identified a theoretical failure mode: a model can achieve zero aligned MSE loss but fail at inference time.

# Simplified reconstruction of Kabsch interpolation (Boltz-1 paper Section 3.2)
import numpy as np

def kabsch_align(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Align P onto Q using Kabsch algorithm. Returns rotated P."""
    # Center both point clouds
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    
    # Compute optimal rotation via SVD
    H = P_c.T @ Q_c
    U, S, Vt = np.linalg.svd(H)
    
    # Handle reflection (ensure proper rotation, not reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # exactly +1 (rotation) or -1 (reflection)
    D = np.diag([1, 1, d])     # ← THIS handles chirality flip edge case
    
    R = Vt.T @ D @ U.T
    return (P_c @ R.T) + Q_c.mean(axis=0)

def boltz1_reverse_diffusion_step(
    x_noisy: np.ndarray,          # Current noisy coordinates (randomly rotated)
    x_denoised_pred: np.ndarray,  # Model's denoised prediction (in model frame)
    alpha_t: float,               # Interpolation weight for this timestep
) -> np.ndarray:
    """
    AlphaFold3 would interpolate directly between x_noisy and x_denoised_pred.
    Problem: x_noisy is rotated; x_denoised_pred is in a different frame.
    Direct interpolation yields a structure that is NEITHER noisy NOR denoised.
    
    Boltz-1 fix: Kabsch-align x_denoised_pred to x_noisy BEFORE interpolation.
    This guarantees the interpolated structure stays close to the denoised sample.
    """
    # ← THIS is the trick: align the denoised prediction into the noisy frame
    x_denoised_aligned = kabsch_align(x_denoised_pred, x_noisy)
    
    # Now interpolate in a consistent coordinate frame
    x_next = (1 - alpha_t) * x_noisy + alpha_t * x_denoised_aligned
    
    return x_next

# The failure mode Boltz-1 avoids:
# AF3 model can memorize f(x_noisy) = x_true for ANY rotation of x_noisy
# and still get zero aligned-MSE loss during training.
# At inference, the interpolation between rotated x_noisy and unrotated x_denoised
# produces garbage structures that fall out of distribution.
# Kabsch alignment at each step prevents this accumulation of frame mismatch.

The Kabsch interpolation fix is theoretically necessary but practically small for the final Boltz-1 model, since the model converges to near-projection denoising anyway. Its biggest impact is on smaller models or data-limited training, where overfitting risk is higher.
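The frame-mismatch argument can be checked numerically on a toy case. The sketch below assumes an idealized setup that is not taken from the paper: the "denoised prediction" equals the true structure, and the "noisy" sample is the same structure in a randomly rotated frame with no added noise. It compares how much each interpolation distorts the pairwise distances; all helper names are hypothetical.

```python
import numpy as np

def kabsch_align(P, Q):
    """Rigidly align point cloud P (n, 3) onto Q (n, 3); returns aligned P."""
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P_c.T @ Q_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return P_c @ R.T + Q.mean(axis=0)

def random_rotation(rng):
    """Random proper rotation from the QR decomposition of a Gaussian matrix."""
    Qm, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Qm) < 0:
        Qm[:, 0] *= -1  # flip one axis so that det = +1
    return Qm

def pairwise_distances(x):
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

def max_distance_error(x, ref):
    """Largest deviation between the pairwise-distance matrices of x and ref."""
    return np.abs(pairwise_distances(x) - pairwise_distances(ref)).max()

rng = np.random.default_rng(0)
x_true = rng.normal(size=(20, 3))          # stands in for a perfect denoised prediction
x_noisy = x_true @ random_rotation(rng).T  # same structure, different frame
alpha = 0.5

naive = (1 - alpha) * x_noisy + alpha * x_true
aligned = (1 - alpha) * x_noisy + alpha * kabsch_align(x_true, x_noisy)

naive_err = max_distance_error(naive, x_true)
aligned_err = max_distance_error(aligned, x_true)
print("naive interpolation distance error:  ", naive_err)
print("aligned interpolation distance error:", aligned_err)
```

In this idealized case the aligned interpolation preserves the structure up to floating-point error, while the naive one shrinks it, because averaging a point cloud with a rotated copy of itself is not a rigid motion.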

It In Action: Predicting Binding Affinity for TYK2 Inhibitors

Target: TYK2 kinase (tyrosine kinase 2), a validated drug target for autoimmune diseases. The example draws on the protein-ligand benchmark set from [Hahn et al., 2022].

Input:

# boltz predict input.yaml
version: 1
sequences:
  - protein:
      id: TYK2
      sequence: "MAQAAPL...KVIFQ"   # TYK2 JH2 domain, ~300 AA
  - ligand:
      id: LIG
      smiles: "CC1=CC2=C(C=C1)N=C(N2)NC(=O)C3=CC=C(C=C3)F"  # Example analog
properties:
  - affinity:
      binder: LIG   # chain id of the ligand whose affinity is predicted

Step 1: Structure Prediction (Trunk + Denoising)

boltz predict input.yaml --use_msa_server
# Runtime: ~45 seconds on A100 80GB
# Generates 5 diffusion samples, 200 diffusion steps each
# 5 recycling iterations (affinity module requires 5, not 3)

MSA is fetched from ColabFold server (MMseqs2 search against UniRef30 + ColabFold EnvDB)
Trunk runs once, outputs z_trunk (pair repr.) and s_trunk (single repr.)
Denoising runs 200 steps per sample x 5 samples = 1000 forward passes through the denoising transformer
Attention bias is cached and shared across all 1000 passes (the key compute optimization from Boltz-1 Section 3.4)

Step 2: Ranking (Confidence Module)

The confidence module scores each of the 5 samples with ipTM (interface predicted TM-score). The top-ranked structure is selected as input to the affinity module.

Sample 1: ipTM = 0.81  ← selected
Sample 2: ipTM = 0.79
Sample 3: ipTM = 0.74
Sample 4: ipTM = 0.77
Sample 5: ipTM = 0.72
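The selection step amounts to a sort on ipTM. A minimal sketch, using the five scores from the example listing; the dictionary layout is an assumption for illustration and is not the Boltz output format.

```python
# ipTM scores from the example listing; keys are sample indices (assumed layout).
iptm_by_sample = {1: 0.81, 2: 0.79, 3: 0.74, 4: 0.77, 5: 0.72}

# Rank samples from highest to lowest interface confidence.
ranked = sorted(iptm_by_sample.items(), key=lambda kv: kv[1], reverse=True)
best_sample, best_iptm = ranked[0]
print(f"selected sample {best_sample} (ipTM = {best_iptm:.2f})")
```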

Step 3: Affinity Prediction (Affinity Module)

The affinity module processes only the protein-ligand interface of the selected structure.

{
  "affinity_pred_value": -1.43,      // log10(IC50 in µM) → IC50 ≈ 0.037 µM (≈ 37 nM)
  "affinity_probability_binary": 0.31 // 31% probability of being a true binder
}

Interpretation:

affinity_probability_binary is calibrated for hit discovery: above ~0.5 = likely binder
affinity_pred_value is calibrated for lead optimization: use to rank analogs within a series
These two heads are trained on largely different datasets with different supervision signals. Never conflate them.
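Because affinity_pred_value is a log10 of a concentration in µM, the sign is easy to misread. The helper below is a small sketch of the unit conversion only; the function name and the returned keys are hypothetical and not part of the Boltz output.

```python
def ic50_from_pred(affinity_pred_value: float) -> dict:
    """Convert a log10(IC50 in µM) prediction to common units."""
    ic50_um = 10.0 ** affinity_pred_value
    return {
        "ic50_uM": ic50_um,
        "ic50_nM": ic50_um * 1_000.0,
        # pIC50 = -log10(IC50 in M); 1 µM = 1e-6 M, hence the offset of 6.
        "pIC50": 6.0 - affinity_pred_value,
    }

result = ic50_from_pred(-1.43)
print(result)
```

More negative values mean tighter predicted binding: a value of 0 corresponds to 1 µM, and each unit below that is another factor of ten in potency.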

Speed comparison:

Boltz-2: ~45 seconds per compound on one A100
Physics-based FEP (FEP+): ~8-48 hours of simulation per compound
Speedup: roughly 640x to 3,800x on these wall-clock figures, in line with the reported >1,000x

Accuracy: On a 10-compound Enamine Kinase Library screen validated with absolute FEP (ABFE), all 10 compounds Boltz-2 selected were predicted binders (ΔG < -5.45 kcal/mol) by ABFE. The Boltz-2 screen score correlation with ABFE readout: |R| = 0.74.

Why This Design Works, and What It Trades Away

Why it works:

The structural latent representations learned by the co-folding trunk encode binding site geometry, residue environment, and physical contact information. Using these directly as input to the affinity module is not a shortcut. It is the correct information bottleneck. Traditional docking scores atom positions without understanding the protein's learned representation. Boltz-2's affinity module is reading a compressed, high-quality description of molecular interaction.

Gradient detachment during affinity training preserves the structural representations intact. If you backpropagated affinity loss into the trunk, you would degrade structure quality to squeeze out affinity performance, creating a model that is mediocre at both.

The two-head design (binary binding likelihood + continuous affinity value) addresses a real experimental reality: different stages of drug discovery need different signals. HTS campaigns need binder/non-binder discrimination at scale. Lead optimization needs precise ranking of closely related analogs. Conflating these into one head would satisfy neither use case.

What it trades away:

Accuracy on GPCRs and other "difficult" protein families is limited. FEP also fails here without custom input preparation. This is not unique to Boltz-2, but it means the model cannot yet replace FEP for all targets.

The affinity module does not handle cofactors (ions, water molecules, multimeric binding partners). If the binding depends on a metal ion in the active site, Boltz-2 cannot model that dependency correctly in its current form.

The model requires structural prediction quality as a prerequisite. If the trunk predicts an incorrect binding pocket, the affinity module's outputs are unreliable. There is no way to know this without a separate structural quality check.

Training on mixed assay types (Ki, Kd, IC50) is a conscious tradeoff. These values are related through the Cheng-Prusoff equation but not identical. The model learns to rank within assays more reliably than to predict absolute cross-assay values. This is fine for lead optimization (you are always comparing analogs in the same assay), but limits interpretability of the raw number.
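For reference, the Cheng-Prusoff relation for a competitive enzyme inhibitor is written out below, where [S] is the substrate concentration used in the assay and K_m is the Michaelis constant of the substrate. It is included only to make the dependence on assay conditions explicit; it is not a description of how Boltz-2 normalizes its training labels.

```latex
K_i = \frac{\mathrm{IC}_{50}}{1 + \dfrac{[S]}{K_m}}
```

Since [S]/K_m is never negative, IC50 is always at least as large as K_i, and two assays run at different substrate concentrations can report different IC50 values for the same compound and target.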

Technical Moats

The data engineering problem is the moat, not the model. Boltz-2 trained on 1.2M continuous affinity measurements from ChEMBL and BindingDB, plus over 2M binary labels from PubChem HTS. Getting there required: PAINS filter application, assay-level quality filtering (minimum IQR threshold, minimum unique values, maximum Tanimoto similarity within assay for hit-to-lead), structural quality filtering via ipTM thresholds, synthetic decoy generation with Tanimoto < 0.3 to known binders, and cross-assay normalization to log10(µM). Any team trying to replicate this has to re-solve the data curation problem, which is a months-long effort of domain expertise plus engineering.
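One of the listed filters, keeping only decoys with Tanimoto similarity below 0.3 to known binders, can be sketched in a few lines. This is an illustration under stated assumptions: fingerprints are represented as plain sets of "on" bit indices, the toy molecules are made up, and the function names are hypothetical. A real pipeline would compute fingerprints with a cheminformatics toolkit.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_decoys(candidates: dict, binders: dict, threshold: float = 0.3) -> list:
    """Keep candidates whose similarity to every known binder is below threshold."""
    return [
        name
        for name, fp in candidates.items()
        if all(tanimoto(fp, binder_fp) < threshold for binder_fp in binders.values())
    ]

# Toy fingerprints: sets of "on" bit indices (made up for illustration).
binders = {"binder_1": {1, 2, 3, 4, 5}, "binder_2": {4, 5, 6, 7}}
candidates = {
    "decoy_a": {10, 11, 12, 13},     # shares no bits with any binder
    "decoy_b": {1, 2, 3, 4, 9},      # close analog of binder_1
    "decoy_c": {5, 20, 21, 22, 23},  # one shared bit with each binder
}
kept_decoys = filter_decoys(candidates, binders)
print(kept_decoys)
```

The point of the threshold is to avoid labeling a close analog of a true binder as a negative, which would inject label noise into the binary head.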

Cascade training order cannot be skipped. You cannot train the affinity module without a high-quality structure module producing the z_trunk inputs. The affinity module quality directly inherits from the structural accuracy. This creates a natural sequencing requirement: structure training (88k+ steps across 4 stages) → confidence training → affinity training. Replicating this end-to-end requires large compute budgets (128 A100 GPUs for affinity training alone) and accumulated training time.

The open license is a moat for the community, not against it. MIT license on weights, training code, and data encourages fine-tuning, benchmarking, and integration into existing pipelines. This creates a growing ecosystem around Boltz that makes the model harder to compete against, not because of IP protection, but because of community adoption and downstream tooling (Tenstorrent hardware port, HuggingFace integrations, the 4k GitHub star community signal).

Insights

Insight One: The "AlphaFold3 solved drug discovery" narrative is wrong, and Boltz-2 is the evidence.

AlphaFold3 predicts where a ligand sits. That is necessary but not sufficient for drug discovery. The actual question is: by how much does changing one atom on the ligand change the binding strength? That requires affinity prediction, not structure prediction. Boltz-1 matched AlphaFold3 structure accuracy in November 2024. Seven months later, Boltz-2 shows that structure accuracy alone was not the bottleneck. The community celebrated AlphaFold3 as if the problem was solved. Boltz-2 demonstrates the real problem was a different one entirely, and it had gone largely unaddressed in the open-source world.

Insight Two: The Boltz-2 confidence model downgrade is the correct architectural decision, even though the community will read it as a step backward.

Boltz-1 used a massive confidence model (full trunk-sized, 48 PairFormer layers, initialized from structure trunk weights, with diffusion trajectory activations fed in). It was expensive but accurate. Boltz-2 cuts this to 8 PairFormer layers, a 6x reduction. The naive reading is: they sacrificed confidence accuracy for speed. The correct reading: the large confidence model's marginal accuracy gain was not worth the compute cost at inference. Boltz-2's simpler confidence model still produces ipTM scores reliable enough to rank 5 diffusion samples correctly, which is all the affinity module needs. The right question is always "accurate enough for the downstream task," not "most accurate in isolation."

Takeaway

The Kabsch alignment fix in Boltz-1's reverse diffusion is theoretically necessary but empirically irrelevant for large models. The Boltz-1 paper proves a model can achieve zero aligned-MSE loss during training while failing completely at inference, due to frame mismatch during interpolation. They fix this with Kabsch alignment at each diffusion step. Then, in the ablation section, they note the fully trained Boltz-1 model already denoises close to the Kabsch projection, making the fix redundant at scale. The fix matters for smaller models and data-limited settings, and is a correct theoretical contribution, but the final model would likely perform nearly the same without it. The paper reports it honestly. Most readers won't notice the nuance.

TL;DR For Engineers

Boltz-2 achieves Pearson R = 0.66 on FEP+ affinity benchmark, approaching FEP (R = 0.78), at 1,000x less compute. That is the headline number that matters.
The affinity module attends only to protein-ligand and intra-ligand pairs, discarding all intra-protein interactions. This is both the architectural bet and the 5x memory optimization.
Affinity training is gradient-detached from the trunk. The structural representations are read, not updated, during affinity learning.
Two separate outputs: affinity_probability_binary for hit discovery (binder vs. non-binder), affinity_pred_value for lead optimization (ranking analogs). They are trained on different data with different supervision. Never use one for the other's purpose.
MIT license includes weights AND training code. Boltz-2 is the second fully open model to match proprietary accuracy at commercial scale; the first was Boltz-1.

The FEP Replacement Is Not Coming. The FEP Supplement Already Arrived.

Boltz-2 will not replace free-energy perturbation for targets where FEP works well and you have weeks of simulation budget. What it does is expand the decision surface. You can now screen 460k compounds (Enamine HLL) in hours, rank the top candidates with 1,000x faster estimates, and hand off only the top 10 to FEP validation. That workflow changes drug discovery economics, not because it is more accurate than FEP on any individual pair, but because it makes accuracy-at-scale tractable for the first time. An open-source model that does this, with published training code, is a different kind of contribution than a paper about a closed proprietary system. Boltz-2 is not a claim. It is a usable tool.

References

Boltz-1: Democratizing Biomolecular Interaction Modeling — Wohlwend et al., 2024, bioRxiv
Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction — Passaro, Corso, Wohlwend et al., 2025, bioRxiv
Boltz GitHub Repository — MIT license, weights + training code
AlphaFold3 — Abramson et al., Nature 2024 (primary comparison baseline)
DiffDock — Corso et al., 2022 (precursor diffusion docking approach)
ColabFold — Mirdita et al., Nature Methods 2022 (MSA server used by Boltz)
PoseBusters — Buttenschoen et al., Chemical Science 2024 (physical validity benchmarks)
SynFlowNet — Cretu et al., 2024 (GFlowNet molecular generator coupled to Boltz-2)
FEP+ Benchmark — Ross et al., 2023 (primary affinity benchmark for lead optimization)

Boltz is an open-source family of biomolecular interaction models (Boltz-1 and Boltz-2, MIT license) from MIT CSAIL. The key architectural insight in Boltz-2 is that binding affinity is a downstream readout of the latent representations learned during co-folding, captured by a gradient-detached affinity module that operates only on protein-ligand interface pairs. Boltz-2 achieves Pearson R = 0.66 on the FEP+ benchmark at 1,000x the speed of physics-based FEP, making accurate in silico affinity screening practical for early-stage drug discovery.
