Reading variant pathogenicity out of a protein language model
When a clinical lab finds a single-letter change in a patient's protein, the first question is brutally simple and often unanswerable: does this matter? This study asks whether a general-purpose protein language model — one never trained on a single clinical label — already carries the answer in its internal representation, and whether we can read it out with probes simple enough to trust.
Introduction
A missense variant swaps one amino acid for another at a single position in a protein. Some of these changes are harmless; others break the protein and cause disease. Telling the two apart — variant pathogenicity prediction — is one of the central unsolved problems in clinical genetics. Every year sequencing turns up millions of variants that have never been seen before, and the overwhelming majority are filed away as Variants of Uncertain Significance (VUS): real changes in a real patient's genome that no one can yet call benign or pathogenic. A VUS is a dead end for clinical action.
The bet of this work is that the information needed to resolve many of these variants is already latent in models that learned nothing but protein sequences. Protein language models (PLMs) are trained, like text language models, to fill in masked tokens — except the "tokens" are amino acids and the "sentences" are the hundreds of millions of natural protein sequences produced by evolution. To predict a masked residue well, the model has to implicitly learn what evolution permits at each position: which substitutions proteins tolerate, which they never make, and which would break a fold, a binding site, or a catalytic mechanism. That implicit grammar of "what a working protein looks like" is exactly the signal a pathogenicity predictor needs.
Two models feature in this study:
- ESM2: the previous-generation Evolutionary Scale Model (up to 650M parameters in the variant used here), a transformer trained by masked-language-modelling over UniRef protein sequences. It is our strong open baseline.
- ESM-C 6B ("ESMC"): a newer, 6-billion-parameter model in the same family, with substantially richer per-residue representations. It is the model at the centre of this study.
We freeze ESMC entirely, with no fine-tuning, and train lightweight linear probes on its residue embeddings. The discipline of a linear probe is the whole point: a linear read-out can only succeed if the information is already linearly present in the embedding. If a simple probe works, the credit belongs to the representation, not to a clever classifier. The atlas under the Map tab places every ClinVar variant by its ESMC layer-78 embedding and colours it by the probe's prediction.
Two threads run through the work:
- Pathogenicity prediction. ESMC + a supervised probe (P_PB) is a strong, honestly-validated variant-effect predictor: it beats EVEE and the previous-generation ESM2, and is competitive with the specialised supervised model AlphaMissense, even though it is trained on only ~39k labelled variants.
- Mechanistic interpretation. Beyond a single pathogenicity number, we train a battery of per-residue annotation probes (structure, solvent accessibility, functional sites, topology) and use them to explain mechanistically why a variant is disruptive, and then have GPT-5.5 synthesise that channel-level profile into a readable account. The goal is a per-residue, feature-level story of variant effect, not one opaque score.
Background
The problem: ClinVar and the VUS backlog
ClinVar is the public archive where clinical labs and researchers deposit observed human genetic variants together with their interpreted clinical significance — Pathogenic, Likely pathogenic, Benign, Likely benign, or Uncertain significance (VUS) — along with the evidence behind each call. It is the closest thing the field has to ground truth, and it is the source of every clinical label in this study.
But ClinVar grows by accretion, and the uncertain pile grows fastest. A large fraction of missense entries are VUS: the variant is real and observed in patients, but the evidence is too thin to classify. Each unresolved VUS is a patient and a clinician left without an answer. The practical promise of a good computational predictor is to triage that backlog — to flag the VUS most likely to be pathogenic for follow-up, and to reassure on those most likely benign.
Why this is hard to benchmark honestly
The central hazard is circularity / leakage. Most variants you would test a model on are already in ClinVar, so a model trained on today's labels — or whose pretraining quietly absorbed them — is partly recalling the answer rather than predicting it. A headline AUROC can then measure memorisation, not clinical foresight. Everything in the Methods and Temporal sections below is built to defeat this: homology-aware splits so train and test never share a protein family, and a temporal-cutoff design that borrows labels from the future to grade predictions made on what were, at the time, genuine unknowns.
Data
All labels and features are assembled over 2,014 human proteins with AlphaFold structures and rich UniProt annotation. The headline numbers:
- Clinical labels — ClinVar. ~38,960 Pathogenic/Benign missense variants for supervised training; 62,727 clinically-annotated variants in total (including VUS), and prediction scores extended to ~416,000 variants across the proteome.
- Temporal snapshots. ClinVar frozen at 2021-06, 2023-06, 2024-06, with the former-VUS that were later reclassified used as a leakage-free prospective test set (2,897 / 2,420 / 1,682 resolved variants respectively).
- Structural / biophysical features. Per-residue pLDDT (AlphaFold), relative solvent accessibility and secondary structure (DSSP), BLOSUM62 substitution score.
- Functional annotation. UniProt active/binding/metal sites, disulfides, transmembrane spans, signal peptides, PTMs, domains — the label sources for the annotation-probe battery.
- External baselines. AlphaMissense, EVEE, ESM-1v / ESM2 zero-shot likelihoods, and MSA-based conservation, all evaluated on the identical splits.
Methods & probe architectures
Representations
We extract frozen per-residue embeddings from ESM-C 6B layer 78 (2,560-d per residue, run locally — no API). Layer 78 is not arbitrary: in a per-layer sweep, linear pathogenicity separability rises through the network and peaks in the deep layers around 78, so that is the layer we read every probe off. Each variant is summarised by its mutated-site embedding, and we also test mean and covariance ("cov-pool") pooling over the whole sequence. The WT→mutant delta embedding feeds the disruption analysis.
Probes are deliberately simple:
StandardScaler → LogisticRegression(class_weight="balanced") for classification and
ridge regression for continuous targets. Simplicity is the discipline — a linear read-out can only
succeed if the information is already linearly present in the embedding.
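As a rough illustration of that recipe, the sketch below re-implements the StandardScaler → LogisticRegression(class_weight="balanced") idea in plain Python so that it runs without dependencies. It is not the study's code: the name `fit_linear_probe`, the toy 4-d "embeddings", and the learning-rate, epoch and L2 settings are all assumptions made for the example. In practice one would more likely chain the two scikit-learn estimators in a `Pipeline`.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_linear_probe(X, y, lr=0.2, epochs=300, l2=1e-3):
    """Z-score the features, then fit class-balanced logistic regression
    by batch gradient descent. Returns a function row -> P(label == 1)."""
    n, d = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in X) / n) or 1.0
            for j in range(d)]  # "or 1.0" guards constant columns

    def scale(row):
        return [(row[j] - means[j]) / stds[j] for j in range(d)]

    Z = [scale(r) for r in X]
    # "balanced" weighting: n_samples / (n_classes * class_count).
    counts = {c: y.count(c) for c in set(y)}
    weights = [n / (len(counts) * counts[c]) for c in y]

    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for z, label, s in zip(Z, y, weights):
            err = s * (sigmoid(b + sum(wj * zj for wj, zj in zip(w, z))) - label)
            grad_b += err
            for j in range(d):
                grad_w[j] += err * z[j]
        w = [w[j] - lr * (grad_w[j] / n + l2 * w[j]) for j in range(d)]
        b -= lr * grad_b / n

    def predict_proba(row):
        return sigmoid(b + sum(wj * zj for wj, zj in zip(w, scale(row))))
    return predict_proba

# Toy stand-in for embeddings: 40 "benign" and 10 "pathogenic" rows,
# separated only along the first dimension (synthetic, not study data).
random.seed(0)
benign = [[random.gauss(0, 1) for _ in range(4)] for _ in range(40)]
pathogenic = [[random.gauss(1.5, 1)] + [random.gauss(0, 1) for _ in range(3)]
              for _ in range(10)]
X, y = benign + pathogenic, [0] * 40 + [1] * 10
probe = fit_linear_probe(X, y)
scores = [probe(row) for row in X]
```

The balanced weights up-weight the rarer class so that a 4:1 imbalance does not pull every prediction toward "benign"; everything else is an ordinary linear read-out.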
Two probe families
We train two architecturally distinct kinds of probe, which answer two different questions.
A · Variant pathogenicity probes
These answer the clinical question directly: is this variant pathogenic? The strongest
variant is covariance pooling (logcov): instead of reading a single residue,
we summarise the whole-sequence WT→mutant change as a covariance ("how the embedding's
internal correlations shift") and probe that. Its cross-validated AUROC reaches
~0.94. Because it pools across the sequence, it is a predictor, not an explanation.
The single-site supervised probe we ship in the viewer is P_PB (the score colouring the atlas).
B · Per-residue annotation probes
These don't ask "is it pathogenic" — they decode an interpretable property of a position, so we can later explain why a variant matters. There are two sub-types:
- [1] Wild-type probes. These read a property off a single protein sequence as it is: "Is residue 47 in a helix?" "Is position 112 buried?" "Is this site evolutionarily conserved?" The input is the embedding of the wild-type protein. These build the static, per-residue annotation tracks.
- [2] Disruption / variant probes. These ask what changes when you mutate a residue. We embed the wild-type protein, embed the mutant, and probe the difference vector — the shift in each decoded property (burial, secondary structure, conservation, contacts, site membership…). This decomposition is what turns one pathogenicity number into a channel-by-channel mechanistic profile.
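One way to picture the disruption probes is as a projection of the WT→mutant difference vector onto each probe's weight vector. The sketch below assumes the disruption read-out reuses linear probe weights, in which case the change in a probe's output is exactly w·(e_mut − e_wt) and the bias term cancels. The source does not say whether its disruption probes are built this way, and the channel names and 3-d vectors are hypothetical.

```python
def delta_embedding(wt_vec, mut_vec):
    """Element-wise mutant-minus-wild-type difference at one residue."""
    return [m - w for w, m in zip(wt_vec, mut_vec)]

def channel_shifts(delta, probe_weights):
    """Project the delta onto each linear probe.

    probe_weights maps a channel name to that probe's weight vector.
    For a linear probe f(e) = w.e + b, f(mut) - f(wt) = w.delta.
    """
    return {name: sum(wj * dj for wj, dj in zip(w, delta))
            for name, w in probe_weights.items()}

# Hypothetical 3-d embeddings and two hypothetical channels.
wt = [1.0, 2.0, 0.5]
mut = [1.0, 1.0, 1.5]
probes = {"burial": [0.0, 1.0, 0.0], "helix": [1.0, 0.0, 1.0]}

delta = delta_embedding(wt, mut)          # [0.0, -1.0, 1.0]
profile = channel_shifts(delta, probes)   # {"burial": -1.0, "helix": 1.0}
```

Each entry of `profile` corresponds to one bar in a channel-by-channel profile: the sign says which way the decoded property moves, the magnitude how far.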
Why covariance pooling, and why ESMC
How you pool the per-residue embedding matters as much as which model you use. Covariance pooling beats simple mean pooling, and ESM-C beats the comparable ESM2 at every pooling scheme — the gap widens as the pooling gets cruder, which is exactly the signature of a richer representation.
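To make the difference between the two pooling schemes concrete, here is a minimal sketch on toy matrices (rows are residues, columns are embedding dimensions). It shows plain covariance pooling only. The study applies the idea to the WT→mutant change and also uses a log-covariance variant, neither of which is reproduced here. The function names are illustrative.

```python
def mean_pool(E):
    """Average each embedding dimension over all residues."""
    n, d = len(E), len(E[0])
    return [sum(row[j] for row in E) / n for j in range(d)]

def cov_pool(E):
    """Sample covariance between embedding dimensions, flattened to the
    upper triangle (the matrix is symmetric, so nothing is lost)."""
    n, d = len(E), len(E[0])
    mu = mean_pool(E)
    cov = [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in E) / (n - 1)
            for j in range(d)] for i in range(d)]
    return [cov[i][j] for i in range(d) for j in range(i, d)]

# Two toy "proteins" with identical means but opposite correlations.
E1 = [[1.0, 1.0], [-1.0, -1.0]]
E2 = [[1.0, -1.0], [-1.0, 1.0]]

same_mean = mean_pool(E1) == mean_pool(E2)   # True: mean pooling cannot tell them apart
c1, c2 = cov_pool(E1), cov_pool(E2)          # [2.0, 2.0, 2.0] vs [2.0, -2.0, 2.0]
```

The two inputs are indistinguishable after mean pooling but differ in the off-diagonal covariance term, which is the kind of second-order structure a covariance-pooled probe gets to see.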
Honest evaluation
- Homology-aware splits. MMseqs2 id30 clustering, protein-grouped and UniRef50-cluster cross-validation, so train and test never share a protein family (no homology inflation).
- Baseline comparisons for every score. Every pathogenicity prediction is shown against its baselines (AlphaMissense, EVEE, ESM2, zero-shot likelihoods, MSA, structure-only), and every annotation probe is scored by its lift over the best model-free baseline rather than raw accuracy — the honest measure of what the model adds. Throughout, results are shown as horizontal-histogram comparisons with 95% bootstrap CIs.
- Leakage-free temporal test. The strongest validation (see Temporal Analyses): retrain on what was known at an old cutoff, then predict the then-unknowns.
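The homology-aware split in the first bullet boils down to one constraint: every cluster lands wholly in train or wholly in test. A minimal sketch of such a grouped k-fold follows, assuming cluster IDs have already been computed (for example by MMseqs2 at 30% identity, as above). The greedy balancing heuristic and the cluster labels are assumptions for illustration.

```python
from collections import defaultdict

def group_kfold(cluster_ids, k):
    """Yield (train_idx, test_idx) pairs in which no cluster is split.

    Clusters are assigned greedily, largest first, to whichever fold
    is currently smallest, which keeps fold sizes roughly even.
    """
    members = defaultdict(list)
    for i, cid in enumerate(cluster_ids):
        members[cid].append(i)
    folds = [[] for _ in range(k)]
    for cid in sorted(members, key=lambda c: -len(members[c])):
        min(folds, key=len).extend(members[cid])
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Hypothetical cluster assignment for seven variants.
cluster_ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4"]
splits = list(group_kfold(cluster_ids, k=2))
```

Splitting by index alone would let two variants from the same protein family sit on opposite sides of the split, which is exactly the homology inflation the protocol is meant to rule out.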
Mechanistic explanations with GPT-5.5
The disruption probes produce a per-variant channel profile (which properties shift, and by how much). On its own that is a bar chart; to make it readable we feed the profile to GPT-5.5, which synthesises a short, grounded mechanistic explanation of why the variant is disruptive — strictly constrained to the channels actually present in the data. This is wired live into the viewer; see Mechanistic profiles & GPT-5.5 below.
Pathogenicity prediction
The ESMC supervised probe is a strong variant-effect predictor: on the general benchmark its covariance-pooled form posts the highest AUROC of any method, narrowly ahead of the specialised AlphaMissense (the two 95% CIs overlap), and on the leak-free temporal test it holds up.
Data & method
We train on the ~38,960 ClinVar variants confidently labelled Pathogenic or Benign,
extract their ESMC L78 representations, and fit the simple linear probes described above —
single-site P_PB and whole-sequence covariance pooling. That is a small
training set by deep-learning standards, which makes the result below notable: a frozen model plus
a linear read-out, on ~39k labels, competes with purpose-built supervised variant-effect models.
Validation is gene-holdout cross-validation here, and the stricter leakage-free temporal protocol
in the next section.
Figure 1 · General pathogenicity benchmark
Gene-holdout AUROC on the full ClinVar pathogenic/benign set, on a homology-disjoint test set (95% bootstrap CIs). ESM-C covariance pooling at layer 78 is the top method (0.944), edging AlphaMissense (0.943) and Evo 2 covariance (0.940); the previous-generation ESM2, the zero-shot likelihoods, MSA and structure-only baselines trail clearly.
| Method | AUROC (95% CI) |
|---|---|
| ESM-C cov | 0.9441 [0.9403–0.9479] |
| AlphaMissense | 0.9429 [0.9390–0.9468] |
| Evo 2 cov | 0.9398 [0.9357–0.9439] |
| ESM-C mean | 0.9330 [0.9286–0.9374] |
| Evo 2 mean | 0.9313 [0.9270–0.9356] |
| ESM-C zero-shot | 0.9299 [0.9253–0.9345] |
| ESM-C P_PB | 0.9266 [0.9219–0.9313] |
| ESM2 cov | 0.9218 [0.9171–0.9265] |
| MSA + struct | 0.9092 [0.9042–0.9142] |
| ESM-C full-mean | 0.8958 [0.8903–0.9013] |
| Evo 2 full-mean | 0.8952 [0.8899–0.9005] |
| ESM2 P_PB | 0.8802 [0.8742–0.8862] |
| ESM2 zero-shot | 0.8776 [0.8701–0.8851] |
| ESM2 mean | 0.8692 [0.8630–0.8754] |
| MSA log-odds | 0.8543 [0.8476–0.8610] |
| ESM2 full-mean | 0.8255 [0.8183–0.8327] |
| Structure-only | 0.8086 [0.8012–0.8160] |
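For readers who want to see how an AUROC with a 95% bootstrap interval can be produced, here is a small dependency-free sketch. It assumes a percentile bootstrap over variants; the source does not specify which bootstrap variant or how many resamples were used, so `n_boot`, `alpha`, the seed and the eight toy scores are illustrative only.

```python
import random

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half). Quadratic in size, fine for small examples."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy labels (1 = pathogenic) and scores, not study data.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]
point = auroc(labels, scores)          # 15 of 16 pos/neg pairs ranked correctly -> 0.9375
ci_lo, ci_hi = bootstrap_ci(labels, scores)
```

When two methods' intervals overlap, as the top rows of the table do, the ordering between them should be read as a near-tie rather than a clear win.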
Figure 2 · Which layer carries the signal?
Pathogenicity separability is not uniform across the network. Probing the representation at each depth shows linear P/B separability climbing through ESM-C's 80-layer stack, plateauing through the middle, and then rising to a clear peak in the deep layers around layer 76–78 (AUROC ≈ 0.93, well above the L60 reference) — which is why every probe in this study reads off L78.
The same pattern holds when ESM-C and ESM2 are placed on a common axis of relative depth: both gain most of their pathogenicity signal in their deepest layers, and ESM-C edges ahead of ESM2 at the very top of the stack.
The full leak-free, prospectively-validated version of this benchmark — retraining on what was known at each past ClinVar cutoff and predicting the then-unknowns — is in the Temporal analyses section.
Temporal Analyses
A pathogenicity predictor is only worth trusting if it can call variants it has never seen resolved. The temporal-cutoff experiment is our hardest, leakage-free test of exactly that.
Why a temporal test, and what it validates
The central hazard in benchmarking variant-effect models is circularity: most "test" variants are already in ClinVar, so a probe trained on today's labels — or a model whose pretraining absorbed them — is partly recalling the answer rather than predicting it. A high AUROC then measures memorisation, not clinical foresight.
The temporal-cutoff method removes that hazard by borrowing labels from the future. We freeze ClinVar at an old snapshot T0 ∈ {2021-06, 2023-06, 2024-06}, take the variants that were "Uncertain significance" at T0 but have since been reclassified to Pathogenic or Benign in the current release, and use those resolved former-VUS as a held-out answer key. The probe is retrained from scratch on only the variants known P/B at T0, so every test variant was a genuine unknown at training time.
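The selection logic of that protocol is compact enough to write down. The sketch below assumes two label snapshots stored as plain dictionaries keyed by variant ID; the label strings, the function name and the variant IDs are hypothetical and stand in for whatever the ClinVar parsing step produces.

```python
RESOLVED = {"Pathogenic", "Benign"}

def temporal_split(labels_t0, labels_now):
    """Build the leakage-free train/test sets for one cutoff T0.

    train: variants already Pathogenic/Benign at T0 (labels as of T0).
    test:  variants Uncertain at T0 that are Pathogenic/Benign today,
           graded with today's label.
    """
    train = {v: lab for v, lab in labels_t0.items() if lab in RESOLVED}
    test = {v: labels_now[v] for v, lab in labels_t0.items()
            if lab == "Uncertain" and labels_now.get(v) in RESOLVED}
    return train, test

# Hypothetical snapshots.
labels_t0 = {
    "GENE1:p.A10V": "Pathogenic",
    "GENE1:p.R25C": "Uncertain",
    "GENE2:p.G7D": "Benign",
    "GENE2:p.L40P": "Uncertain",
    "GENE3:p.S3N": "Uncertain",
}
labels_now = {
    "GENE1:p.A10V": "Pathogenic",
    "GENE1:p.R25C": "Pathogenic",   # resolved since T0 -> test set
    "GENE2:p.G7D": "Benign",
    "GENE2:p.L40P": "Uncertain",    # still a VUS -> excluded
    "GENE3:p.S3N": "Benign",        # resolved since T0 -> test set
}
train, test = temporal_split(labels_t0, labels_now)
```

By construction the two sets cannot share a variant, since one requires a resolved label at T0 and the other requires an uncertain one.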
The ranking is stable across all three snapshots and led by ESMC. ESM-C covariance pooling
tops every panel — AUROC 0.932 / 0.924 / 0.939 (2021 / 2023 / 2024) — with ESM-C mean pooling
essentially tied (0.931 / 0.926 / 0.934). Both edge AlphaMissense (0.927 / 0.913 / 0.925)
and Evo 2 covariance (0.926 / 0.915 / 0.936), with ESM2 covariance close behind
(0.912 / 0.915 / 0.925). The supervised single-probe ESM-C P_PB is a notch lower
(0.917 / 0.911 / 0.917); the previous-generation ESM2 P_PB sits at
0.866 / 0.870 / 0.880, and MSA conservation is last (≈0.80). Crucially, the
leakage-controlled probe matches or beats the "leaked" probe trained on today's full labels, so
there is no memorisation inflation, and a structure-only baseline (pLDDT + rSASA)
reaches only ≈0.80–0.81 — ESMC adds a consistent margin of beyond-structure signal on top.
Because the test variants were unlabelled at training time, this is direct evidence that the ESMC probe prospectively predicts future clinical reclassifications — exactly the capability a VUS-resolution tool needs.
Experimental validation (DMS)
ClinVar labels are expert clinical opinions. Deep mutational scanning (DMS) is a different kind of evidence entirely — a direct wet-lab fitness readout for thousands of mutations in a single protein. If the probe reads real biology, it should track DMS fitness on proteins it never trained on.
On leak-free, non-human enzymes (entirely absent from ClinVar training), the ESM-C covariance probe anti-correlates strongly with measured fitness — direct evidence it reads genuine loss-of-function, not human-annotation artefacts. A negative ρ is the correct sign: higher predicted pathogenicity should mean lower experimental fitness.
| DMS assay | n | Spearman ρ | AUC vs LOF |
|---|---|---|---|
| BLAT_ECOLX (β-lactamase) | 4,783 | −0.751 | 0.896 |
| AMIE_PSEAE (amidase) | 6,227 | −0.613 | 0.826 |
| DYR_ECOLI (DHFR) | 2,916 | −0.523 | 0.882 |
| CBS_HUMAN (in-distribution ref.) | 7,217 | −0.347 | 0.702 |
The three held-out non-human assays are the honest test, and the probe does well on all of them;
CBS_HUMAN is shown only as an in-distribution reference. The same pattern is visible whether we use
the bilinear covariance probe or the logcov variant — both clear the
loss-of-function-vs-tolerated bar comfortably.
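The Spearman ρ column above is a rank correlation between predicted pathogenicity and measured fitness. A dependency-free sketch of that statistic follows, using average ranks for ties; the four toy values are illustrative and unrelated to the assays in the table.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Toy case: higher predicted pathogenicity, lower measured fitness.
pathogenicity = [0.9, 0.7, 0.4, 0.1]
fitness = [0.05, 0.3, 0.8, 1.0]
rho = spearman(pathogenicity, fitness)   # perfectly reversed ranks -> -1.0
```

Because only ranks matter, the statistic is indifferent to the different scales on which DMS assays report fitness.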
Per-residue annotation probes
Beyond one pathogenicity number, can the embedding tell us why a variant is disruptive? We train a battery of per-residue annotation probes — one per biological property — and ask, for each, whether the model genuinely adds signal, then confront the severe class imbalance these biological labels carry.
How we decide what is worth probing: lift over baseline
A high raw AUROC is not enough — many "annotations" are recoverable from amino-acid identity or simple chemistry alone. The goal of a lift-over-baseline analysis is to compute the probe's score minus the best model-free baseline trained on the same split. We compare against two baselines:
- [1] One-hot amino acid. A classifier that sees only which of the 20 residues is present. If a property is mostly determined by amino-acid identity, this baseline already captures it — and the probe deserves no credit for re-deriving it.
- [2] Physico-chemistry only. Hand-computed sequence chemistry (Kyte–Doolittle hydropathy, charge, complexity, position, FoldIndex). This asks whether the annotation falls out of basic chemistry rather than the learned representation.
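The lift computation described above is a one-liner once the probe and both baselines have been scored on the same split. The sketch below uses made-up scores purely to show the arithmetic; the dictionary keys and the function name are illustrative.

```python
def lift_over_baseline(probe_score, baseline_scores):
    """Return (lift, name of the strongest baseline).

    lift = probe score minus the best model-free baseline, with all
    scores measured on the same split and with the same metric.
    """
    best = max(baseline_scores, key=baseline_scores.get)
    return probe_score - baseline_scores[best], best

# Illustrative numbers only, not results from the study.
baselines = {"one_hot_aa": 0.30, "physchem": 0.25}
lift, best_baseline = lift_over_baseline(0.80, baselines)
# lift is about +0.50, measured against the one-hot baseline
```

A positive lift means the embedding carries information beyond residue identity and simple chemistry; a lift near zero means the probe is mostly re-deriving what a baseline already knows.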
Lift — not raw score — is the honest number, and it cleanly separates learned structural channels from identity-shortcut ones. The figure below shows AUROC-lift and AUPRC-lift for every head on one common axis (positive = the embedding beats the best baseline):
The same numbers, tabulated, with the per-probe verdict:
ESMC-6B L78, all-residue training, homology-aware CV. Binary heads (IDs prefixed P_): AUROC / AUPRC; regression heads (IDs prefixed R_): R² / Spearman. Lift = probe − best of {one-hot AA, physchem}, on AUPRC (binary) or Spearman (regression).
| Probe | AUROC / R² | AUPRC / ρ | lift | verdict |
|---|---|---|---|---|
| pLDDT (R_PLDDT_WT) | 0.871 | 0.840 | +0.650 | learned ✓ |
| Intrinsic disorder (P_DISORDER) | 0.981 | 0.930 | +0.621 | learned ✓ |
| Conservation (R_CONSERVATION) — new | 0.639 | 0.789 | +0.575 | learned ✓ |
| Contact number (R_CONTACT_NUMBER) — new | 0.807 | 0.902 | +0.573 | learned ✓ |
| Solvent accessibility (R_RSASA_WT) | 0.816 | 0.905 | +0.566 | learned ✓ |
| Buried / exposed (P_RSASA_BIN_WT) | 0.961 | 0.929 | +0.431 | learned ✓ |
| Secondary structure (P_SS3_WT) | 0.823 | 0.783 | +0.414 | learned ✓ |
| Signal peptide (P_SIGNAL_PEP) | 0.995 | 0.983 | +0.834 | learned ✓ |
| Transmembrane (P_TRANSMEM) | 0.982 | 0.809 | +0.711 | learned ✓ |
| Active site (P_ACTIVE_SITE) | 0.982 | 0.778 | +0.743 | learned ✓ (was shortcut on ESM2) |
| Binding site (P_BINDING_SITE) | 0.945 | 0.682 | +0.555 | learned ✓ |
| In domain (P_IN_DOMAIN) | 0.862 | 0.611 | +0.415 | learned ✓ |
| Disulfide (P_DISULFIDE, within-Cys) | 0.924 | 0.459 | +0.383 | learned ✓ |
| Phosphorylation (P_PHOSPHO) | 0.960 | 0.586 | +0.288 | identity-leaning |
| Ubiquitination (P_UBIQUITIN) | 0.958 | 0.345 | +0.230 | identity-leaning |
The channels that earn the embedding — fold confidence, disorder, conservation, packing
geometry, accessibility, secondary structure, transmembrane/signal topology, active & binding
sites — lift +0.4 to +0.85 over identity, exactly the channels a mechanistic disruption profile
should lean on. Two notes: (1) on ESMC, active site is now genuinely learned (+0.74), where on
ESM2 it was an identity shortcut; (2) the residue-restricted PTM heads (phospho, ubiquitin) post high
raw AUROC but small lift — they are useful as rare, specific flags, not as mechanism. The full
~40-head battery is tabulated in docs/maps_report.md. A caveat we hold honestly:
projecting the WT→mutant delta onto these named channels is lossy (it separates P/B at AUROC
≈0.65 versus ≈0.91 for the raw delta probe), so the channel profile is an explanation tool,
not a competing classifier.
Class imbalance across annotation features
Biological site labels are extremely rare — many features have well under 1% positives, which is why AUPRC (not accuracy) is the honest metric and why several heads are "positive-starved". The widget shows the positive prevalence of every binary annotation probe (sort by prevalence or by embedding lift; toggle a log scale to see the rarest features).
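A short sketch shows why AUPRC has to be read against prevalence. It computes average precision (a standard estimator of AUPRC) on a toy label set; a ranker with no signal is expected to score roughly the positive prevalence, so that is the floor to compare against. The ten labels and scores are invented for the example.

```python
def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of each positive."""
    if not any(l == 1 for l in labels):
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy annotation head: 2 positives among 10 residues.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

prevalence = sum(labels) / len(labels)   # 0.2, the chance-level reference
ap = average_precision(labels, scores)   # positives at ranks 1 and 4 -> (1/1 + 2/4) / 2 = 0.75
```

For a feature with well under 1% positives the same logic applies with a much lower floor, which is why even a modest-looking AUPRC can represent a large gain over chance.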
The full probe catalog — and an honest status for each
Below is the complete scope of what we have probed (or scoped to probe) so far, grouped by what kind of biology each channel captures. The status column is the honest verdict: ✓ built & learned (positive lift over the best baseline — worth leaning on in a mechanism profile); ◐ built but identity-leaning / rare (useful as a flag, but its raw score partly reflects amino-acid identity or it is coverage-fragile); ○ planned (designed, not yet trained — the most promising of these, conservation-coevolution and ΔΔG stability, are the clearest levers for future work).
| Probe | What it measures | Label source | Status |
|---|---|---|---|
| Relative solvent accessibility | How exposed vs. buried a residue is — continuous (RSASA) and a buried/exposed binary. | DSSP ACC ÷ per-AA max ASA (Tien 2013). | ✓ learned |
| Secondary structure (SS3/SS8) | Helix / strand / coil class per residue. | DSSP on PDB (SIFTS-mapped) or AlphaFold. | ✓ learned |
| pLDDT (fold confidence) | AlphaFold's per-residue confidence; low pLDDT often flags disorder/flexibility. | AlphaFold DB B-factor column. | ✓ learned |
| Intrinsic disorder | Whether a residue lies in a region with no stable folded structure. | DisProt / MobiDB (gold) + IUPred3 (silver). | ✓ learned |
| Contact number | Count of residues packed within 8 Å — local density / burial. | Cβ–Cβ contacts from PDB/AlphaFold. | ✓ learned |
| Half-sphere exposure / long-range contacts / contact order | Directional exposure and how sequence-distant a residue's contacts are. | Computed from structure coordinates. | ✓ learned |
| Residue depth / packing | Distance to the molecular surface — a finer burial measure than the binary. | MSMS residue depth / coordination number. | ○ planned |
| Backbone torsion (φ/ψ) | Ramachandran dihedral angles defining backbone geometry. | φ,ψ from structure; regress on sin/cos. | ○ planned |
| Probe | What it measures | Label source | Status |
|---|---|---|---|
| Transmembrane | Whether a residue lies in a membrane-spanning segment. | DeepTMHMM, TOPDB, OPM, UniProt TRANSMEM. | ✓ learned |
| TM orientation (in/out) | Which side of the membrane; helix-bundle vs β-barrel. | OPM / TOPDB orientation labels. | ✓ learned |
| Signal peptide | N-terminal cleaved sorting signal targeting secretion. | SignalP 6.0 set; UniProt SIGNAL. | ✓ learned |
| Intramembrane | Membrane-embedded but non-spanning segment. | UniProt INTRAMEM. | ✓ learned |
| Probe | What it measures | Label source | Status |
|---|---|---|---|
| Active site | Residue directly involved in catalysis/function. | UniProt ACT_SITE. | ✓ learned |
| Catalytic (mechanism-curated) | Stricter, mechanism-level catalytic residue. | M-CSA (Catalytic Site Atlas). | ○ planned |
| Metal binding | Residue coordinating a metal ion (Zn, Fe, Ca, Mg…). | BioLiP2 / MetalPDB / UniProt BINDING. | ◐ modest |
| Ligand / binding site | Residue lining a small-molecule pocket. | BioLiP2 (biologically-relevant ligands). | ✓ learned |
| Nucleic-acid binding | Residue contacting DNA/RNA; incl. zinc-finger. | BioLiP2 nucleic entries; DNAproDB. | ✓ learned |
| Protein–protein interface | Residue on an interaction surface. | ΔrSASA(monomer→complex); ScanNet/MaSIF. | ○ planned |
| Disulfide | Whether a cysteine is in a covalent S–S bond. | PDB SSBOND; UniProt DISULFID. | ✓ learned |
| In domain / region / repeat / coiled-coil | Membership in an annotated structured domain or region. | InterProScan (Pfam); UniProt DOMAIN. | ✓ learned |
| PTM — N-glycosylation | N-linked glycosylation sequon (N-X-S/T). | UniProt CARBOHYD; dbPTM. | ✓ learned |
| PTM — phospho / O-glyco / ubiquitin / acetyl / methyl | Residue-restricted modification sites. | PhosphoSitePlus; dbPTM; EPSD. | ◐ identity-leaning |
| Short linear motif (SLiM) | Binding/targeting motif, often in disordered regions. | ELM database instances. | ○ planned |
| Probe | What it measures | Label source | Status |
|---|---|---|---|
| Conservation | How evolutionarily invariant a position is — a top variant-effect signal. | MSA → Rate4Site / entropy / ConSurf. | ✓ learned |
| Coevolution | How strongly a position co-varies with others (3D contact / epistasis). | EVcouplings / CCMpred couplings (APC). | ○ planned |
| Probe | What it measures | Label source | Status |
|---|---|---|---|
| ΔΔG sensitivity | How much, on average, mutating this position destabilises the fold. | Tsuboyama 2023 mega-scale folding stability. | ○ planned |
| Pathogenicity (P_PB) | Pathogenic vs benign — the headline predictor. | ClinVar P/LP vs B/LB (38,960; ≥1★). | ✓ headline |
Evaluation throughout: binary heads report AUROC + AUPRC as lift over the best of {one-hot AA, physico-chemistry}; regression heads report R²/Spearman; all on homology-aware (MMseqs2 id30) splits, with external sanity checks where available (CB513 / NetSurfP for structure, ConSurf-DB for conservation, ProteinGym DMS for pathogenicity). The most promising ○ planned channels — conservation/coevolution and ΔΔG stability — are exactly the axes EVEE and stability predictors exploit, and are the clearest levers to push the mechanistic profile from explanation toward prediction.
Mechanistic profiles & GPT-5.5
The pathogenicity score says whether; the annotation probes say why. Together they turn one opaque number into a per-variant disruption profile — a readable account of which structural and functional channels a mutation breaks.
From delta vector to disruption profile
For any variant, we embed the wild-type and the mutant, take the difference, and ask each disruption probe how much its decoded property shifts: does the residue become more exposed? does the local secondary structure change? does a binding-site or disulfide signal weaken? The result is a channel-by-channel bar chart — the disruption profile you can open for any variant in the Map and Tracks views. We hold one caveat honestly: projecting the WT→mutant delta onto named channels is lossy (it separates P/B at AUROC ≈ 0.65 vs ≈ 0.91 for the raw delta probe), so the profile is an explanation tool, not a competing classifier — and channels that partly read amino-acid identity (active-site, metal) are flagged as such.
GPT-5.5 synthesis
A bar chart of fourteen channels is still work to read. So we hand the structured profile —
the substitution, the P_PB score, and the per-channel shifts — to GPT-5.5, which writes a
short, grounded mechanistic explanation: what the call is, which genuine channels drive it, and the
relevant side-chain chemistry, strictly constrained to the channels present in the data. The
model is instructed never to invent a channel, to lead with conservation and structure when they
move, and to say plainly when the profile is near-silent and the call rests on the holistic
embedding rather than a decomposable mechanism. The API key lives server-side only (in
serve.py's /api/explain endpoint) and is never shipped to the browser.
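The constraints described above can be enforced partly at prompt-construction time, before any model is called. The sketch below builds such a prompt from a structured profile and makes no API call. The function name, the prompt wording, the `min_abs_shift` cut-off for treating a channel as "moved", and the example variant are all hypothetical and are not taken from serve.py.

```python
def build_explain_prompt(substitution, p_pb, channel_shifts, min_abs_shift=0.05):
    """Assemble a prompt that lists only channels that actually moved."""
    moved = {name: shift for name, shift in channel_shifts.items()
             if abs(shift) >= min_abs_shift}
    lines = [
        f"Variant: {substitution}",
        f"P_PB score: {p_pb:.3f}",
        "Write a short mechanistic explanation of this variant.",
        "Use ONLY the channels listed below; never invent a channel.",
        "Lead with conservation and structure when they move.",
    ]
    if moved:
        lines.append("Channel shifts (WT to mutant), largest first:")
        for name, shift in sorted(moved.items(), key=lambda kv: -abs(kv[1])):
            lines.append(f"- {name}: {shift:+.3f}")
    else:
        lines.append("No channel moved appreciably: say plainly that the "
                     "profile is near-silent and the call rests on the "
                     "holistic embedding.")
    return "\n".join(lines)

# Hypothetical profile for a hypothetical substitution.
prompt = build_explain_prompt(
    "R175H", 0.97,
    {"conservation": -0.42, "burial": 0.31, "disulfide": 0.01},
)
silent_prompt = build_explain_prompt("A2G", 0.10, {"burial": 0.0})
```

Filtering the channel list before the model sees it removes one route by which an explanation could cite a channel that carries no signal; the instruction text then only has to cover what remains.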
What this points toward
- More probes — close the conservation gap. The two strongest variant-effect axes, evolutionary conservation and folding stability (ΔΔG), are still missing from the channel battery — exactly the signal EVEE captures. Adding Tier-1 conservation/coevolution probes is the most likely lever to push the mechanistic profile from explanation toward prediction.
- Probe ESMFold2 / structure-aware models. Extend the same probing recipe to structure-prediction representations to test whether explicit structural features carry additional, complementary variant-effect signal.
- Train an SAE on ESMC layer 78. Move from supervised, label-defined probes to unsupervised feature discovery: a sparse autoencoder on the L78 residue stream could surface interpretable, monosemantic features of variant disruption that we did not know to label for — a label-free corroboration of the per-residue mechanism.
References
Models & variant-effect predictors
- EVEE — Explaining Genetic Variants. Goodfire Research. goodfire.ai/research/evee-explaining-genetic-variants — the variant-explanation method used as the evolutionary-model comparison throughout.
- ESM-C (ESM Cambrian) and ESM2. EvolutionaryScale / Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model, Science (2023).
- AlphaMissense. Cheng et al., Accurate proteome-wide missense variant effect prediction with AlphaMissense, Science (2023).
- Evo 2. Arc Institute — genomic foundation model used as the covariance/mean-pool baseline.
- ESM-1v. Meier et al., Language models enable zero-shot prediction of the effects of mutations on protein function, NeurIPS (2021).
Structure & annotation sources
- AlphaFold & AlphaFold DB (pLDDT). Jumper et al., Nature (2021); alphafold.ebi.ac.uk.
- ClinVar. Landrum et al., NAR (2018); ncbi.nlm.nih.gov/clinvar.
- UniProt functional annotation (sites, topology, PTMs, domains). uniprot.org.
- DSSP secondary structure & solvent accessibility. Kabsch & Sander (1983); relative ASA normalised by max-ASA values from Tien et al., PLoS ONE (2013).
- Specialist label sources: DisProt / MobiDB / IUPred3 (disorder); DeepTMHMM, TOPDB, OPM (topology); SignalP 6.0 (signal peptide); M-CSA (catalytic); BioLiP2 / MetalPDB (ligand & metal binding); PhosphoSitePlus / dbPTM / EPSD (PTMs); InterPro / Pfam (domains); ELM (motifs).
Evolution, stability & evaluation
- Conservation / coevolution: Rate4Site & ConSurf (Ashkenazy et al.); EVcouplings / CCMpred for coevolutionary couplings.
- Folding stability (ΔΔG): Tsuboyama et al., Mega-scale experimental analysis of protein folding stability, Nature (2023).
- DMS benchmark: ProteinGym — Notin et al. (2023); proteingym.org.
- Homology-aware splits: MMseqs2 — Steinegger & Söding, Nat. Biotechnol. (2017).
- External structure benchmarks: CB513; NetSurfP-2.0 (Klausen et al., 2019).