In 2026, Ernest Mordret and colleagues at the Institut Pasteur trained three complementary transformer models on 123 million bacterial proteins drawn from 32,000 genomes.
What they uncovered is staggering: bacterial immunity is far larger, more diverse, and more computationally elegant than anyone realized. Fewer than 250 antiphage systems had ever been experimentally validated. The new atlas contains 2.39 million predicted antiphage proteins. Roughly 1.5% of a typical bacterial genome is now understood to be defense infrastructure—three times previous estimates. More than 85% of the predicted protein families had no prior link to immunity whatsoever.
The tension is obvious: we thought we had mapped bacterial defense. We had barely scratched the surface. The payoff is immediate. Language models just turned the bacterial pangenome into an actionable, programmable resource for next-generation synthetic biology.
The Grammar of Immunity — Genes as Words, Operons as Sentences
The team built three models that attack the problem from orthogonal angles. ALBERTDF treats genome neighborhoods as sentences and genes as words, learning “defensiveness” purely from genomic context with zero sequence information. ESMDF fine-tunes the ESM2 protein language model (via LoRA adapters) to classify proteins directly from amino-acid sequence. GeneCLRDF fuses both signals through contrastive learning, teaching the model that a gene’s defensive identity can be inferred from either its sequence or its neighborhood.
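The contrastive-fusion idea behind GeneCLRDF can be sketched abstractly. The snippet below is a minimal, hypothetical illustration (not the authors' code): an InfoNCE-style loss that pulls a gene's sequence embedding and its genomic-context embedding together while pushing apart mismatched pairs. Embedding sizes, the temperature value, and the toy data are all assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def infonce_loss(seq_emb, ctx_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: row i of seq_emb should match
    row i of ctx_emb (the same gene), and no other row."""
    z_seq = l2_normalize(seq_emb)
    z_ctx = l2_normalize(ctx_emb)
    logits = z_seq @ z_ctx.T / temperature           # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # matched pairs sit on the diagonal

rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 16))                       # toy sequence embeddings
ctx_aligned = seq + 0.05 * rng.normal(size=(8, 16))  # context agrees with sequence
ctx_random = rng.normal(size=(8, 16))                # context carries no signal

aligned_loss = infonce_loss(seq, ctx_aligned)
random_loss = infonce_loss(seq, ctx_random)
assert aligned_loss < random_loss  # fused signals are rewarded only when they agree
```

When sequence and neighborhood embeddings encode the same defensive identity, the loss is low; when they are unrelated, it is high, which is the training pressure that lets either modality stand in for the other at inference time.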
The joint embedding hits 99% precision and 92% recall on held-out data—far beyond either model alone. This is not pattern matching. It is learning the syntax of evolved molecular computation.
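As a refresher on what those two numbers measure: precision is the fraction of genes the model flags that are truly defensive, while recall is the fraction of truly defensive genes the model manages to flag. The counts below are hypothetical, chosen only to reproduce the reported ~99% / ~92% figures.

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything flagged, how much was right
    recall = tp / (tp + fn)     # of everything truly defensive, how much was found
    return precision, recall

# Hypothetical held-out counts, not the paper's actual confusion matrix
p, r = precision_recall(tp=920, fp=10, fn=80)
assert round(p, 2) == 0.99 and round(r, 2) == 0.92
```

The asymmetry matters for discovery: 99% precision means the 2.39 million predictions are overwhelmingly trustworthy, while 92% recall means some genuine defense systems still slip through.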
From Prediction to Validation — 12 New Systems, Zero Prior Link
The predictions did not stay in silico. Twelve entirely novel antiphage systems were experimentally validated in both E. coli and Streptomyces albus. Some contain canonical defense domains; others involve proteins with no previously known association to immunity at all.
This is the moment the field has been waiting for: AI no longer just accelerates discovery—it surfaces functional biological code that human annotation pipelines would have missed for decades.
The Scale Changes Everything — 23,000 Candidate Operon Families
Applied across the full dataset, the models surface ~23,000 candidate operon families. The implications for programmable synthetic biology are direct: new nucleases, molecular effectors, and programmable antimicrobial mechanisms are now sitting in a ready-made discovery pipeline. Phage therapy, gene circuits, living materials, and biohybrid systems all just received a massive injection of raw computational substrate.
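What "ready-made discovery pipeline" means in practice is a triage step. The sketch below is a hypothetical illustration of one plausible strategy, not the authors' workflow: filter candidate families by model confidence, then rank by how widespread they are across the pangenome before committing wet-lab effort. All names, scores, and thresholds are invented.

```python
from dataclasses import dataclass

@dataclass
class OperonFamily:
    family_id: str         # hypothetical family identifier
    defense_score: float   # model confidence that the family is antiphage
    n_genomes: int         # how many genomes in the atlas carry the family

def prioritize(families, min_score=0.9, top_k=3):
    """Hypothetical triage: keep high-confidence families, then rank by
    pangenome prevalence so broadly conserved systems are tested first."""
    hits = [f for f in families if f.defense_score >= min_score]
    hits.sort(key=lambda f: f.n_genomes, reverse=True)
    return hits[:top_k]

catalog = [
    OperonFamily("OF-0001", 0.97, 4200),
    OperonFamily("OF-0002", 0.55, 9000),   # widespread but low confidence: dropped
    OperonFamily("OF-0003", 0.93, 150),
    OperonFamily("OF-0004", 0.99, 780),
]
shortlist = prioritize(catalog)
assert [f.family_id for f in shortlist] == ["OF-0001", "OF-0004", "OF-0003"]
```

Other ranking keys are just as defensible (novelty of domains, cloning tractability); the point is that 23,000 families only become actionable once a scoring policy like this turns the atlas into an ordered to-do list.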
Biology Was Always the Original Computer — AI Just Learned to Compile It
Every antiphage system is an evolved program executing detection, decision, and destruction logic inside a living cell. The fact that transformers can now parse that logic at planetary scale proves the thesis we have championed from day one: biology is computation. The pangenome is not data. It is executable code.
The real question is no longer what nature has already written. It is what biological computers we will dare to compile next.
References
- Mordret, E. et al. (2026). Protein and genomic language models uncover the unexplored diversity of bacterial immunity. Science. https://www.science.org/doi/10.1126/science.adv8275
Related: What Is a Biocomputer in 2026? · Programmable Biology: When Cells Become Living Software
Feature image: AI-generated using Grok