Machine learning models for protein fitness prediction are only as good as the data they train on. The standard approach to generating training data is deep mutational scanning with synthetic libraries: design a set of variants, synthesize them, screen them, and use the functional measurements as labeled training data.
This works. But it has a structural limitation that becomes visible at scale: synthetic libraries encode the biases of the designer.
NNK saturation mutagenesis overrepresents certain amino acid substitutions: leucine, serine, and arginine are each encoded by 3 of the 32 NNK codons, while methionine and tryptophan are encoded by only 1 each. Site-directed combinatorial libraries only sample positions the designer chose to diversify. Even “comprehensive” single-site saturation libraries cover only the 19 possible substitutions at each position, missing insertions, deletions, and multi-site epistatic combinations entirely.
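The NNK skew is easy to reproduce directly from the codon table. A minimal sketch (standard genetic code; NNK means any base at the first two codon positions and G or T at the wobble position):

```python
from collections import Counter
from itertools import product

# Standard codon table built via the canonical TCAG ordering trick.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA)}

# NNK: N = A/C/G/T at positions 1-2, K = G/T at position 3 (32 codons total).
nnk_counts = Counter(
    CODON_TABLE[b1 + b2 + b3]
    for b1 in "ACGT" for b2 in "ACGT" for b3 in "GT"
)

print(nnk_counts["L"], nnk_counts["S"], nnk_counts["R"])  # 3 3 3
print(nnk_counts["M"], nnk_counts["W"], nnk_counts["*"])  # 1 1 1
```

A model trained on counts from such a library sees three times as many leucine substitutions as tryptophan substitutions, purely as an artifact of codon degeneracy.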
For a predictive model, these biases mean the training data systematically underrepresents exactly the regions of sequence space where the model’s predictions are most uncertain. The model learns well in the neighborhood of the wild-type sequence and the designer’s hypotheses, but extrapolates poorly to novel combinations.
In vivo mutagenesis offers a fundamentally different data generation strategy.
How in vivo mutagenesis systems work
In vivo mutagenesis introduces mutations directly into DNA inside living cells, bypassing the synthesis-cloning-transformation bottleneck entirely. Several systems are now mature enough for production use.
CRISPR-guided base editors (cytidine or adenine deaminases fused to a catalytically impaired Cas9, typically a nickase or dead Cas9) introduce C-to-T or A-to-G transitions at guide RNA-specified loci. The editing window is typically 4 to 8 nucleotides wide, and multiple guides can be deployed simultaneously to diversify several regions of a gene in parallel. Editing rates of 20% to 60% per cell division are achievable depending on the editor and genomic context.
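The editing window constrains which residues a given guide can diversify. A small sketch of that bookkeeping, assuming the common BE3-style convention of counting positions from the PAM-distal 5' end of a 20-nt protospacer (the function name and default window are illustrative; check your editor's characterized window):

```python
def editable_cytosines(protospacer: str, window=(4, 8)) -> list[int]:
    """Return 1-based positions of cytosines that fall inside the editing
    window of a 20-nt protospacer, counted from the PAM-distal 5' end
    (a common convention for BE3-style cytidine base editors)."""
    lo, hi = window
    return [i for i, base in enumerate(protospacer.upper(), start=1)
            if lo <= i <= hi and base == "C"]

print(editable_cytosines("GACCCTGAACGTTAGCATGG"))  # [4, 5]
```

The same scan over every candidate guide tiling a gene tells you, before any wet-lab work, which codons a base-editing campaign can and cannot reach.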
Error-prone DNA polymerases (such as PolI variants in the EvolvR system or orthogonal DNA polymerase-plasmid pairs) increase the local mutation rate by 10,000 to 100,000 fold relative to background. These systems operate continuously: every cell division introduces new mutations, and the diversity of the population increases over time without any manual intervention.
Retroelement-based systems continuously mutate a target gene by repeatedly copying it through an error-prone reverse-transcription intermediate. Mutation rates of 10^-4 to 10^-3 per base per generation are typical.
The practical consequence is that a single flask culture, grown for 20 to 50 generations, can accumulate millions of unique variants without a single cloning step.
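The "millions of variants from one flask" claim survives a back-of-envelope check. Growing one cell to N cells takes roughly N divisions, each of which mutates the target in expectation. The numbers below are illustrative, and the model deliberately ignores selection and duplicate events, both of which reduce the effective count:

```python
def expected_mutation_events(final_cells: float, mu_per_base: float,
                             target_len: int) -> float:
    """Rough count of mutation events in the target over a whole expansion:
    ~N divisions to reach N cells, each introducing mu * L mutations in
    expectation. Ignores selection and counts duplicate hits separately."""
    return final_cells * mu_per_base * target_len

# A modest culture (1e9 cells), an elevated rate of 1e-4 per base per
# division, and a 1 kb target gene:
events = expected_mutation_events(1e9, 1e-4, 1000)
print(f"{events:.0e}")  # 1e+08 mutation events
```

Even after heavy discounting for duplicates and for lineages lost to selection, the surviving diversity is orders of magnitude beyond what synthesis-and-cloning workflows deliver per unit effort.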
Why stochastic diversity is better for ML training data
The value proposition of in vivo mutagenesis for AI is specific: it generates data that is less biased than synthetic libraries across three dimensions that matter for model generalization.
Substitution uniformity. Error-prone polymerases and base editors do not follow the codon table biases of NNK. The mutation spectrum depends on the enzyme, not on codon degeneracy. This means amino acid substitutions that are rare in NNK libraries (tryptophan, methionine, cysteine) appear at frequencies closer to other residues, reducing the blind spots in the training data.
Multi-site combinations. After 30 to 50 generations of continuous mutagenesis, a significant fraction of the population carries two, three, or more mutations simultaneously. These multi-mutant combinations are the exact data points needed to train models that predict epistasis: interaction effects between mutations at different positions that cannot be inferred from single-mutant measurements alone.
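How large is "a significant fraction"? If mutation counts per lineage are roughly Poisson with mean mu * L * G, the multi-mutant fraction follows in closed form. This is a simplification that ignores selection (which depletes deleterious multi-mutants), and the rate and target length are illustrative:

```python
import math

def multi_mutant_fraction(mu_per_base: float, target_len: int,
                          generations: int) -> float:
    """Fraction of lineages carrying >= 2 mutations, modeling the count per
    lineage as Poisson with mean mu * L * G: 1 - P(0) - P(1)."""
    m = mu_per_base * target_len * generations
    return 1.0 - math.exp(-m) * (1.0 + m)

# 1e-4 per base per generation, 1 kb target:
for g in (30, 40, 50):
    print(g, round(multi_mutant_fraction(1e-4, 1000, g), 2))
# 30 0.8
# 40 0.91
# 50 0.96
```

Under these assumptions the majority of the population is multi-mutant well before generation 50, which is exactly the regime where epistasis-aware models need data.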
Neutral and deleterious variant coverage. Synthetic libraries are often designed around positions expected to be important. In vivo mutagenesis does not make this distinction. It mutates conserved and variable positions alike, generating a balanced representation of functional, neutral, and deleterious variants. Models trained on this distribution learn where the fitness boundaries are, not just where the peaks are.
Practical considerations
In vivo mutagenesis is not a replacement for synthetic libraries in all contexts. The tradeoffs are real.
Loss of positional control. You cannot specify which positions to mutate with the same precision as site-directed mutagenesis. CRISPR base editors offer partial control (guide RNA targeting), but error-prone polymerase systems mutate the entire target region stochastically.
Mutation spectrum bias. Every mutagenesis system has its own spectrum. Base editors are restricted to transition mutations (C-to-T or A-to-G). Error-prone polymerases tend to favor transitions over transversions. No single system covers all 19 possible substitutions at every position uniformly. Combining multiple systems or correcting for known biases computationally is often necessary.
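One simple computational correction is to reweight training examples by the inverse frequency of their substitution type, so that overrepresented transitions do not dominate the loss. A minimal sketch (the substitution labels and helper name are hypothetical, not a specific pipeline's API):

```python
from collections import Counter

def inverse_frequency_weights(substitution_types: list[str]) -> list[float]:
    """Per-example weights proportional to 1 / frequency of each example's
    substitution type, normalized to mean 1 so the overall loss scale is
    unchanged."""
    counts = Counter(substitution_types)
    raw = [1.0 / counts[s] for s in substitution_types]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# Three examples of an overrepresented substitution type and one rare one:
weights = inverse_frequency_weights(["A->V", "A->V", "A->V", "W->R"])
print([round(w, 2) for w in weights])  # [0.67, 0.67, 0.67, 2.0]
```

In practice the frequencies would come from the measured spectrum of the mutagenesis system rather than from the training set itself, but the reweighting mechanics are the same.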
Fitness selection during growth. Because mutagenesis occurs in living cells, deleterious variants are depleted during growth. This is a form of unintentional selection that biases the variant distribution toward functional sequences. For some applications this is useful (it enriches for expressible variants). For others it introduces a confound that must be accounted for in the training data.
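A standard way to account for growth selection is to sequence the population at two timepoints and score each variant by its frequency change relative to wild type, a log-enrichment statistic. A sketch, assuming simple count inputs and a conventional pseudocount (both choices are ours, not prescribed by any particular tool):

```python
import math

def log_enrichment(var_t0: int, var_t1: int,
                   wt_t0: int, wt_t1: int,
                   pseudocount: float = 0.5) -> float:
    """Log ratio of a variant's count change between timepoints, normalized
    to wild type so that bulk population growth cancels out. The pseudocount
    guards against zero counts."""
    variant = math.log((var_t1 + pseudocount) / (var_t0 + pseudocount))
    wildtype = math.log((wt_t1 + pseudocount) / (wt_t0 + pseudocount))
    return variant - wildtype

# A variant that doubles while wild type holds steady scores positive;
# one that halves scores negative.
print(round(log_enrichment(100, 200, 1000, 1000), 2))  # 0.69
print(round(log_enrichment(100, 50, 1000, 1000), 2))   # -0.69
```

The resulting scores can either serve directly as fitness labels or be used to debias the training set when the screen's selection differs from the property the model should predict.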
Sequencing depth requirements. The diversity generated by continuous mutagenesis can reach 10^6 to 10^7 unique variants per experiment. Achieving adequate sequencing coverage requires deep NGS (typically 10x to 50x mean coverage per variant), which is feasible but adds cost relative to smaller synthetic libraries.
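Required read counts follow directly, and under idealized uniform sampling the fraction of variants seen at a minimum depth is Poisson. Real libraries have skewed abundance distributions, so treat this sketch as an optimistic bound:

```python
import math

def fraction_at_depth(n_variants: float, total_reads: float,
                      min_reads: int) -> float:
    """Expected fraction of variants observed at least `min_reads` times,
    assuming reads land uniformly (per-variant depth ~ Poisson with mean
    total_reads / n_variants)."""
    lam = total_reads / n_variants
    p_below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                  for k in range(min_reads))
    return 1.0 - p_below

# 1e7 variants, 1e8 reads -> 10x mean coverage:
print(round(fraction_at_depth(1e7, 1e8, 1), 5))   # 0.99995 seen at least once
print(round(fraction_at_depth(1e7, 1e8, 10), 2))  # 0.54 seen at >= 10x
```

Seeing a variant once is cheap; seeing it often enough for a confident enrichment score is what drives the read budget.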
When to use each approach
The decision is straightforward once you frame library construction as a data generation problem.
Use synthetic libraries when you need precise control over which positions are diversified, when you are testing specific hypotheses about structure-function relationships, or when the target region is small enough that saturation mutagenesis achieves complete coverage (at most about 5 to 6 positions diversified simultaneously).
Use in vivo mutagenesis when the goal is broad coverage of sequence space for model training, when epistatic interactions are important, when you need multi-mutant combinations, or when you want to run continuous evolution experiments that accumulate diversity over many generations.
Use both when the project requires an initial hypothesis-driven screen (synthetic) followed by a broad exploration phase (in vivo) to expand the training data for a second-generation model.
At Ranomics, we treat library generation as a data strategy decision. The molecular biology is a means to an end. The end is a dataset that trains a model capable of predicting variant fitness accurately enough to guide the next round of design.
Related Ranomics services
- Directed evolution: In vivo mutagenesis campaigns generating unbiased ML training data.
- Protein engineering services: Mutagenesis strategy matched to the downstream model and selection.