PlantCAD + GeneCAD
For decades, the field of plant genetics has been rich in sequence data but poor in interpretation. We can now read plant genomes cheaply and at scale, yet we still struggle to predict which nucleotide changes will matter and how they influence agronomically and ecologically important traits such as yield, stress tolerance, and adaptation. Until recently, even the best genetic machine learning models often fell short here, struggling to determine whether a perturbation would be deleterious or even impact phenotype at all.
To close this interpretation gap, we are building AI foundation models for plant genomics and crop improvement. PlantCAD2 is our family of plant DNA foundation models: long-context language models trained across diverse flowering plant genomes to learn both conserved and lineage-specific sequence patterns. By learning the statistical and evolutionary structure of plant DNA directly from sequence, PlantCAD2 gives us a reusable representation for prioritizing potentially deleterious variants and for downstream tasks that connect sequence to molecular function.
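As an illustration of how a DNA language model can be used to prioritize variants, the sketch below applies a log-likelihood ratio, a scoring scheme commonly used with masked language models. It is a minimal sketch under stated assumptions, not a description of how PlantCAD2 itself scores variants: the function names, the input format, and the idea that the model exposes per-base probabilities at a masked position are all hypothetical.

```python
import math

def llr_score(probs, ref, alt, eps=1e-12):
    """Log-likelihood ratio log p(alt) - log p(ref) at one masked position.

    `probs` is a hypothetical dict mapping each base (A, C, G, T) to the
    probability a language model assigns to it given the surrounding context.
    A strongly negative score means the model finds the alternate allele much
    less plausible than the reference, which is often read as a sign of
    constraint at that site.
    """
    if ref not in probs or alt not in probs:
        raise ValueError("ref and alt must both be keys of probs")
    # Clip to eps so a zero probability does not raise a math domain error.
    return math.log(max(probs[alt], eps)) - math.log(max(probs[ref], eps))

def rank_variants(variants):
    """Sort variants from most to least negative score.

    `variants` is a list of (variant_id, probs, ref, alt) tuples.
    """
    scored = [(vid, llr_score(p, r, a)) for vid, p, r, a in variants]
    return sorted(scored, key=lambda item: item[1])

# Made-up probabilities for two sites, used only to show the call pattern.
conserved_site = {"A": 0.94, "C": 0.02, "G": 0.02, "T": 0.02}
variable_site = {"A": 0.30, "C": 0.25, "G": 0.25, "T": 0.20}
ranking = rank_variants([
    ("var_variable", variable_site, "A", "C"),
    ("var_conserved", conserved_site, "A", "C"),
])
```

In this toy input, the variant at the conserved site ranks first because the model assigns the alternate base far less probability than the reference. Any real pipeline would obtain `probs` from the model's output at the masked position.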
GeneCAD builds on PlantCAD2 to tackle a persistent bottleneck: high-quality gene annotation. Many plant genomes are large, repetitive, and frequently polyploid, which makes it hard to map gene boundaries and splice structure consistently. With GeneCAD, we use PlantCAD2’s learned sequence features to infer complete gene models from DNA alone, identifying transcriptional units and splicing structure coherently across the genome. This enables comparative and functional genomics at scale, including more uniform gene catalogs for understudied crops, transcript-aware interpretation of variants (especially those affecting splicing and coding potential), and clearer targets for downstream experiments.
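To make the idea of inferring gene structure from sequence more concrete, the sketch below shows one simple post-processing step: collapsing per-base labels into feature intervals. It assumes, purely for illustration, that a model emits one label per nucleotide; the label names and the function are hypothetical and do not describe GeneCAD's actual decoding.

```python
from itertools import groupby

def labels_to_segments(labels, background="intergenic"):
    """Collapse per-base labels into (label, start, end) intervals.

    Coordinates are 0-based and half-open, so `end` is one past the last
    base of the segment. Runs labeled `background` are skipped.
    """
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label != background:
            segments.append((label, pos, pos + length))
        pos += length
    return segments

# Hypothetical per-base predictions for a 9 bp stretch of sequence.
predicted = (
    ["intergenic"] * 2
    + ["exon"] * 3
    + ["intron"] * 2
    + ["exon"] * 1
    + ["intergenic"] * 1
)
segments = labels_to_segments(predicted)
```

A real annotation pipeline would also need to enforce biologically valid structure, such as consistent reading frames and splice sites, which this sketch does not attempt.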
In parallel, we are extending this sequence-to-structure-to-function pipeline toward condition-aware models that connect variants to the molecular programs they alter. In particular, we are interested in modeling how DNA is transcribed into RNA and then translated into protein, capturing when this happens, in which cells and tissues, and under what environmental conditions. By tying sequence variation to these context-dependent molecular outcomes, we are laying the groundwork for more direct genotype-to-trait modeling.


