Apr 30, 2025 · 8 min read
AI Has Entered the Biology Chat
TranscriptFormer is a new model at the nexus of AI and cell biology, helping scientists explore cross-species biology.

Today’s tools will unlock the treatments of tomorrow. AI experts, biologists and engineers are collaborating to build new technologies that transform how scientists study biology and unlock treatments for human diseases. TranscriptFormer, a new model at the nexus of AI and cell biology, is a step forward in this direction.
Read on to learn more in an interview with two of the builders of TranscriptFormer, computational biologist Sara Simmonds and AI researcher James Pearce.
What is a current challenge in biology that you want to build tools to solve?
Sara Simmonds: We want to understand all of the cells in the human body, at single-cell resolution. A big part of that is understanding the differences between cells, and how they acquire those differences. All cells in the human body come from one cell, the fertilized egg. Then they multiply and mature along different lineages, making everything from brain cells to heart cells to muscle cells.
James Pearce: This is an especially interesting process to understand when you consider that every cell in your body has your full DNA sequence in it — the same set of instructions. But that DNA is not used the same way in each cell — only certain genes are actually expressed to create a brain cell, and a different set is used in a liver cell, for example.
There are some similarities with language. If you imagine each gene like a word, each cell has its own sentence of words that make it unique. We are building technology to decode these sentences — for every cell, in healthy and disease states, across organisms — so we can understand this very fundamental part of biology.
How are you collaborating to solve this challenge?
SS: Our team includes AI researchers, engineers, biologists, data scientists and technical program managers.
JP: AI researchers, like me, work on actually building and training the models. Some of the engineers are on the AI team; others do more central engineering work to support the infrastructure for our tools. Then you have the computational biologists, like Sara, who help with data collection and processing, as well as with evaluating the tasks the model can perform.
SS: We worked closely together, meeting often at the beginning to figure out our approach. We knew we wanted to build a model trained on over a hundred million cells and were thoughtful about which data would be most meaningful for this type of model, pulling a lot of it from CZ CELLxGENE, CZI’s open platform for accessing and analyzing standardized single-cell data. This type of data quantifies whether a given gene is being used, or “expressed,” in an individual cell and, if so, at what level.
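For readers unfamiliar with this kind of data, here is a minimal sketch of the standard cell-by-gene layout, using the open-source anndata library (the AnnData/.h5ad format that CZ CELLxGENE datasets are commonly distributed in). The cells, genes, and counts below are toy values invented for illustration.

```python
# A minimal sketch of standardized single-cell expression data, using the
# open-source anndata library (the AnnData/.h5ad format that CZ CELLxGENE
# datasets are commonly distributed in). All values here are toy examples.
import numpy as np
import pandas as pd
import anndata as ad

# Rows are individual cells, columns are genes; each entry counts how many
# transcripts of that gene were detected in that cell.
counts = np.array([
    [12, 0, 3],   # this cell expresses GENE_A strongly, GENE_B not at all
    [0, 8, 1],
    [5, 5, 0],
], dtype=np.float32)

obs = pd.DataFrame({"cell_type": ["neuron", "hepatocyte", "cardiomyocyte"]},
                   index=["cell_1", "cell_2", "cell_3"])
var = pd.DataFrame(index=["GENE_A", "GENE_B", "GENE_C"])

adata = ad.AnnData(X=counts, obs=obs, var=var)
print(adata)  # AnnData object with n_obs × n_vars = 3 × 3
```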
The model you ended up building, TranscriptFormer, was recently released. What is it?
JP: TranscriptFormer is a powerful AI model to explore how cells vary — between different tissues within an organism, in different states like infection or disease, and even across species. It was trained on over a hundred million cells from many tissues, across 12 different species spanning 1.5 billion years of evolution.
TranscriptFormer distills the massive amount of data it was trained on into a much more concise representation and learns to generalize from it. And because that training data is so diverse and complex, the model can pick up patterns of gene expression that represent different types of cells, or even potentially patterns indicative of disease.
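As a rough, hypothetical illustration of that idea (the `embed_cells` function, shapes, and dimensions below are stand-ins, not TranscriptFormer’s actual interface), a trained model acts as an encoder that compresses each cell’s expression vector over thousands of genes into a short embedding, so that biologically similar cells land near each other in that space.

```python
# Hypothetical sketch: a trained model compresses each cell's expression
# profile into a short embedding vector. `embed_cells` is a stand-in for
# a real encoder, NOT TranscriptFormer's actual API.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, emb_dim = 100, 20_000, 64

# Toy expression matrix: one row per cell, one count per gene.
expression = rng.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32)

def embed_cells(X: np.ndarray) -> np.ndarray:
    """Stand-in encoder: a fixed random projection. A real foundation
    model would learn this mapping from hundreds of millions of cells."""
    projection = rng.normal(size=(X.shape[1], emb_dim)) / np.sqrt(X.shape[1])
    return X @ projection

embeddings = embed_cells(expression)
print(expression.shape, "->", embeddings.shape)  # (100, 20000) -> (100, 64)
```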
SS: It’s a lot of open ground for scientists to explore. Most models are built to be good at one specific task, but this is a generalizable model that should be good at many different things.
Dive deeper: TranscriptFormer — A Generative Cross-Species Cell Atlas Across 1.5 Billion Years of Evolution
What will TranscriptFormer help scientists do that they couldn’t do before?
SS: One of the goals with this type of model is to be able to run experiments “in silico” (in a computer), or to have it generate hypotheses that researchers can then test in the lab. We want to save researchers time and speed up the process of discovery.
JP: Researchers collect new data on individual cells all the time, and spend a decent amount of time trying to figure out what’s in their data. TranscriptFormer can help with that by annotating new datasets: figuring out what types of cells are present, or even flagging cells with signs of infection or other deviations that might indicate an issue.
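One common way embedding models are used for this kind of annotation is label transfer: embed a reference atlas with known cell types and the new cells with the same model, then let each new cell inherit the label of its nearest reference neighbors. The sketch below shows the pattern under stated assumptions; the embeddings are random placeholders standing in for real model outputs.

```python
# Sketch of annotating a new dataset by label transfer on cell embeddings.
# The embeddings below are random placeholders; in practice they would come
# from a model like TranscriptFormer applied to real expression data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
emb_dim = 64

# A reference atlas: embeddings plus expert-curated cell-type labels.
ref_embeddings = rng.normal(size=(500, emb_dim))
ref_labels = rng.choice(["T cell", "B cell", "hepatocyte"], size=500)

# A newly collected, unannotated dataset embedded with the same model.
new_embeddings = rng.normal(size=(40, emb_dim))

# Each new cell inherits the majority label among its nearest reference
# cells; the vote fraction doubles as a rough confidence score.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(ref_embeddings, ref_labels)
predicted = knn.predict(new_embeddings)
confidence = knn.predict_proba(new_embeddings).max(axis=1)
print(list(zip(predicted[:3], confidence[:3])))
```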
SS: I hope it can be useful for engineering cells for cell therapies, where you need to understand exactly how cell fate decisions are made. TranscriptFormer can help identify the key molecular regulators, called transcription factors, that guide those maturation programs in cells.
JP: What’s unique is that it’s the first generative model to translate those patterns of gene expression across species, too. In the future, AI models trained across species could be used to predict whether findings from model organisms like mice will translate to human cells before doing confirmatory experiments in the lab.
How does TranscriptFormer fit into CZI’s goal of building a virtual cell model?
JP: We envision the virtual cell as a system composed of models that interact in a shared universal representation space. TranscriptFormer could be one of those foundational models, providing representations of naturally varying single-cell molecular data. Other models may focus on other modalities, like microscopy, for example. The ability to look at a cross-section of data from multiple angles or modalities will give a more holistic picture of cell biology.
Why is this approach of applying AI to cell biology and building virtual cell models so necessary for understanding biology and, ultimately, human health?
SS: TranscriptFormer was trained on data from 12 different species representing over 1.5 billion years of evolutionary history. Training at that scale can reveal far more of the patterns and complexity in the molecular underpinnings of cells than traditional methods that don’t leverage the compute power of AI.
JP: In language, each word has a definition, but the meaning of that word can change depending on the context of the sentence or paragraph. In biology, the meaning and importance of each gene is really dependent on context, too. Very large transformer-based AI models, like the ones we’re using here, do an amazing job of faithfully modeling data when you have an enormous amount of it to train on.
Understanding the subtle molecular differences between cells, and the potential interactions of the transcription factors that control those differences, can clue us into the biology of diseases. This will allow researchers to ask specific questions about how cells change in disease and get answers grounded in more data at once, because it will be a lot easier to extract insights directly from models you can prompt than to interrogate all of the different datasets out there. As the amount of that data scales up, these models become more and more useful.
Why is this such an interesting time for biology?
SS: There are a lot of different ways to use AI, but I think it will become ubiquitous in science. It’s like asking how we use the Internet, or how you can use a computer — in every way possible! We are just at the beginning of this revolution. AI is going to be very powerful and have a huge impact on science.
JP: Transformer models are able to process enormous amounts of data very efficiently, whereas previous models had bottlenecks that prevented them from doing that. They work extremely well when you have a lot of data — they’re very data hungry.
The data these models are trained on can be broken down into “tokens” — in the language models that most people are familiar with, a token could be a word. In the case of TranscriptFormer, an individual gene from an individual cell is a token. Large language models have been so successful because they had the enormous amount of text on the Internet to pull from for training. But until somewhat recently, we didn’t have enough tokens for biology to leverage these types of transformer models.

The lab technologies that produce the molecular data behind those tokens keep getting better and more efficient. That means in the next few years, the amount of high-quality, complex biological tokens for training could even surpass the amount of high-quality text tokens on the Internet. If you think about how large language models have helped us parse the Internet, and then consider the equivalent impact for biology … that’s why we’re so excited about this work.
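To make the token idea concrete, here is a simplified sketch of how one cell’s expression profile could become a sequence of gene tokens. Rank-ordering expressed genes is one common scheme in gene-expression transformers generally, not necessarily TranscriptFormer’s exact recipe (see the preprint for details), and the gene names and values below are toy examples.

```python
# Simplified illustration: turning one cell's expression profile into a
# token sequence, with one gene = one token. Rank-ordering by expression
# is one common scheme; TranscriptFormer's exact tokenization may differ.
import numpy as np

gene_names = np.array(["GATA1", "SOX2", "ALB", "MYH7", "CD19", "INS"])
expression = np.array([0.0, 9.2, 0.5, 3.1, 0.0, 7.7])  # toy values

# Keep only expressed genes, ordered from highest to lowest expression.
expressed = expression > 0
order = np.argsort(-expression[expressed])
tokens = gene_names[expressed][order]

print(list(tokens))  # ['SOX2', 'INS', 'MYH7', 'ALB'] — this cell's "sentence"
```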
Learn how we’re accelerating the pace of biomedical research.
Watch James describe this work, read the TranscriptFormer preprint, and visit the virtual cells platform.