Google, the US-based technology giant, in collaboration with Yale University, has introduced Cell2Sentence-Scale (C2S-Scale), a family of open-source large language models trained to “read” and “write” biological data at the single-cell level. It transforms cells into sequences of words, opening up new possibilities for biological discovery.
Each cell is described by its gene expression measurements, which amount to thousands of numbers and require specialized tools and models to analyze. With C2S-Scale, the company has tried to turn those numbers into a language that both humans and language models can understand.
C2S-Scale: How does it work?
C2S-Scale transforms each cell’s gene expression profile into a sequence of text, called a “cell sentence,” that consists of a list of the most active genes in that cell, ordered by their expression level. This makes it possible to apply natural language models, such as Google’s Gemini or Gemma models, to single-cell RNA sequencing (scRNA-seq) data.
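The conversion described above can be illustrated with a minimal Python sketch. It is not the C2S-Scale implementation: the function name `cell_to_sentence`, the `top_k` cutoff, the alphabetical tie-breaking, and the toy expression values are all assumptions made for illustration. It only shows the core idea of ranking a cell's genes by expression and writing out the gene names in that order.

```python
def cell_to_sentence(expression, top_k=100):
    """Turn a {gene_name: expression_value} mapping into a "cell sentence".

    Hypothetical helper for illustration only; the real C2S-Scale
    preprocessing may normalize, filter, and truncate differently.
    """
    # Keep only genes that are actually expressed in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Most active genes first; ties are broken alphabetically so the
    # output is deterministic (an assumption, not a documented rule).
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))
    # The sentence is the space-separated gene names of the top_k genes.
    return " ".join(gene for gene, _ in expressed[:top_k])


# Toy expression profile with made-up values.
example_cell = {"CD3E": 12.0, "GAPDH": 40.0, "MS4A1": 0.0, "IL7R": 7.5, "ACTB": 31.0}
print(cell_to_sentence(example_cell, top_k=4))  # GAPDH ACTB CD3E IL7R
```

Note that the sentence keeps only the order of the genes, not their numeric values; the rank order is what the language model reads.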
scRNA-seq measures the gene expression of individual cells, revealing what each cell is doing at a given moment. Representing those measurements as text makes single-cell data more accessible, interpretable, and flexible. And because much of biology, such as gene names, cell types, and experimental metadata, is already expressed in text, LLMs are a natural fit for processing and understanding this information.
C2S-Scale: How does it convey the information?
C2S-Scale models can answer in natural language, drawing on both the cellular data and the biological knowledge they have seen during pre-training. The models can automatically generate biological summaries of scRNA-seq data at different levels of complexity, from describing the cell types of single cells to summarizing entire tissues or experiments. This lets researchers interpret new datasets faster and with greater confidence, even without writing complex code.
This enables conversational analysis, where researchers can interact with their data through natural language in a way that was previously not possible.
Forecasting Cell Behavior
One of the most exciting applications of C2S-Scale is forecasting how a cell will respond to a disruption, such as a drug or a gene knockout. Given a cell's current sentence and a description of the disruption, the model can generate a new sentence representing the expected gene expression changes.
This ability to simulate cellular behavior in silico could accelerate drug discovery and personalized medicine, and help prioritize experiments before they’re performed in the lab. C2S-Scale represents a major step toward creating a realistic “virtual cell.”