Blog
/
LLMs for Single-Cell Data: Testing the C2S Framework on Real Neuro Datasets

Biotech

LLMs for Single-Cell Data: Testing the C2S Framework on Real Neuro Datasets

Single-cell datasets are growing in size and complexity, but existing analytical methods struggle to interpret them at scale. In raw form, each cell is represented as a high-dimensional matrix of gene expression values, not as structured sequences that language models can interpret. LLMs cannot work with raw gene-expression matrices directly – they require structured language

by Andrii Markov

6 min read • Feb 16, 2026 •

32 views

Table of Content

Single-cell datasets are growing in size and complexity, but existing analytical methods struggle to interpret them at scale.

In raw form, each cell is represented as a high-dimensional matrix of gene expression values, not as structured sequences that language models can interpret.

LLMs cannot work with raw gene-expression matrices directly – they require structured language transformation.

*C2S-Scale orders gene names by expression and converts them into natural language “cell sentences”.*

We conducted a structured evaluation of the C2S pipeline on a 10,000-cell ALS motor cortex dataset, measuring reconstruction loss and assessing Gemma-based model performance in supervised cell-type classification.

Raw Single-Cell Data Is Structurally Incompatible with LLMs

Single-cell transcriptomic data is not inherently compatible with language models because it is encoded as high-dimensional numerical matrices rather than structured token sequences.

In its native form, single-cell data is represented as transcript count matrices, where each cell is described by tens of thousands of gene expression values.

In our experimental dataset, after preprocessing and filtering, each of the 10,000 sampled cells was represented across 29,865 genes (originally 35,467 before filtering).

This creates three structural incompatibilities with LLM architectures:

Continuity and scale – Expression values are continuous numerical quantities, not discrete linguistic tokens.
High dimensionality – Each cell is defined across tens of thousands of variables, far exceeding standard token windows.
Non-linear biological structure – Even normalized features often lack linear relationships, complicating both classical ML and deep learning interpretation.

Existing single-cell LLMs (scBERT, GeneCompass, scGPT, scFoundation) attempt to model such data directly but still struggle with:

Batch effects
Dropout artifacts
Sparse matrices
Rare cell populations
Zero-shot generalization

Crucially, discretizing continuous gene expression into tokens introduces quantitative precision loss, which affects the ability to distinguish subtle cell states.

A practical example of structural limitation appears even before applying LLMs.

Manual gating strategies in immunology fail to identify rare subpopulations lacking distinct surface markers.

Similarly, standard ML pipelines struggle when relationships between normalized features are non-linear and context-dependent.

In our dataset:

75,583 total cells were available
10,000 cells were sampled due to infrastructure limits.
After normalization and filtering, dimensionality remained extremely high (29,865 genes per cell).

This dimensionality illustrates the core incompatibility: LLMs are optimized for sequential token prediction.

The primary limitation in applying LLMs to single-cell biology is representation format.

Until numerical expression matrices are transformed into structured linguistic sequences, LLM architectures remain structurally misaligned with transcriptomic data.

C2S Pipeline and Model Evaluation

Transforming gene expression profiles into ranked gene “sentences” enables compatibility with language models. However, this transformation does not eliminate biological complexity and introduces measurable trade-offs that depend on model capacity and infrastructure constraints.

The C2S framework converts normalized transcript count matrices into ordered token sequences by ranking genes according to expression level and concatenating gene names into structured “cell sentences”.

C2S Framework pipeline (Levine et al., 2024)

The pipeline implemented in our evaluation consisted of:

Dataset preprocessing and Scanpy normalization
Log transformation and rare gene filtering
Rank-based gene ordering
Conversion to CSData format
Reconstruction quality assessment
Downstream cell-type prediction using Gemma-based models

After subsampling, 10,000 cells were analyzed from an original dataset of 75,583 primary motor cortex cells (ALS and controls).

Following filtering, each cell remained represented across 29,865 genes.

UMAPs showing data distribution inside the filtered dataset. A – by cell type, B – by patient’s sex, C – by control versus ALS patients.

To evaluate information preservation, we reconstructed expression profiles from generated cell sentences using calculated slope (-0.00684) and intercept (0.084275) parameters.

Reconstruction quality assessment indicated that approximately 80% of expression information was preserved after transformation.

However, two constraints remained:

Reconstruction is not lossless – ranking removes absolute quantitative relationships.
Model capacity matters – smaller models show reduced ability to detect atypical or rare cell populations.

Due to hardware limitations (Gemma-27B requires ~50GB RAM), evaluation was conducted using C2S-Scale-Gemma-2B.

Using the 10,000-cell ALS motor cortex subset:

Data retention after C2S conversion ≈ 80%
Common cell types (oligodendrocytes, glutamatergic neurons, astrocytes) achieved F1 scores > 0.5
Rare or atypical brain cell types showed F1 < 0.5

UMAP comparison between original and reconstructed datasets demonstrated preserved global cluster structure with minor distributional drift.

UMAPs showing a comparison of data distribution inside the original dataset and the reconstructed dataset. A – by cell type, B – by control versus ALS patients.

We truncated each cell sentence to the top 300 expressed genes to reduce computational load while keeping the dominant signal.

After C2S conversion:

~80% of expression structure was retained
Global UMAP clusters remained stable
Minor structural drift was observed

Classification performance showed a clear pattern:

Major cell types (oligodendrocytes, astrocytes, glutamatergic neurons) → F1 > 0.5
Mid-frequency populations → moderate performance
Rare cell types → F1 < 0.5

Errors were concentrated in low-frequency or biologically similar cell populations.

Cell-type prediction performance scoring

Evaluation was performed using C2S-Scale-Gemma-2B due to infrastructure limits (Gemma-27B requires ~50GB RAM).

C2S makes single-cell data compatible with LLMs while preserving overall biological structure.

Trade-offs remain:

Ranking reduces quantitative precision
Smaller models struggle with rare populations
Performance depends on available compute

Dominant transcriptional programs are preserved. Subtle states are more sensitive to compression.

C2S enables practical LLM-based analysis of single-cell data.

On a 10,000-cell ALS motor cortex dataset:

~80% structural retention
Reliable classification of major cell types
Rare populations remain challenging under 2B constraints

The main limitation is model capacity. Under realistic hardware settings, C2S delivers biologically meaningful results. Larger models may improve rare-cell detection but significantly increase computational cost.

Conclusion

The C2S framework addresses a fundamental bottleneck in applying large language models to transcriptomic data: representation format.

Our evaluation confirms that:

Rank-based transformation preserves the majority of global expression structure (~80% retention).
LLM-based classification of dominant cell types is feasible under constrained infrastructure.
Performance on rare populations remains sensitive to model capacity and compression effects.
Scaling to larger models improves potential accuracy but significantly increases computational cost.

C2S should therefore be viewed not as a replacement for established bioinformatics pipelines, but as an interface layer between omics data and generative AI systems. Its strength lies in enabling language-model reasoning over biological data – while classical statistical and systems biology methods remain essential for quantitative rigor.

For biotech teams exploring production-grade integration of LLMs into drug discovery or multi-omics workflows, the key challenge is system design: data preprocessing, validation, evaluation metrics, and infrastructure alignment.

This is where structured implementation matters.

Andrii Markov Head of Scientific AI Systems at Blackthorn AI

At Blackthorn AI, we focus on designing reproducible AI-biotech pipelines

that balance biological fidelity, computational efficiency, and deployment constraints

Turn experimental frameworks into scalable research infrastructure

Written by

Andrii Markov Head of Scientific AI Systems

at Blackthorn AI

Discover More
Related Articles

All Articles

Biotech

Pre-Seed Biotech Demos: Technical Red Flags and Signals

8 min read • Feb 05, 2026

by Gary Martin

Biotech

Proof of Concept for Biotech Startups: How to Build an Investor-Ready PoC Before Fundraising

14 min read • Feb 04, 2026

by Gary Martin

Biotech

AI-Powered Target Identification: How to Identify Disease-Relevant Targets with AI

11 min read • Dec 22, 2025

by Andrii Markov

Biotech Healthcare

2025 AI & Biotech Annual Insight Report: A Curated Digest of arXiv and Leading Scientific Research

13 min read • Dec 15, 2025

by Gary Martin

LLMs for Single-Cell Data: Testing the C2S Framework on Real Neuro Datasets

Table of Content

Raw Single-Cell Data Is Structurally Incompatible with LLMs

C2S Pipeline and Model Evaluation

Conclusion

Interesting? Latest Biotech topics directly to your inbox!

Written by

Discover More Related Articles

Pre-Seed Biotech Demos: Technical Red Flags and Signals

Proof of Concept for Biotech Startups: How to Build an Investor-Ready PoC Before Fundraising

AI-Powered Target Identification: How to Identify Disease-Relevant Targets with AI

2025 AI & Biotech Annual Insight Report: A Curated Digest of arXiv and Leading Scientific Research

Discover More
Related Articles