Paper List

Bioinformatics

STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings

2025-12-04

This paper addresses the core challenge of generalizing protein function prediction to unseen or newly introduced Gene Ontology (GO) terms by overcomi...

Bioinformatics

Incorporating indel channels into average-case analysis of seed-chain-extend

2025-12-04

This paper addresses the core pain point of bridging the theoretical gap for the widely used seed-chain-extend heuristic by providing the first rigoro...

Theoretical Neuroscience

Competition, stability, and functionality in excitatory-inhibitory neural circuits

2025-12-04

This paper addresses the core challenge of extending interpretable energy-based frameworks to biologically realistic asymmetric neural networks, where...

Bioinformatics

Enhancing Clinical Note Generation with ICD-10, Clinical Ontology Knowledge Graphs, and Chain-of-Thought Prompting Using GPT-4

2025-12-04

This paper addresses the core challenge of generating accurate and clinically relevant patient notes from sparse inputs (ICD codes and basic demograph...

Bioinformatics

Learning From Limited Data and Feedback for Cell Culture Process Monitoring: A Comparative Study

2025-12-03

This paper addresses the core challenge of developing accurate real-time bioprocess monitoring soft sensors under severe data constraints: limited his...

Bioinformatics

Cell-cell communication inference and analysis: biological mechanisms, computational approaches, and future opportunities

2025-12-03

This review addresses the critical need for a systematic framework to navigate the rapidly expanding landscape of computational methods for inferring ...

Epidemiology

Generating a Contact Matrix for Aged Care Settings in Australia: an agent-based model study

2025-12-03

This study addresses the critical gap in understanding heterogeneous contact patterns within aged care facilities, where existing population-level con...

Computational Neuroscience

Emergent Spatiotemporal Dynamics in Large-Scale Brain Networks with Next Generation Neural Mass Models

2025-12-03

This work addresses the core challenge of understanding how complex, brain-wide spatiotemporal patterns emerge from the interaction of biophysically d...


Journal: ArXiv Preprint

Publication date: 2026-03-10

Bioinformatics, Computational Biology

Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

Department of Computer Science, University of Tübingen, Tübingen, Germany

Ihor Kendiukhov

30-Second Summary

IN SHORT: This work addresses the core challenge of extracting reusable, interpretable, and high-performance biological algorithms from the opaque internal representations of single-cell foundation models.

Key Innovations

Methodology: Introduces a three-stage pipeline (direct operator export, lightweight adaptor, task readout) to extract standalone algorithms from frozen foundation model weights without target-dataset retraining.
Biology: Discovers a compact (~8-10D) hematopoietic manifold within scGPT's attention geometry, validated with high trustworthiness (0.993) and significant developmental branch structure (e.g., erythroid trajectory ρ=0.768, p=0.0017).
Methodology: Demonstrates multi-stage model compression, reducing the operator from 17.5 MB to 5.9 MB (a single attention head) without statistically significant performance loss and further to a 0.73 MB rank-64 surrogate, and provides mechanistic interpretability via a four-factor core explaining 66.2% of ablation impact.
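The trustworthiness score quoted above (0.993) measures how well an embedding preserves local neighbourhoods. The sketch below is a minimal pure-Python version of the standard Venna-Kaski definition; the source does not state which implementation or neighbourhood size k was used, so the function names, the choice of Euclidean distance, and the toy points are all assumptions made for illustration.

```python
import math


def neighbour_ranks(points):
    """For each point i, map every other point j to its distance rank (1 = nearest)."""
    n = len(points)
    ranks = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        ranks.append({j: r for r, j in enumerate(order, start=1)})
    return ranks


def trustworthiness(original, embedded, k):
    """Venna-Kaski trustworthiness: 1.0 means no false neighbours in the embedding."""
    n = len(original)
    if len(embedded) != n:
        raise ValueError("both spaces must contain the same points")
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    rank_orig = neighbour_ranks(original)
    rank_emb = neighbour_ranks(embedded)
    penalty = 0
    for i in range(n):
        for j, r_emb in rank_emb[i].items():
            # j is among the k nearest in the embedding but not in the original space
            if r_emb <= k and rank_orig[i][j] > k:
                penalty += rank_orig[i][j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Toy data (hypothetical): six points on a line, compared with themselves.
line = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
print(trustworthiness(line, line, k=1))  # 1.0 for an identical embedding
```

A score near 1 means that points that look like neighbours in the low-dimensional manifold were also neighbours in the original space; each false neighbour is penalised by how far away it really was.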

Main Conclusions

The extracted algorithm significantly outperforms established baselines (scVI, Palantir, DPT, etc.) on pseudotime-depth ordering (orientation-independent |ρ|=0.439 vs. 0.331 for next-best; Wilcoxon BH-q≤2.7×10⁻⁷ on all paired comparisons).
It achieves superior performance on key subtype classification (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951) while being 34.5x faster and requiring ~1000x fewer trainable parameters than probing frozen embeddings with a 3-layer MLP.
Mechanistic analysis reveals the algorithm's core is driven by four interpretable factors (T/lymphoid, B/plasma, granulocytic, monocyte/macrophage) explaining 66.2% of ablation impact, linking model internals to explicit biological programs.
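The pseudotime-depth result above is reported as an orientation-independent |ρ|. A minimal sketch of such a score follows, assuming that ρ is Spearman's rank correlation and that "orientation-independent" means the sign is discarded (a pseudotime axis can come out reversed without being less informative). The function names and toy values are hypothetical and are not taken from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def abs_spearman(pseudotime, depth):
    """Spearman correlation on ranks, with the sign dropped."""
    return abs(pearson(average_ranks(pseudotime), average_ranks(depth)))


# Toy values (hypothetical): the second ordering is the first one reversed.
depth = [1, 3, 2, 4]
forward = [0.1, 0.4, 0.2, 0.9]
flipped = [-v for v in forward]
print(abs_spearman(forward, depth), abs_spearman(flipped, depth))
```

Both calls give the same score, which is the point of taking the absolute value: a method is not penalised for placing the root of the trajectory at the opposite end of its axis.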

Research gap: While foundation models for biology are powerful, they remain largely opaque 'black boxes'. The field lacks methods to extract, validate, and reuse the structured biological knowledge they encode as standalone, interpretable, and efficient algorithms.

Abstract: We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT (to our knowledge, the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability). We show that scGPT internally encodes a compact (∼8–10-dimensional) hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel (616 anchors, 564,253 cells) and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel (trustworthiness 0.993, blocked-permutation p=0.0005). To isolate this geometry, we introduce a general three-stage extraction method (direct operator export from frozen attention weights, lightweight learned adaptor, and task-specific readout) that produces a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering (orientation-independent |ρ|=0.439 versus 0.331 for the next-best alternative; Wilcoxon BH-q≤2.7×10⁻⁷ on all paired comparisons) and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP (172k parameters), the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5× faster (∼3.4 versus ∼118 minutes) with ∼1,000× fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head (L2H5; 17.5→5.9 MB) without statistically significant loss, and further to a rank-64 surrogate (0.73 MB).
Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
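The abstract reports "BH-q" values and "BH-significant" endpoints, which presumably refer to Benjamini-Hochberg false-discovery-rate adjustment of the per-comparison p-values. A minimal sketch of that adjustment follows; the function name and the p-values are made up for illustration and are not the paper's results.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical p-values from four paired comparisons.
example_p = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(example_p, benjamini_hochberg(example_p)):
    print(f"p={p:.3f} -> q={q:.3f}")
```

A statement such as "BH-q ≤ 2.7×10⁻⁷ on all paired comparisons" then means that even the largest adjusted value across the set of comparisons stays below that bound.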