Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals
Department of Computer Science, University of Tübingen, Tübingen, Germany
30-Second Summary
IN SHORT: This work addresses the core challenge of extracting reusable, interpretable, and high-performance biological algorithms from the opaque internal representations of single-cell foundation models.
Key Innovations
- Methodology: Introduces a three-stage pipeline (direct operator export, lightweight adaptor, task readout) to extract standalone algorithms from frozen foundation model weights without target-dataset retraining.
- Biology: Discovers a compact (~8-10D) hematopoietic manifold within scGPT's attention geometry, validated with high trustworthiness (0.993) and significant developmental branch structure (e.g., erythroid trajectory ρ=0.768, p=0.0017).
- Methodology: Demonstrates multi-stage model compression, reducing the operator from 17.5 MB to 0.73 MB without statistically significant performance loss, and provides mechanistic interpretability via a four-factor core explaining 66.2% of ablation impact.
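As a rough illustration of why a low-rank surrogate shrinks an exported operator (the abstract mentions a rank-64 surrogate), the sketch below compares the storage of a dense square matrix with that of a rank-r factorization. The source does not state the operator's dimensions or storage format, so the dimension `d`, the float32 assumption, and both function names are hypothetical; the numbers are not meant to reproduce the reported 17.5 MB or 0.73 MB figures.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights stored as 32-bit floats


def dense_bytes(d: int) -> int:
    """Storage for a dense d x d operator."""
    return d * d * BYTES_PER_FLOAT32


def low_rank_bytes(d: int, r: int) -> int:
    """Storage for a rank-r surrogate kept as two factors, (d x r) and (r x d)."""
    return 2 * d * r * BYTES_PER_FLOAT32


if __name__ == "__main__":
    d = 1000  # hypothetical operator dimension, not taken from the paper
    r = 64    # rank quoted in the abstract
    dense = dense_bytes(d)
    low = low_rank_bytes(d, r)
    print(f"dense: {dense} bytes, rank-{r}: {low} bytes, ratio: {dense / low:.2f}x")
```

The factorized form is smaller whenever r is less than d/2, which is why a small rank can cut storage substantially for a large operator.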
Main Conclusions
- The extracted algorithm significantly outperforms established baselines (scVI, Palantir, DPT, etc.) on pseudotime-depth ordering (orientation-independent |ρ|=0.439 vs. 0.331 for next-best; Wilcoxon BH-q≤2.7×10⁻⁷ on all paired comparisons).
- It achieves superior performance on key subtype classification (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951) while being 34.5x faster and requiring ~1000x fewer trainable parameters than probing frozen embeddings with a 3-layer MLP.
- Mechanistic analysis reveals the algorithm's core is driven by four interpretable factors (T/lymphoid, B/plasma, granulocytic, monocyte/macrophage) explaining 66.2% of ablation impact, linking model internals to explicit biological programs.
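The "orientation-independent |ρ|" reported above can be read as the absolute value of a Spearman rank correlation, so that a pseudotime axis that runs backwards is scored the same as one that runs forwards. The sketch below is a minimal standard-library version under that reading; the function names are illustrative and the paper's exact evaluation code is not shown in the source.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def abs_spearman(predicted, reference):
    """Orientation-independent Spearman correlation: |rho| of the ranks."""
    return abs(pearson(average_ranks(predicted), average_ranks(reference)))


if __name__ == "__main__":
    depth = [0, 1, 2, 3, 4]                 # hypothetical reference depth
    pseudotime = [0.9, 0.7, 0.8, 0.2, 0.1]  # hypothetical, mostly reversed
    print(abs_spearman(pseudotime, depth))
```

Taking the absolute value matters because many trajectory methods return an ordering whose direction is arbitrary; without it, a correct but flipped ordering would be penalized.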
Abstract: We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT; to our knowledge, this is the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact (∼8–10-dimensional) hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel (616 anchors, 564,253 cells) and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel (trustworthiness 0.993, blocked-permutation p=0.0005). To isolate this geometry, we introduce a general three-stage extraction method (direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout) that produces a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering (orientation-independent |ρ|=0.439 versus 0.331 for the next-best alternative; Wilcoxon BH-q≤2.7×10⁻⁷ on all paired comparisons) and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP (172k parameters), the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5× faster (∼3.4 versus ∼118 minutes) with ∼1,000× fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head (L2H5; 17.5→5.9 MB) without statistically significant loss, and further to a rank-64 surrogate (0.73 MB).
Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
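The "BH-q" values quoted in the abstract are p-values adjusted with the Benjamini-Hochberg procedure, which controls the false discovery rate across many paired comparisons. The sketch below shows the standard adjustment in plain Python; the input p-values are made up for illustration, and the function name is hypothetical rather than taken from the paper's code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


if __name__ == "__main__":
    # Hypothetical p-values from four paired comparisons.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p={p:.3f} -> q={q:.3f}")
```

A comparison is then called significant when its q-value falls below the chosen false discovery rate, for example 0.05.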