Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals
Department of Computer Science, University of Tübingen, Tübingen, Germany
30-Second Read
IN SHORT: This work addresses the core challenge of extracting reusable, interpretable, and high-performance biological algorithms from the opaque internal representations of single-cell foundation models.
Key Innovations
- Methodology: Introduces a three-stage pipeline (direct operator export, lightweight adaptor, task readout) to extract standalone algorithms from frozen foundation model weights without target-dataset retraining.
- Biology: Discovers a compact (~8-10-dimensional) hematopoietic manifold within scGPT's attention geometry, validated with high trustworthiness (0.993) and significant developmental branch structure (e.g., erythroid trajectory ρ=0.768, p=0.0017).
- Methodology: Demonstrates multi-stage model compression, reducing the operator from 17.5 MB (three pooled attention heads) to 5.9 MB (a single head) without statistically significant performance loss, and further to a 0.73 MB rank-64 surrogate. Also provides mechanistic interpretability via a four-factor core explaining 66.2% of ablation impact.
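The rank-64 surrogate mentioned above can be illustrated with a truncated SVD, which gives the best rank-k approximation of a matrix in the Frobenius norm. This is a minimal sketch under stated assumptions: the source does not say how the surrogate was computed, so truncated SVD is swapped in as a standard choice, and the 512×512 matrix, its decaying spectrum, and the helper names `low_rank_surrogate` and `storage_mb` are hypothetical.

```python
import numpy as np

def low_rank_surrogate(W, rank):
    """Return factors (A, B) such that A @ B is the best rank-`rank`
    approximation of W in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # shape (n, rank): left vectors scaled by singular values
    B = Vt[:rank, :]             # shape (rank, m)
    return A, B

def storage_mb(*arrays, bytes_per_value=4):
    """Storage in MB (10^6 bytes), assuming float32 values."""
    return sum(a.size for a in arrays) * bytes_per_value / 1e6

rng = np.random.default_rng(0)

# Hypothetical operator: a 512 x 512 matrix with a rapidly decaying spectrum.
# This is synthetic data, not the scGPT operator.
n = 512
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = 1.0 / (1.0 + np.arange(n)) ** 2
W = (Q1 * spectrum) @ Q2.T       # W = Q1 @ diag(spectrum) @ Q2.T

A, B = low_rank_surrogate(W, rank=64)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"dense: {storage_mb(W):.2f} MB, "
      f"rank-64 factors: {storage_mb(A, B):.2f} MB, "
      f"relative error: {rel_err:.2e}")
```

Storing the two factors instead of the dense matrix is what shrinks the footprint. Whether the approximation is acceptable depends on how quickly the real operator's singular values decay, which this synthetic example does not establish.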
Main Conclusions
- The extracted algorithm significantly outperforms established baselines (scVI, Palantir, DPT, CellTypist, PCA, and raw expression) on pseudotime-depth ordering (orientation-independent |ρ|=0.439 vs. 0.331 for the next-best; Wilcoxon BH-q≤2.7×10⁻⁷ on all paired comparisons).
- It leads on key subtype classification endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared with probing frozen scGPT embeddings with a 3-layer MLP, it completes evaluation 34.5× faster and requires ~1,000× fewer trainable parameters.
- Mechanistic analysis reveals the algorithm's core is driven by four interpretable factors (T/lymphoid, B/plasma, granulocytic, monocyte/macrophage) explaining 66.2% of ablation impact, linking model internals to explicit biological programs.
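Two of the statistics quoted above can be made concrete: an orientation-independent |ρ| is the absolute value of a Spearman rank correlation, so a pseudotime that runs backwards scores the same as one that runs forwards, and a BH-q value is a p-value adjusted with the Benjamini-Hochberg procedure. The sketch below is illustrative only. The function names, the toy orderings, and the p-values are hypothetical and are not taken from the paper's benchmarks.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def abs_spearman(x, y):
    """Orientation-independent Spearman correlation: |rho| of the ranks."""
    return abs(pearson(ranks(x), ranks(y)))

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    q = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos                      # 1-based rank among ascending p-values
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Toy data (hypothetical): a depth label and two pseudotimes, one reversed.
depth = [0, 1, 2, 3, 4, 5]
forward = [0.1, 0.2, 0.35, 0.5, 0.8, 0.9]
backward = [-v for v in forward]
rho_forward = abs_spearman(depth, forward)
rho_backward = abs_spearman(depth, backward)

# Hypothetical raw p-values from four paired comparisons.
q_values = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

In practice a library routine such as SciPy's Spearman correlation would replace the hand-written ranking. The pure-Python version is used here only to keep the example dependency-free.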
Abstract: We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT (to our knowledge, the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability). We show that scGPT internally encodes a compact (∼8–10-dimensional) hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel (616 anchors, 564,253 cells) and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel (trustworthiness 0.993, blocked-permutation p=0.0005). To isolate this geometry, we introduce a general three-stage extraction method (direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout) that produces a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering (orientation-independent |ρ|=0.439 versus 0.331 for the next-best alternative; Wilcoxon BH-q≤2.7×10⁻⁷ on all paired comparisons) and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP (172k parameters), the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5× faster (∼3.4 versus ∼118 minutes) with ∼1,000× fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head (L2H5; 17.5→5.9 MB) without statistically significant loss, and further to a rank-64 surrogate (0.73 MB).
Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
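The three-stage structure described in the abstract (a frozen exported operator, a small learned adaptor, and a task readout) can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the abstract does not specify the operator's form, so a softmax query-key attention matrix over gene embeddings is used, the adaptor is a PCA projection, the readout is a ridge regression, and all weights and data are random stand-ins rather than scGPT parameters.

```python
import numpy as np

def softmax_rows(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def export_operator(gene_emb, w_q, w_k):
    """Stage 1 (assumed form): gene-by-gene attention matrix built from frozen weights."""
    q = gene_emb @ w_q
    k = gene_emb @ w_k
    return softmax_rows(q @ k.T / np.sqrt(w_q.shape[1]))

def fit_adaptor(features, n_components):
    """Stage 2 (assumed form): small linear projection fitted by PCA."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components].T

def fit_readout(z, y, ridge=1e-2):
    """Stage 3 (assumed form): ridge regression with a bias column
    (the bias is penalized too, for simplicity)."""
    zb = np.hstack([z, np.ones((z.shape[0], 1))])
    return np.linalg.solve(zb.T @ zb + ridge * np.eye(zb.shape[1]), zb.T @ y)

def predict(x, operator, mean, proj, beta):
    """Apply operator, adaptor, and readout to new cells."""
    z = (x @ operator - mean) @ proj
    return np.hstack([z, np.ones((z.shape[0], 1))]) @ beta

rng = np.random.default_rng(0)
n_genes, d_model, d_head, n_cells = 50, 16, 8, 200
gene_emb = rng.standard_normal((n_genes, d_model))  # stand-in for frozen gene embeddings
w_q = rng.standard_normal((d_model, d_head))         # stand-in for frozen query weights
w_k = rng.standard_normal((d_model, d_head))         # stand-in for frozen key weights
x = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)  # synthetic expression matrix
y = rng.standard_normal(n_cells)                      # synthetic target, e.g. a depth score

operator = export_operator(gene_emb, w_q, w_k)        # computed once, never trained
mean, proj = fit_adaptor(x @ operator, n_components=8)
beta = fit_readout((x @ operator - mean) @ proj, y)
pred = predict(x, operator, mean, proj, beta)
```

The design point this illustrates is the split of responsibilities: the operator is fixed by the foundation model's weights, and only the adaptor and readout are fitted, which keeps the trainable parameter count small.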