Paper List

Bioinformatics

Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

2026-03-10

This work addresses the core challenge of extracting reusable, interpretable, and high-performance biological algorithms from the opaque internal repr...
Bioinformatics

MS2MetGAN: Latent-space adversarial training for metabolite–spectrum matching in MS/MS database search

2026-03-07

This paper addresses the critical bottleneck in metabolite identification: the generation of high-quality negative training samples that are structura...
Neuroscience

Toward Robust, Reproducible, and Widely Accessible Intracranial Language Brain-Computer Interfaces: A Comprehensive Review of Neural Mechanisms, Hardware, Algorithms, Evaluation, Clinical Pathways and Future Directions

2026-03-03

This review addresses the core challenge of fragmented and heterogeneous evidence that hinders the clinical translation of intracranial language BCIs,...
Mathematical Biology

Less Is More in Chemotherapy of Breast Cancer

2026-03-03

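Addresses the oversimplification of existing tumor-immune models by incorporating cell-cycle time delays and competition terms, enabling quantitative comparison of chemotherapy regimens.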
Bioinformatics

Fold-CP: A Context Parallelism Framework for Biomolecular Modeling

2026-03

This paper addresses the critical bottleneck of GPU memory limitations that restrict AlphaFold 3-like models to processing only a few thousand residue...
Bioinformatics

Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

2026-03

This paper addresses the core pain point of fragmented biomedical data by constructing and federating large-scale, open knowledge graphs to enable sea...
Bioinformatics

Predictive Analytics for Foot Ulcers Using Time-Series Temperature and Pressure Data

2026-02-27

This paper addresses the critical need for continuous, real-time monitoring of diabetic foot health by developing an unsupervised anomaly detection fr...
Bioinformatics

Hypothesis-Based Particle Detection for Accurate Nanoparticle Counting and Digital Diagnostics

2025-12-05

This paper addresses the core challenge of achieving accurate, interpretable, and training-free nanoparticle counting in digital diagnostic assays, wh...

Journal: ArXiv Preprint

Publication date: 2026-03-10

Bioinformatics, AI/ML

Omics Data Discovery Agents

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA

Alexandre Hutton, Jesse G. Meyer

30-Second Summary

IN SHORT: This paper addresses the core challenge of making published omics data computationally reusable by automating the extraction, quantification, and integration of datasets scattered across unstructured literature and supplementary materials.

Key Innovations

Methodology: Introduces an LLM-agent framework with MCP servers that automates the entire pipeline from literature mining to data quantification and cross-study analysis.
Methodology: Demonstrates automated parameter extraction from article text for containerized quantification pipelines (MaxQuant/DIA-NN), achieving 63% overlap in differentially expressed proteins when matching preprocessing methods.
Biology: Identifies consistent protein regulation patterns (CLU, TGFBI, AMBP, MYH10, PRELP, Col14A1) across multiple liver fibrosis studies through automated cross-study comparison.

Main Conclusions

Achieved 80% precision for automated identification of datasets from standard repositories (PRIDE, MassIVE, GEO) across 39 proteomics articles.
Demonstrated 63% overlap in differentially expressed proteins when agents matched article preprocessing methods, compared to 37% overlap without explicit instruction.
Identified 6 consistently upregulated proteins (CLU, TGFBI, AMBP, MYH10, PRELP, Col14A1) across three independent liver fibrosis studies through automated cross-study analysis.
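
The two headline metrics above can be illustrated with a minimal Python sketch. The summary does not say how "overlap" is defined, so the function below assumes one plausible reading: the fraction of an article's reported differentially expressed proteins that the re-analysis also recovers. The function names, dataset labels, and protein identifiers are hypothetical placeholders, not data or code from the paper.

```python
def precision(predicted, relevant):
    """Fraction of predicted items that are correct: TP / (TP + FP)."""
    predicted, relevant = set(predicted), set(relevant)
    if not predicted:
        return 0.0
    return len(predicted & relevant) / len(predicted)


def overlap_fraction(reported, reanalyzed):
    """Share of reported differentially expressed proteins also found on re-analysis.

    Assumed definition; the source does not specify its overlap measure.
    """
    reported, reanalyzed = set(reported), set(reanalyzed)
    if not reported:
        return 0.0
    return len(reported & reanalyzed) / len(reported)


# Hypothetical dataset labels and protein IDs, for illustration only.
agent_datasets = ["DS1", "DS2", "DS3", "DS4", "DS5"]
true_datasets = ["DS1", "DS2", "DS3", "DS4", "DS9"]
dataset_precision = precision(agent_datasets, true_datasets)

reported_deps = ["P1", "P2", "P3", "P4"]
reanalyzed_deps = ["P2", "P3", "P4", "P7"]
dep_overlap = overlap_fraction(reported_deps, reanalyzed_deps)

print(f"precision = {dataset_precision:.2f}, overlap = {dep_overlap:.2f}")
```

A symmetric measure such as the Jaccard index (intersection over union) would give a different number for the same two sets, so the definition matters when comparing overlap figures across studies.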

Research gap: Current literature mining tools focus on metadata linking or summarization but lack the capability to autonomously identify datasets, extract processing parameters, and execute reproducible analyses at scale, leaving most published omics data functionally inaccessible.

Abstract: The biomedical literature contains a vast collection of omics studies, yet most published data remain functionally inaccessible for computational reuse. When raw data are deposited in public repositories, essential information for reproducing reported results is dispersed across main text, supplementary files, and code repositories. In rarer instances where intermediate data are made available (e.g., protein abundance files), their location is irregular. In this article, we present an agentic framework that fetches omics-related articles and transforms the unstructured information into searchable research objects. Our system employs large language model (LLM) agents with access to tools for fetching omics studies, extracting article metadata, identifying and downloading published data, executing containerized quantification pipelines, and running analyses to address novel questions. We demonstrate automated metadata extraction from PubMed Central articles, achieving 80% precision for dataset identification from standard data repositories. Using Model Context Protocol (MCP) servers to expose containerized analysis tools, our agents were able to identify a set of relevant articles, download the associated datasets, and re-quantify the proteomics data. The results had a 63% overlap in differentially expressed proteins when matching reported preprocessing methods. Furthermore, we show that agents can identify semantically similar studies, determine data compatibility, and perform cross-study comparisons, revealing consistent protein regulation patterns in liver fibrosis. This work establishes a foundation for converting the static biomedical literature into an executable, queryable resource that enables automated data reuse at scale.
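
The cross-study comparison mentioned in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: it assumes each study has already been reduced to a mapping from protein name to log2 fold change, and it calls a protein "consistent" when it passes a threshold with the same sign in every study. The function name, the threshold, and all protein names and values are hypothetical.

```python
def consistently_regulated(studies, min_abs_log2fc=1.0):
    """Return proteins that pass the threshold with the same sign in every study.

    `studies` is a list of dicts mapping protein name -> log2 fold change
    (an assumed input format, not one described in the source).
    """
    if not studies:
        return {}
    # Only proteins quantified in every study can be compared.
    shared = set(studies[0])
    for study in studies[1:]:
        shared &= set(study)

    result = {}
    for protein in sorted(shared):
        values = [study[protein] for study in studies]
        if all(v >= min_abs_log2fc for v in values):
            result[protein] = "up"
        elif all(v <= -min_abs_log2fc for v in values):
            result[protein] = "down"
    return result


# Made-up fold changes with placeholder protein names, for illustration only.
study_a = {"PROT_A": 1.8, "PROT_B": -1.4, "PROT_C": 2.1, "PROT_D": 0.3}
study_b = {"PROT_A": 1.2, "PROT_B": -2.0, "PROT_C": -1.5, "PROT_D": 1.1}
study_c = {"PROT_A": 2.5, "PROT_B": -1.1, "PROT_C": 1.9}

consistent = consistently_regulated([study_a, study_b, study_c])
print(consistent)
```

Requiring agreement in every study is a strict choice; a real analysis might instead accept agreement in a majority of studies or combine effect sizes statistically.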