Omics Data Discovery Agents
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
30-Second Summary
IN SHORT: This paper addresses the core challenge of making published omics data computationally reusable by automating the extraction, quantification, and integration of datasets scattered across unstructured literature and supplementary materials.
Key Innovations
- Methodology: Introduces an LLM-agent framework with MCP servers that automates the entire pipeline from literature mining to data quantification and cross-study analysis.
- Methodology: Demonstrates automated extraction of parameters from article text for containerized quantification pipelines (MaxQuant/DIA-NN), achieving 63% overlap in differentially expressed proteins when the agents matched the reported preprocessing methods.
- Biology: Identifies consistent protein regulation patterns (CLU, TGFBI, AMBP, MYH10, PRELP, Col14A1) across multiple liver fibrosis studies through automated cross-study comparison.
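The cross-study comparison idea can be illustrated with a minimal sketch. It assumes each study has already been reduced to a table of significant proteins with log2 fold changes; the function name `consistent_proteins`, the study names, and all numeric values are hypothetical and are not taken from the paper, whose agents may apply a different procedure.

```python
from collections import defaultdict


def consistent_proteins(studies, min_studies=None):
    """Return proteins regulated in the same direction across studies.

    studies: dict mapping study name -> dict of protein -> log2 fold change
             (assumed to contain only significant proteins).
    min_studies: minimum number of studies a protein must appear in;
                 defaults to all studies.
    """
    if min_studies is None:
        min_studies = len(studies)

    # Collect the sign of the fold change for each protein in each study.
    directions = defaultdict(list)
    for table in studies.values():
        for protein, log2fc in table.items():
            if log2fc != 0:
                directions[protein].append(1 if log2fc > 0 else -1)

    # Keep proteins seen often enough and always with the same sign.
    result = {}
    for protein, signs in directions.items():
        if len(signs) >= min_studies and len(set(signs)) == 1:
            result[protein] = "up" if signs[0] > 0 else "down"
    return result


# Toy fold changes, invented for illustration only.
studies = {
    "study_a": {"CLU": 1.2, "TGFBI": 0.9, "ALB": -0.8},
    "study_b": {"CLU": 0.7, "TGFBI": 1.1, "ALB": 0.4},
    "study_c": {"CLU": 1.5, "TGFBI": 0.6},
}
print(consistent_proteins(studies))
```

Requiring agreement in every study is the strictest choice; lowering `min_studies` trades strictness for coverage when studies quantify different protein sets.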
Key Findings
- Achieved 80% precision for automated identification of datasets from standard repositories (PRIDE, MassIVE, GEO) across 39 proteomics articles.
- Demonstrated 63% overlap in differentially expressed proteins when agents matched article preprocessing methods, compared to 37% overlap without explicit instruction.
- Identified 6 consistently upregulated proteins (CLU, TGFBI, AMBP, MYH10, PRELP, Col14A1) across three independent liver fibrosis studies through automated cross-study analysis.
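For readers who want to see how figures of this kind are typically computed, here is a minimal sketch. Precision follows its standard definition. The summary does not state how "overlap" is defined, so two common candidates are shown (fraction of reported proteins recovered, and Jaccard index); which one the authors used is an assumption left open here. All identifiers and protein labels below are toy values.

```python
def precision(predicted, truth):
    """Fraction of predicted items that are correct: TP / (TP + FP)."""
    predicted, truth = set(predicted), set(truth)
    if not predicted:
        return 0.0
    return len(predicted & truth) / len(predicted)


def recovered_fraction(reanalysis, reported):
    """Fraction of the reported proteins that the re-analysis also found."""
    reanalysis, reported = set(reanalysis), set(reported)
    if not reported:
        return 0.0
    return len(reanalysis & reported) / len(reported)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy dataset identifiers: four predicted, three of them correct.
predicted_ids = {"PXD000001", "PXD000002", "MSV000000001", "GSE00001"}
true_ids = {"PXD000001", "PXD000002", "MSV000000001"}
print(precision(predicted_ids, true_ids))

# Toy differentially expressed protein sets (placeholder labels).
reported = {"P1", "P2", "P3", "P4"}
reanalysis = {"P2", "P3", "P4", "P5"}
print(recovered_fraction(reanalysis, reported))
print(jaccard(reanalysis, reported))
```

The two overlap definitions can differ noticeably on the same data, so a reported percentage is only comparable across studies once the definition is fixed.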
Abstract: The biomedical literature contains a vast collection of omics studies, yet most published data remain functionally inaccessible for computational reuse. When raw data are deposited in public repositories, essential information for reproducing reported results is dispersed across main text, supplementary files, and code repositories. In the rarer instances where intermediate data are made available (e.g., protein abundance files), their location is irregular. In this article, we present an agentic framework that fetches omics-related articles and transforms the unstructured information into searchable research objects. Our system employs large language model (LLM) agents with access to tools for fetching omics studies, extracting article metadata, identifying and downloading published data, executing containerized quantification pipelines, and running analyses to address novel questions. We demonstrate automated metadata extraction from PubMed Central articles, achieving 80% precision for dataset identification from standard data repositories. Using Model Context Protocol (MCP) servers to expose containerized analysis tools, our agents were able to identify a set of relevant articles, download the associated datasets, and re-quantify the proteomics data. The results had a 63% overlap in differentially expressed proteins when matching reported preprocessing methods. Furthermore, we show that agents can identify semantically similar studies, determine data compatibility, and perform cross-study comparisons, revealing consistent protein regulation patterns in liver fibrosis. This work establishes a foundation for converting the static biomedical literature into an executable, queryable resource that enables automated data reuse at scale.
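As a rough illustration of the dataset-identification step, the sketch below scans article text for repository accession numbers. It is a deliberately simplified, rule-based stand-in and not the paper's method, which relies on LLM agents. The patterns assume the usual accession shapes (ProteomeXchange `PXD` plus six digits, as used by PRIDE; MassIVE `MSV` plus nine digits; GEO series `GSE` plus digits). The name `find_accessions` and the sample sentence are hypothetical.

```python
import re

# Assumed accession formats; real articles may use other identifier types.
ACCESSION_PATTERNS = {
    "PRIDE": re.compile(r"\bPXD\d{6}\b"),
    "MassIVE": re.compile(r"\bMSV\d{9}\b"),
    "GEO": re.compile(r"\bGSE\d+\b"),
}


def find_accessions(text):
    """Return a dict of repository -> sorted unique accessions found in text."""
    found = {}
    for repo, pattern in ACCESSION_PATTERNS.items():
        hits = sorted(set(pattern.findall(text)))
        if hits:
            found[repo] = hits
    return found


# Invented sample sentence for illustration only.
sample = (
    "Data are available via ProteomeXchange with identifier PXD012345 "
    "and MassIVE MSV000012345; transcriptomes are under GSE123456."
)
print(find_accessions(sample))
```

A pattern match only shows that an accession is mentioned, not that it is the study's own deposited dataset, which is one reason a precision figure below 100% is plausible for this task.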