Paper List
-
GOPHER: Optimization-based Phenotype Randomization for Genome-Wide Association Studies with Differential Privacy
This paper addresses the core challenge of balancing rigorous privacy protection with data utility when releasing full GWAS summary statistics, overco...
-
Real-time Cricket Sorting by Sex: A Low-cost Embedded Solution Using YOLOv8 and Raspberry Pi
This paper addresses the critical bottleneck in industrial insect farming: the lack of automated, real-time sex sorting systems for Acheta domesticus ...
-
Training Dynamics of Learning 3D-Rotational Equivariance
This work addresses the core dilemma of whether to use computationally expensive equivariant architectures or faster symmetry-agnostic models with dat...
-
Fast and Accurate Node-Age Estimation Under Fossil Calibration Uncertainty Using the Adjusted Pairwise Likelihood
This paper addresses the dual challenge of computational inefficiency and sensitivity to fossil calibration errors in Bayesian divergence time estimat...
-
Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training
This paper addresses the core challenge of accurately predicting protein fitness with only a handful of experimental observations, where data collecti...
-
scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing
This paper addresses the critical gap of fragmented and non-standardized benchmarking in single-cell RNA-seq clustering, which hinders objective compa...
-
Simulation and inference methods for non-Markovian stochastic biochemical reaction networks
This paper addresses the computational bottleneck of simulating and performing Bayesian inference for non-Markovian biochemical systems with history-d...
-
Assessment of Simulation-based Inference Methods for Stochastic Compartmental Models
This paper addresses the core challenge of performing accurate Bayesian parameter inference for stochastic epidemic models when the likelihood functio...
Omics Data Discovery Agents
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
30-Second Quick Read
IN SHORT: This paper addresses the core challenge of making published omics data computationally reusable by automating the extraction, quantification, and integration of datasets scattered across unstructured literature and supplementary materials.
Key Innovations
- Methodology: Introduces an LLM-agent framework with MCP servers that automates the entire pipeline from literature mining to data quantification and cross-study analysis.
- Methodology: Demonstrates automated parameter extraction from article text for containerized quantification pipelines (MaxQuant/DIA-NN), achieving 63% overlap in differentially expressed proteins when matching preprocessing methods.
- Biology: Identifies consistent protein regulation patterns (CLU, TGFBI, AMBP, MYH10, PRELP, Col14A1) across multiple liver fibrosis studies through automated cross-study comparison.
Main Conclusions
- Achieved 80% precision for automated identification of datasets from standard repositories (PRIDE, MassIVE, GEO) across 39 proteomics articles.
- Demonstrated 63% overlap in differentially expressed proteins when agents matched article preprocessing methods, compared to 37% overlap without explicit instruction.
- Identified 6 consistently upregulated proteins (CLU, TGFBI, AMBP, MYH10, PRELP, Col14A1) across three independent liver fibrosis studies through automated cross-study analysis.
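The overlap figures above reduce to a simple set comparison between the article's reported differentially expressed (DE) proteins and the agent's re-quantified list. A minimal sketch, assuming overlap is measured as the fraction of reported DE proteins recovered; the helper name and the placeholder proteins P1–P3 are hypothetical, not from the paper:

```python
def overlap_fraction(reported, requantified):
    """Fraction of the article's reported DE proteins that the
    agent's re-analysis also calls differentially expressed."""
    reported, requantified = set(reported), set(requantified)
    if not reported:
        return 0.0
    return len(reported & requantified) / len(reported)

# Hypothetical example: 5 of 8 reported proteins recovered -> 0.625
reported_de = {"CLU", "TGFBI", "AMBP", "MYH10", "PRELP", "COL14A1", "P1", "P2"}
agent_de = {"CLU", "TGFBI", "AMBP", "MYH10", "PRELP", "P3"}

print(overlap_fraction(reported_de, agent_de))
```

The paper does not specify its exact overlap definition; a symmetric alternative (e.g. Jaccard index, intersection over union) would give a lower number for the same lists.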
Abstract: The biomedical literature contains a vast collection of omics studies, yet most published data remain functionally inaccessible for computational reuse. When raw data are deposited in public repositories, essential information for reproducing reported results is dispersed across the main text, supplementary files, and code repositories. In the rarer instances where intermediate data are made available (e.g., protein abundance files), their location is irregular. In this article, we present an agentic framework that fetches omics-related articles and transforms their unstructured information into searchable research objects. Our system employs large language model (LLM) agents with access to tools for fetching omics studies, extracting article metadata, identifying and downloading published data, executing containerized quantification pipelines, and running analyses to address novel questions. We demonstrate automated metadata extraction from PubMed Central articles, achieving 80% precision for dataset identification from standard data repositories. Using model context protocol (MCP) servers to expose containerized analysis tools, our set of agents was able to identify a set of relevant articles, download the associated datasets, and re-quantify the proteomics data. The results had a 63% overlap in differentially expressed proteins when matching reported preprocessing methods. Furthermore, we show that agents can identify semantically similar studies, determine data compatibility, and perform cross-study comparisons, revealing consistent protein regulation patterns in liver fibrosis. This work establishes a foundation for converting the static biomedical literature into an executable, queryable resource that enables automated data reuse at scale.