Paper List

Systems Biology

Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study

2026-03-11

This paper addresses the critical challenge of numerical ill-conditioning and multicollinearity in library-based sparse regression methods (e.g., SIND...
Neuroimaging

Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry

2026-03-11

This paper addresses the computational bottleneck in voxel-based neuroimaging analysis by providing a method that delivers exact cluster-size retrieva...
Bioinformatics

abx_amr_simulator: A simulation environment for antibiotic prescribing policy optimization under antimicrobial resistance

2026-03-11

This paper addresses the critical challenge of quantitatively evaluating antibiotic prescribing policies under realistic uncertainty and partial obser...
Bioinformatics

PesTwin: a biology-informed Digital Twin for enabling precision farming

2026-03-11

This paper addresses the critical bottleneck in precision agriculture: the inability to accurately forecast pest outbreaks in real-time, leading to su...
Bioinformatics

Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation

2026-03-10

This paper addresses the core challenge of generating physically plausible 3D molecular structures by bridging the gap between autoregressive methods ...
Bioinformatics

Omics Data Discovery Agents

2026-03-10

This paper addresses the core challenge of making published omics data computationally reusable by automating the extraction, quantification, and inte...
Biophysics

Single-cell directional sensing at ultra-low chemoattractant concentrations from extreme first-passage events

2026-03-10

This work addresses the core challenge of how a cell can rapidly and accurately determine the direction of a chemoattractant source when the signal is...
Bioinformatics

SDSR: A Spectral Divide-and-Conquer Approach for Species Tree Reconstruction

2026-03-10

This paper addresses the computational bottleneck in reconstructing species trees from thousands of species and multiple genes by introducing a scalab...


Journal: arXiv Preprint

Publication date: 2026-03-11

Bioinformatics, Genomics

SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

DEIB, Politecnico di Milano | Health Data Science Centre, Human Technopole | Genomics Research Centre, Human Technopole | MOX - Department of Mathematics, Politecnico di Milano | Department of Public Health and Primary Care, University of Cambridge

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio

30-Second Read

IN SHORT: This paper addresses the core challenge of generating privacy-preserving synthetic genotype data that maintains both statistical fidelity and downstream predictive utility for supervised tasks like polygenic risk scoring.

Key Innovations

Methodology: Introduces a two-stage conditional latent diffusion framework that combines GWAS-guided variant selection (1,024–2,048 SNPs) with VAE compression and phenotype-conditioned generation via classifier-free guidance.
Methodology: Generates genotypes conditioned on phenotype rather than sampling unconditionally, so the synthetic genotypes can be used directly for downstream disease prediction without a separate phenotype-assignment mechanism.
Biology: Demonstrates that GWAS-guided selection of trait-associated SNPs yields predictive performance approaching that of genome-wide methods, which use 2–6× more variants, offering a favorable computational trade-off.
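The classifier-free guidance step named above can be sketched in a few lines. This is a minimal illustration, not the SNPgen implementation: the function names `cfg_combine` and `maybe_drop_label`, the null label value `-1`, and the guidance-scale convention (0 gives the unconditional prediction, 1 the conditional one, larger values extrapolate) are all assumptions made for the example. The source states only that the latent diffusion model is conditioned on binary disease labels via classifier-free guidance.

```python
import random


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and label-conditioned noise predictions.

    Assumed convention: scale 0 -> unconditional, 1 -> conditional,
    values above 1 push further toward the conditional prediction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def maybe_drop_label(label, p_drop, rng, null_label=-1):
    """Training-time label dropout (hypothetical helper).

    With probability p_drop the disease label is replaced by a null token,
    so one network learns both conditional and unconditional predictions.
    """
    return null_label if rng.random() < p_drop else label


# Toy noise predictions for a 3-dimensional latent (illustrative values only).
eps_u = [0.0, 1.0, -1.0]
eps_c = [1.0, 1.0, 0.0]
guided = cfg_combine(eps_u, eps_c, 2.0)

rng = random.Random(0)
always_dropped = maybe_drop_label(1, 1.0, rng)
never_dropped = maybe_drop_label(1, 0.0, rng)
```

At sampling time the guided prediction would replace the plain conditional prediction in each denoising step; the scale trades sample diversity against adherence to the requested phenotype.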

Main Conclusions

In train-on-synthetic, test-on-real (TSTR) evaluation across four complex diseases (CAD, BC, T1D, T2D), models trained on synthetic data performed close to models trained on real data; XGBoost trained on synthetic data reached AUCs of 0.587±0.019 for T2D and 0.594±0.011 for CAD.
Privacy analysis showed zero identical matches and near-random membership inference (AUC ≈ 0.50), alongside preserved LD structure and high allele frequency correlation (r≥0.95) with source data, supporting the method's privacy-preserving claim.
In controlled simulations with known causal effects, synthetic data showed strong agreement with real-data effect estimates (Pearson r=0.835), exceeding VAE-reconstructed data (r=0.726), demonstrating faithful recovery of genetic association structures.
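The TSTR protocol behind the first conclusion can be illustrated with a toy example. This is a sketch under stated assumptions: the paper reports XGBoost, whereas the scorer below is a deliberately simple mean-difference weighting chosen to keep the example dependency-free, and all genotype rows and labels are invented toy values, not results from the paper. The function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference_weights(X, y):
    """Toy stand-in for a classifier: per-SNP weight is the mean dosage
    in cases minus the mean dosage in controls."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    return [
        sum(r[j] for r in cases) / len(cases)
        - sum(r[j] for r in controls) / len(controls)
        for j in range(len(X[0]))
    ]


def score(X, weights):
    """Weighted dosage sum per individual."""
    return [sum(w * g for w, g in zip(weights, row)) for row in X]


# Toy 0/1/2 dosage matrices: train on "synthetic", evaluate on "real".
X_syn = [[2, 0, 1], [2, 1, 1], [1, 0, 0], [0, 0, 1], [0, 1, 2], [1, 2, 1]]
y_syn = [1, 1, 1, 0, 0, 0]
X_real = [[2, 0, 0], [1, 0, 1], [0, 1, 1], [0, 2, 2]]
y_real = [1, 1, 0, 0]

weights = fit_mean_difference_weights(X_syn, y_syn)
tstr_auc = roc_auc(y_real, score(X_real, weights))
```

In a real comparison the same model class would also be trained on real data and evaluated on the same held-out real individuals, so that the two AUCs are directly comparable.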

Research gap: Existing synthetic genotype generators either operate unconditionally (without phenotype alignment) or rely on unsupervised compression methods that prioritize population structure over genotype-phenotype relationships, creating a disconnect between statistical fidelity and downstream task utility.

Abstract:

Motivation: Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally (producing samples without phenotype alignment) or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility.

Results: We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024–2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use 2–6× more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC ≈ 0.50), preserved linkage disequilibrium structure, and high allele frequency correlation (r≥0.95) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.

Availability and implementation: Code available at https://github.com/ht-diva/SNPgen.

Contact: andrea.lampis@polimi.it

Supplementary information: Supplementary data are available in the Appendix.
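Two of the checks named in the abstract, allele frequency correlation and identical-match counting, are simple enough to sketch directly. The example below is an illustration on invented toy dosage matrices, not the authors' evaluation code; the function names are hypothetical, and genotypes are assumed to be coded as 0/1/2 alternate-allele dosages.

```python
from math import sqrt


def allele_frequencies(genotypes):
    """Per-SNP alternate-allele frequency from rows of 0/1/2 dosages."""
    n = len(genotypes)
    return [
        sum(row[j] for row in genotypes) / (2 * n)
        for j in range(len(genotypes[0]))
    ]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def count_identical(synthetic, real):
    """Number of synthetic individuals that exactly copy a real genotype row."""
    real_rows = {tuple(row) for row in real}
    return sum(tuple(row) in real_rows for row in synthetic)


# Toy data: 4 individuals x 4 SNPs each (illustrative values only).
real = [[0, 1, 2, 0], [1, 1, 2, 0], [0, 2, 2, 1], [0, 1, 1, 0]]
synth = [[0, 1, 2, 1], [1, 2, 2, 0], [0, 1, 1, 1], [0, 0, 2, 0]]

af_r = pearson_r(allele_frequencies(real), allele_frequencies(synth))
n_copies = count_identical(synth, real)
```

Exact-match counting only detects verbatim copies; the membership inference test mentioned in the abstract is the stronger check, since it also probes near-copies.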

Code