Paper List

Theoretical Biology

A Theoretical Framework for the Formation of Large Animal Groups: Topological Coordination, Subgroup Merging, and Velocity Inheritance

2025-11-28

This paper addresses the core problem of how large, coordinated animal groups form in nature, challenging the classical view of gradual aggregation by...
Bioinformatics

CONFIDE: Hallucination Assessment for Reliable Biomolecular Structure Prediction and Design

2025-11-20

This paper addresses the critical limitation of current protein structure prediction models (like AlphaFold3) where high-confidence scores (pLDDT) can...
Bioinformatics

Generative design and validation of therapeutic peptides for glioblastoma based on a potential target ATP5A

2025-11-19

This paper addresses the critical bottleneck in therapeutic peptide design: how to efficiently optimize lead peptides with geometric constraints while...
Bioinformatics

Pharmacophore-based design by learning on voxel grids

2025-11-19

This paper addresses the computational bottleneck and limited novelty in conventional pharmacophore-based virtual screening by introducing a voxel cap...
Human-Computer Interaction

Human-Centred Evaluation of Text-to-Image Generation Models for Self-expression of Mental Distress: A Dataset Based on GPT-4o

2025-10-26

This paper addresses the critical gap in evaluating how AI-generated images can effectively support cross-cultural mental distress communication, part...
Bioinformatics

ANNE Apnea Paper

2025-03

This paper addresses the core challenge of achieving accurate, event-level sleep apnea detection and characterization using a non-intrusive, multimoda...
Bioinformatics

DeeDeeExperiment: Building an infrastructure for integrating and managing omics data analysis results in R/Bioconductor

2025

This paper addresses the critical bottleneck of managing and organizing the growing volume of differential expression and functional enrichment analys...
Bioinformatics

Cross-Species Antimicrobial Resistance Prediction from Genomic Foundation Models

2025

This paper addresses the core challenge of predicting antimicrobial resistance across phylogenetically distinct bacterial species, where traditional m...


Journal: arXiv Preprint

Publication date: 2026-03-11

Bioinformatics, Deep Learning

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Department of Computer Science, Princeton University

Jonathan Liu, Kia Ghods

30-Second Summary

IN SHORT: This paper addresses the challenge of efficiently generating novel, cell-type-specific regulatory DNA sequences with high predicted activity while minimizing memorization of training data.

Key Innovations

Methodology: Introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN input encoder for DNA sequence generation, achieving 60x faster convergence and 39% lower validation loss (0.023 vs. 0.037) compared to U-Net baselines.
Methodology: Demonstrates a 38x improvement in predicted regulatory activity (Enformer scores) through DDPO finetuning using Enformer as a reward model, validated by cross-task generalization to DRAKES.
Biology: Reduces sequence memorization, measured as the share of generated sequences that align to training data via BLAT, from 5.3% (U-Net) to 1.7% (DiT) while maintaining realistic motif usage (JS distance ~0.21-0.22); the reduction is attributed to the transformer's global attention mechanism.
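
The motif-usage figure above is a Jensen-Shannon (JS) distance between two frequency distributions. The sketch below shows one common way to compute it, as the square root of the JS divergence with base-2 logarithms, so values fall between 0 (identical) and 1 (disjoint). The logarithm base, the motif binning, and the counts are assumptions for illustration; the summary does not say how the paper computed its values.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the JS divergence, using log base 2."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Normalize raw counts to probabilities.
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical motif counts, for illustration only (not data from the paper).
train_counts = [120, 80, 40, 10]
generated_counts = [100, 90, 30, 30]
print(round(js_distance(train_counts, generated_counts), 3))
```

A lower distance means the generated sequences use motifs in proportions closer to the training set.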

Main Conclusions

The CNN encoder is critical for DiT performance; its removal increases validation loss by 70% (from 0.023 to 0.038-0.039), regardless of positional embedding choice (RoPE or learned).
DDPO finetuning boosts median predicted in-situ activity by 38x (e.g., from ~0.05 to ~4.76 in K562), with over 75% of generated sequences exceeding the baseline median across all cell types.
Cross-validation against DRAKES shows the model captures 70% (3.86/5.6) of the independent predictor's signal, confirming generalization beyond the reward model (Enformer).
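
The percentages above can be rechecked from the rounded figures quoted in this summary. The short script below is only an arithmetic check, not a reproduction of any experiment. With these rounded inputs it gives roughly a 38% loss reduction for DiT relative to U-Net, a 65% to 70% loss increase without the CNN encoder, and a DRAKES ratio of about 69%. These are close to the reported 39%, 70%, and 70%; the small gaps presumably come from rounding of the quoted loss values.

```python
def relative_change(old, new):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Rounded values quoted in the summary.
unet_loss, dit_loss = 0.037, 0.023
no_cnn_low, no_cnn_high = 0.038, 0.039
drakes_ratio = 3.86 / 5.6

print(f"DiT vs. U-Net loss: {relative_change(unet_loss, dit_loss):+.1%}")
print(f"No CNN encoder (low): {relative_change(dit_loss, no_cnn_low):+.1%}")
print(f"No CNN encoder (high): {relative_change(dit_loss, no_cnn_high):+.1%}")
print(f"DRAKES signal captured: {drakes_ratio:.1%}")
```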

Research gap: Existing diffusion models for regulatory DNA design (e.g., DNA-Diffusion) rely on U-Net backbones with fixed receptive fields, which struggle to model long-distance DNA interactions and exhibit higher memorization rates. Efficient, controllable generation of novel, functional sequences remains a bottleneck.

Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion (DaSilva et al., 2025) with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
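
As a rough illustration of the kind of input a 2D CNN encoder could consume, the sketch below one-hot encodes a DNA string into a 4 x L matrix (rows A, C, G, T). This is an assumption: the abstract does not state the input representation, and one-hot encoding is shown only because it is a common choice for sequence models. The function name `one_hot` and the example sequence are hypothetical.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Return a 4 x len(seq) matrix (rows A, C, G, T) of 0/1 integers."""
    seq = seq.upper()
    matrix = [[0] * len(seq) for _ in ALPHABET]
    for pos, base in enumerate(seq):
        row = ALPHABET.find(base)
        if row == -1:
            # Ambiguity codes such as N are rejected in this sketch.
            raise ValueError(f"unexpected base {base!r} at position {pos}")
        matrix[row][pos] = 1
    return matrix


# Short hypothetical sequence; the paper's sequences are 200 bp long.
example = one_hot("ACGTAC")
for base, row in zip(ALPHABET, example):
    print(base, row)
```

Under this assumed encoding, a 200 bp sequence would become a 4 x 200 matrix with exactly one nonzero entry per column.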