Paper List

Bioinformatics

Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era

2026-03-12

This paper addresses the core pain point of efficiently extracting standardized, comparable features from massive (terabyte to petabyte-scale) biomedi...

Biophysics

Topological Enhancement of Protein Kinetic Stability

2026-03-12

This work addresses the long-standing puzzle of why knotted proteins exist by demonstrating that deep knots provide a functional advantage through enh...

Bioinformatics

A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization

2026-03-12

This paper addresses the critical limitation of existing TF binding prediction methods that treat transcription factors as independent entities, faili...

Mathematical Biology

Social Distancing Equilibria in Games under Conventional SI Dynamics

2026-03-12

This paper solves the core problem of proving the existence and uniqueness of Nash equilibria in finite-duration SI epidemic games, showing they are a...

Computational Chemistry

Binding Free Energies without Alchemy

2026-03-12

This paper addresses the core bottleneck of computational expense in Absolute Binding Free Energy calculations by eliminating the need for numerous al...

Structural Biology

SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules

2026-03-12

This paper addresses the core bottleneck in cryo-EM helical reconstruction: eliminating the dependency on accurate initial symmetry parameter estimati...

Bioinformatics

Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

2026-03-12

This paper addresses the critical gap in evaluating AI-guided scientific selection strategies under realistic budget constraints, where existing metri...

Bioinformatics

Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration

2026-03-12

This paper addresses the core challenge of accurately decomposing shared (joint) and dataset-specific (individual) sources of variation in multi-modal...


Journal: ArXiv Preprint

Publication date: 2026-03-11

Bioinformatics, Deep Learning

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Department of Computer Science, Princeton University

Jonathan Liu, Kia Ghods

30-Second Summary

IN SHORT: This paper addresses the challenge of efficiently generating novel, cell-type-specific regulatory DNA sequences with high predicted activity while minimizing memorization of training data.

Key Innovations

Methodology: Introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN input encoder for DNA sequence generation, achieving 60× faster convergence and 39% lower validation loss (0.023 vs. 0.037) compared to U-Net baselines.
Methodology: Demonstrates a 38× improvement in predicted regulatory activity (Enformer scores) through DDPO finetuning with Enformer as the reward model, validated by cross-task generalization to DRAKES.
Biology: Reduces sequence memorization from 5.3% (U-Net) to 1.7% (DiT), measured as the share of generated sequences that align to training data via BLAT, while maintaining realistic motif usage (JS distance ~0.21-0.22). The reduction is attributed to the transformer's global attention mechanism.
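The motif-usage comparison above is reported as a Jensen-Shannon (JS) distance. The summary does not say how motif frequencies were counted or which logarithm base was used, so the following is only a minimal sketch under assumptions: motif usage is given as raw counts per motif, and base-2 logarithms are used so that the distance lies in [0, 1]. The function name `js_distance` and the example counts are hypothetical.

```python
import math


def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two count vectors.

    Assumption for this sketch: base-2 logs, so the result is in [0, 1].
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("count vectors must have the same length")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("each count vector must have a positive sum")

    # Normalize counts into probability distributions.
    p = [c / p_total for c in p_counts]
    q = [c / q_total for c in q_counts]
    # Mixture distribution used by the JS divergence.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence.
    return math.sqrt(max(divergence, 0.0))


# Hypothetical motif counts for training vs. generated sequences.
training_motifs = [120, 80, 40, 10]
generated_motifs = [100, 90, 30, 30]
print(round(js_distance(training_motifs, generated_motifs), 3))
```

Identical usage profiles give a distance of 0 and completely disjoint profiles give 1, so values around 0.21-0.22 sit toward the similar end of that scale under these assumptions.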

Main Conclusions

The CNN encoder is critical for DiT performance; its removal increases validation loss by 70% (from 0.023 to 0.038-0.039), regardless of positional embedding choice (RoPE or learned).
DDPO finetuning boosts median predicted in-situ activity by 38x (e.g., from ~0.05 to ~4.76 in K562), with over 75% of generated sequences exceeding the baseline median across all cell types.
Cross-validation against DRAKES shows the model captures roughly 70% (3.86/5.6) of the independent predictor's signal, confirming generalization beyond the reward model (Enformer).
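The percentages quoted above can be recomputed from the rounded values given in this summary. The sketch below does only that arithmetic. The variable names are illustrative, and because the inputs are already rounded, the results need not match the headline figures exactly.

```python
def relative_change(old, new):
    """Fractional change from old to new (0.70 means a 70% increase)."""
    return (new - old) / old


# Values as quoted in the summary (already rounded by the source).
dit_loss = 0.023
unet_loss = 0.037
no_cnn_losses = (0.038, 0.039)

# Validation-loss reduction of the DiT relative to the U-Net.
loss_reduction = -relative_change(unet_loss, dit_loss)

# Validation-loss increase when the CNN encoder is removed.
ablation_increases = [relative_change(dit_loss, x) for x in no_cnn_losses]

# Fraction of the DRAKES signal captured, from the quoted 3.86 / 5.6.
drakes_fraction = 3.86 / 5.6

# Fold change implied by the approximate K562 example (~0.05 to ~4.76).
k562_fold_change = 4.76 / 0.05

print(f"loss reduction vs U-Net: {loss_reduction:.1%}")
print("ablation increase:", ", ".join(f"{x:.1%}" for x in ablation_increases))
print(f"DRAKES fraction: {drakes_fraction:.1%}")
print(f"K562 fold change: {k562_fold_change:.1f}x")
```

With these rounded inputs the loss reduction comes out near 38% and the ablation increase between about 65% and 70%, close to the quoted 39% and 70%. The approximate K562 example implies a fold change of about 95, which is larger than the 38× headline figure. The summary does not state how the 38× value is aggregated.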

Research gap: Existing diffusion models for regulatory DNA design (e.g., DNA-Diffusion) rely on U-Net backbones with fixed receptive fields, which struggle to model long-distance DNA interactions and exhibit higher memorization rates. Efficient, controllable generation of novel, functional sequences remains a bottleneck.

Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion (DaSilva et al., 2025) with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
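The abstract does not describe how a 200 bp sequence is presented to the 2D CNN input encoder. One common convention, assumed here purely for illustration, is a one-hot matrix with one row per nucleotide and one column per position, which a 2D convolution can treat like a small single-channel image. The row order, the handling of ambiguous bases, and the helper names `one_hot` and `decode` are all choices made for this sketch, not details from the paper.

```python
BASES = "ACGT"  # row order is an arbitrary choice for this sketch


def one_hot(sequence):
    """Encode a DNA string as a 4 x L matrix (list of rows).

    Bases outside A/C/G/T (e.g. N) become all-zero columns; this is an
    assumption of the sketch, not something the source specifies.
    """
    seq = sequence.upper()
    return [[1 if base == row_base else 0 for base in seq] for row_base in BASES]


def decode(matrix):
    """Invert one_hot; all-zero columns are decoded as 'N'."""
    length = len(matrix[0])
    decoded = []
    for position in range(length):
        column = [row[position] for row in matrix]
        decoded.append(BASES[column.index(1)] if 1 in column else "N")
    return "".join(decoded)


# A hypothetical 200 bp sequence built by repeating a short unit.
example = "ACGTTGCA" * 25
encoded = one_hot(example)
print(len(encoded), len(encoded[0]))
```

For a 200 bp input this yields a 4 × 200 matrix. A continuous diffusion model would add noise to a real-valued version of such a matrix, though the exact scaling used by the authors is not given in this summary.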