Paper List
-
Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era
This paper addresses the core pain point of efficiently extracting standardized, comparable features from massive (terabyte to petabyte-scale) biomedi...
-
Topological Enhancement of Protein Kinetic Stability
This work addresses the long-standing puzzle of why knotted proteins exist by demonstrating that deep knots provide a functional advantage through enh...
-
A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization
This paper addresses the critical limitation of existing TF binding prediction methods that treat transcription factors as independent entities, faili...
-
Social Distancing Equilibria in Games under Conventional SI Dynamics
This paper solves the core problem of proving the existence and uniqueness of Nash equilibria in finite-duration SI epidemic games, showing they are a...
-
Binding Free Energies without Alchemy
This paper addresses the core bottleneck of computational expense in Absolute Binding Free Energy calculations by eliminating the need for numerous al...
-
SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules
This paper addresses the core bottleneck in cryo-EM helical reconstruction: eliminating the dependency on accurate initial symmetry parameter estimati...
-
Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
This paper addresses the critical gap in evaluating AI-guided scientific selection strategies under realistic budget constraints, where existing metri...
-
Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration
This paper addresses the core challenge of accurately decomposing shared (joint) and dataset-specific (individual) sources of variation in multi-modal...
Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements
Department of Computer Science, Princeton University
30-Second Read
IN SHORT: This paper addresses the challenge of efficiently generating novel, cell-type-specific regulatory DNA sequences with high predicted activity while minimizing memorization of training data.
Key Innovations
- Methodology: Introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN input encoder for DNA sequence generation, achieving 60x faster convergence and 39% lower validation loss (0.023 vs. 0.037) compared to U-Net baselines.
- Methodology: Demonstrates a 38x improvement in predicted regulatory activity (Enformer scores) through DDPO finetuning using Enformer as a reward model, validated by cross-task generalization to DRAKES.
- Biology: Reduces sequence memorization, measured as the share of generated sequences that align to training data via BLAT, from 5.3% (U-Net) to 1.7% (DiT), while maintaining realistic motif usage (JS distance ~0.21-0.22). The reduction is attributed to the transformer's global attention mechanism.
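The motif-usage comparison above is reported as a Jensen-Shannon (JS) distance. The following is a minimal sketch of that metric, assuming it is computed between motif-frequency distributions of training and generated sequences with base-2 logarithms (so the distance lies in [0, 1]). The motif counts are hypothetical placeholders, and the paper's exact motif-scanning pipeline is not reproduced here.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two discrete distributions.

    p and q are sequences of non-negative counts or probabilities over the
    same categories (here: motifs). Both are normalized before comparison.
    """
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    # Mixture distribution used by the JS divergence.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The JS distance is the square root of the JS divergence.
    return math.sqrt(max(divergence, 0.0))

# Hypothetical motif counts (not taken from the paper).
train_counts = [120, 80, 40, 10]
generated_counts = [100, 90, 30, 30]
print(round(js_distance(train_counts, generated_counts), 3))
```

Identical distributions give a distance of 0 and distributions with no shared support give 1, so values around 0.21-0.22 indicate motif usage that is similar to, but not a copy of, the training data.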
Main Conclusions
- The CNN encoder is critical for DiT performance; its removal increases validation loss by 70% (from 0.023 to 0.038-0.039), regardless of positional embedding choice (RoPE or learned).
- DDPO finetuning boosts median predicted in-situ activity by 38x (e.g., from ~0.05 to ~4.76 in K562), with over 75% of generated sequences exceeding the baseline median across all cell types.
- Cross-validation against DRAKES shows the model captures 70% (3.86/5.6) of the independent predictor's signal, confirming generalization beyond the reward model (Enformer).
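The percentages quoted above can be checked against the rounded values given in the same bullets. This is a quick arithmetic sketch; the variable names are illustrative, and `reference_drakes` is an assumed label for the 5.6 figure, whose exact definition is not stated here. Because the inputs are rounded, the results can differ slightly from the reported percentages (for example, about 38% rather than 39% for the loss reduction), which presumably reflects unrounded losses in the paper.

```python
def relative_change(old, new):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old

# Rounded values quoted in the summary above.
unet_loss, dit_loss = 0.037, 0.023
no_cnn_low, no_cnn_high = 0.038, 0.039
model_drakes, reference_drakes = 3.86, 5.6  # reference_drakes: assumed label

loss_reduction = -relative_change(unet_loss, dit_loss)   # DiT vs. U-Net
ablation_low = relative_change(dit_loss, no_cnn_low)     # CNN encoder removed
ablation_high = relative_change(dit_loss, no_cnn_high)   # CNN encoder removed
signal_captured = model_drakes / reference_drakes        # DRAKES comparison

print(f"loss reduction:  {loss_reduction:.1%}")
print(f"ablation range:  {ablation_low:.1%} to {ablation_high:.1%}")
print(f"signal captured: {signal_captured:.1%}")
```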
Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion (DaSilva et al., 2025) with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
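To make the "2D CNN input encoder" concrete, here is a minimal sketch of one plausible input representation: a 200 bp sequence one-hot encoded as a 4 x 200 grid that a 2D convolution could treat like a single-channel image. This layout is an assumption for illustration; the abstract does not specify the exact tensor shape, channel order, or value scaling the authors use, and the helper names `one_hot` and `decode` are hypothetical.

```python
import random

ALPHABET = "ACGT"  # assumed row order: A, C, G, T

def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) matrix, one row per nucleotide."""
    return [[1.0 if base == letter else 0.0 for base in seq] for letter in ALPHABET]

def decode(matrix):
    """Map each column back to the nucleotide whose row holds the largest value."""
    length = len(matrix[0])
    return "".join(
        ALPHABET[max(range(4), key=lambda row: matrix[row][col])]
        for col in range(length)
    )

random.seed(0)
seq = "".join(random.choice(ALPHABET) for _ in range(200))  # toy 200 bp sequence
grid = one_hot(seq)
print(len(grid), len(grid[0]))
```

Decoding by per-column argmax is also how a continuous denoised output could be turned back into a discrete sequence, since a continuous diffusion model produces real-valued grids rather than exact one-hot columns.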