Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements
Department of Computer Science, Princeton University
30-Second Summary
IN SHORT: This paper addresses the challenge of efficiently generating novel, cell-type-specific regulatory DNA sequences with high predicted activity while minimizing memorization of training data.
Key Innovations
- Methodology: Introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN input encoder for DNA sequence generation. It matches the U-Net baseline's best validation loss in 13 epochs (60× fewer) and converges to a 39% lower validation loss (0.023 vs. 0.037).
- Methodology: Demonstrates a 38× improvement in predicted regulatory activity (Enformer scores) through DDPO finetuning with Enformer as the reward model, and checks the gain against DRAKES on an independent prediction task.
- Biology: Reduces sequence memorization, measured as the share of generated sequences that align to training data via BLAT, from 5.3% (U-Net) to 1.7% (DiT) while maintaining realistic motif usage (JS distance ~0.21-0.22), an effect attributed to the transformer's global attention mechanism.
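The motif-usage comparison above is reported as a Jensen-Shannon (JS) distance. As a rough illustration of what that number means, the sketch below computes the JS distance between two motif-frequency distributions. The motif names and counts are made up for the example, and the log base (2, which bounds the distance in [0, 1]) is an assumption, since the summary does not say how the paper computed it.

```python
import math

def normalize(counts, keys):
    """Turn raw motif hit counts into a probability vector over `keys`."""
    total = sum(counts.get(k, 0) for k in keys)
    if total == 0:
        raise ValueError("no motif occurrences to normalize")
    return [counts.get(k, 0) / total for k in keys]

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the JS divergence (log base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding

# Hypothetical motif hit counts (illustrative only, not from the paper).
train_counts = {"GATA1": 40, "SP1": 30, "CTCF": 30}
generated_counts = {"GATA1": 35, "SP1": 45, "CTCF": 20}

motifs = sorted(set(train_counts) | set(generated_counts))
d = js_distance(normalize(train_counts, motifs),
                normalize(generated_counts, motifs))
print(f"JS distance: {d:.3f}")
```

A distance of 0 means the two motif distributions are identical and 1 means they do not overlap at all, so under this convention the reported ~0.21-0.22 would sit toward the similar end of the scale.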
Main Conclusions
- The CNN encoder is critical for DiT performance: removing it raises validation loss by about 70% (from 0.023 to 0.038-0.039) with either RoPE or learned positional embeddings.
- DDPO finetuning boosts median predicted in-situ activity by 38× (e.g., from ~0.05 to ~4.76 in K562), with over 75% of generated sequences exceeding the baseline median across all cell types.
- Cross-validation against DRAKES shows the model captures about 70% (3.86/5.6) of the independent predictor's signal, indicating that the gains generalize beyond the reward model (Enformer).
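The percentages in these conclusions are simple ratios of the quoted figures. The short sketch below recomputes them from the rounded values given here; the helper name `relative_change` is just for illustration. Because the inputs are rounded, the results can differ from the reported percentages by a point or two (for example, the rounded losses give roughly 38% rather than the reported 39%, presumably because the paper used unrounded values).

```python
def relative_change(old, new):
    """Fractional change from `old` to `new` (negative means a decrease)."""
    return (new - old) / old

# Validation loss: U-Net baseline vs. DiT with CNN encoder (rounded figures).
unet_loss, dit_loss = 0.037, 0.023
print(f"DiT vs. U-Net: {relative_change(unet_loss, dit_loss):+.1%}")      # -37.8%

# Ablation: DiT without the CNN encoder lands at 0.038-0.039.
for ablated_loss in (0.038, 0.039):
    print(f"No CNN encoder ({ablated_loss}): "
          f"{relative_change(dit_loss, ablated_loss):+.1%}")              # +65.2%, +69.6%

# DRAKES cross-check: share of the independent predictor's signal captured.
captured = 3.86 / 5.6
print(f"DRAKES signal captured: {captured:.1%}")                          # 68.9%
```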
Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion (DaSilva et al., 2025) with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
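The memorization figure in the abstract is the fraction of generated sequences that align to training data. The paper uses BLAT for the alignment step; the sketch below swaps in a much cruder stand-in (an exact shared k-mer of length `k`) purely to show how such a rate is tallied. The function names, the toy sequences, and the choice of `k` are all assumptions made for illustration, and this check will not reproduce BLAT's behavior on mismatches or gaps.

```python
def kmers(seq, k):
    """All length-k substrings of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def memorization_rate(generated, training, k=20):
    """Fraction of generated sequences sharing an exact k-mer with training data.

    Crude stand-in for an alignment tool such as BLAT: exact matches only.
    """
    if not generated:
        raise ValueError("no generated sequences to score")
    train_index = set()
    for seq in training:
        train_index |= kmers(seq, k)
    hits = sum(1 for seq in generated if kmers(seq, k) & train_index)
    return hits / len(generated)

# Toy sequences (hypothetical; real inputs would be 200 bp each).
training = ["ACGTTGCAAGCTTAGGCATC", "TTGACCGTAAGGCTCTAGAA"]
generated = [
    "GGGGACGTTGCAAGCTCCCC",  # copies 12 bp from the first training sequence
    "ATATATATATATATATATAT",
    "CCCCCCCCGGGGGGGGAAAA",
]
rate = memorization_rate(generated, training, k=10)
print(f"memorized: {rate:.1%}")
```

In this toy run one of the three generated sequences shares a 10-mer with the training set, so the rate is one third. With real data the same tally would be taken over BLAT hits instead.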