Journal: ArXiv Preprint

Publication date: 2026-03-11

Bioinformatics, Deep Learning

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Department of Computer Science, Princeton University

Jonathan Liu, Kia Ghods

30-Second Summary

IN SHORT: This paper addresses the challenge of efficiently generating novel, cell-type-specific regulatory DNA sequences with high predicted activity while minimizing memorization of training data.

Key Innovations

Methodology: Introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN input encoder for DNA sequence generation. It matches the U-Net baseline's best validation loss in 60x fewer epochs and converges to a 39% lower validation loss (0.023 vs. 0.037).
Methodology: Demonstrates a 38x improvement in predicted regulatory activity (Enformer scores) through DDPO finetuning with Enformer as the reward model, validated by cross-task generalization to DRAKES.
Biology: Reduces sequence memorization, measured by BLAT alignment of generated sequences to the training data, from 5.3% (U-Net) to 1.7% (DiT) while maintaining realistic motif usage (JS distance ~0.21-0.22), an effect attributed to the transformer's global attention mechanism.
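
The JS distance quoted for motif usage can be illustrated with a minimal sketch. It assumes the metric is the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence) between motif frequency distributions in generated and real sequences; the summary does not state the logarithm base or how motifs were counted, so base 2 is an assumption here, and the motif names and counts are hypothetical placeholders, not data from the paper.

```python
import math

def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance (base-2 logs, range 0 to 1) between two count dicts."""
    keys = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    # Normalize counts into probability distributions over the shared key set.
    p = [p_counts.get(k, 0) / p_total for k in keys]
    q = [q_counts.get(k, 0) / q_total for k in keys]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))

# Hypothetical motif counts, for illustration only.
generated = {"GATA1": 40, "SP1": 25, "CTCF": 35}
training = {"GATA1": 30, "SP1": 30, "CTCF": 40}
print(round(js_distance(generated, training), 3))
```

With base-2 logarithms, 0 means identical motif usage and 1 means completely disjoint usage, so values near 0.21-0.22 would indicate broadly similar but not identical distributions.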

Main Conclusions

The CNN encoder is critical for DiT performance; its removal increases validation loss by 70% (from 0.023 to 0.038-0.039), regardless of positional embedding choice (RoPE or learned).
DDPO finetuning boosts median predicted in-situ activity by 38x (e.g., from ~0.05 to ~4.76 in K562), with over 75% of generated sequences exceeding the baseline median across all cell types.
Cross-validation against DRAKES shows the model captures 70% (3.86/5.6) of the independent predictor's signal, confirming generalization beyond the reward model (Enformer).
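
The percentages above can be checked against the rounded values quoted alongside them. The sketch below only redoes that arithmetic; the helper name `relative_change` is introduced here for illustration. Because the inputs are rounded, the results land near the reported figures rather than exactly on them (roughly 38% versus the reported 39%, 65 to 70% for the ablation, and about 69% for DRAKES), which would be consistent with the reported numbers having been computed from unrounded values, though the summary does not say so.

```python
def relative_change(new, old):
    """Fractional change from old to new (negative means a decrease)."""
    return (new - old) / old

# Validation loss: DiT (0.023) versus U-Net baseline (0.037).
print(f"DiT vs U-Net loss: {relative_change(0.023, 0.037):+.1%}")

# Ablation: removing the CNN encoder raises the loss to 0.038-0.039.
for ablated in (0.038, 0.039):
    print(f"Without CNN encoder ({ablated}): {relative_change(ablated, 0.023):+.1%}")

# DRAKES cross-check: the fraction quoted as 3.86/5.6.
print(f"DRAKES fraction: {3.86 / 5.6:.1%}")
```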

Research gap: Existing diffusion models for regulatory DNA design (e.g., DNA-Diffusion) rely on U-Net backbones with fixed receptive fields, which struggle to model long-distance DNA interactions and exhibit higher memorization rates. Efficient, controllable generation of novel, functional sequences remains a bottleneck.

Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion (DaSilva et al., 2025) with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
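
To make the idea of continuous diffusion over DNA concrete, here is a minimal sketch of how a 200 bp sequence could be turned into a continuous array and noised. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the encoding or noise schedule, so the plain one-hot layout, the standard DDPM forward step, and the `alpha_bar` value are all assumptions, and the function names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def add_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in x0]

def decode(x):
    """Map each position back to the base with the largest value."""
    return "".join(BASES[max(range(4), key=lambda i: row[i])] for row in x)

rng = random.Random(0)
# Random stand-in for a 200 bp regulatory element (not real data).
seq = "".join(rng.choice(BASES) for _ in range(200))
x0 = one_hot(seq)
xt = add_noise(x0, alpha_bar=0.5, rng=rng)  # assumed noise level, for illustration
print(len(x0), len(x0[0]))
print(decode(x0) == seq)
```

In a setup like this, a denoiser such as the DiT described above would be trained to recover the clean signal from `xt`, and generated arrays would be mapped back to bases with a step like `decode`.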
