Paper List

Health Informatics

An AI Implementation Science Study to Improve Trustworthy Data in a Large Healthcare System

2025-12-01

This paper addresses the critical gap between theoretical AI research and real-world clinical implementation by providing a practical framework for as...
Bioinformatics

The BEAT-CF Causal Model: A model for guiding the design of trials and observational analyses of cystic fibrosis exacerbations

2025-12

This paper addresses the critical gap in cystic fibrosis exacerbation management by providing a formal causal framework that integrates expert knowled...
Bioinformatics

Hierarchical Molecular Language Models (HMLMs)

2025-11-30

This paper addresses the core challenge of accurately modeling context-dependent signaling, pathway cross-talk, and temporal dynamics across multiple ...
Computational Neuroscience

Stability analysis of action potential generation using Markov models of voltage‑gated sodium channel isoforms

2025-11-30

This work addresses the challenge of systematically characterizing how the high-dimensional parameter space of Markov models for different sodium chan...
Network Science

Approximate Bayesian Inference on Mechanisms of Network Growth and Evolution

2025-11-30

This paper addresses the core challenge of inferring the relative contributions of multiple, simultaneous generative mechanisms in network formation w...
Bioinformatics

EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants

2025-11-29

This paper addresses the core challenge of jointly predicting enzyme kinetic parameters (Kcat and Km) by modeling dynamic enzyme-substrate interaction...
Biophysics

Tissue stress measurements with Bayesian Inversion Stress Microscopy

2025-11-29

This paper addresses the core challenge of measuring absolute, tissue-scale mechanical stress without making assumptions about tissue rheology, which ...
Bioinformatics

DeepFRI Demystified: Interpretability vs. Accuracy in AI Protein Function Prediction

2025-11-29

This study addresses the critical gap between high predictive accuracy and biological interpretability in DeepFRI, revealing that the model often prio...


Journal: arXiv Preprint

Publication date: 2026-03-11

Bioinformatics, Deep Learning

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Department of Computer Science, Princeton University

Jonathan Liu, Kia Ghods

30-Second Summary

IN SHORT: This paper addresses the challenge of efficiently generating novel, cell-type-specific regulatory DNA sequences with high predicted activity while minimizing memorization of training data.

Key Innovations

Methodology: Introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN input encoder for DNA sequence generation, which converges 60x faster and reaches a 39% lower validation loss (0.023 vs. 0.037) than the U-Net baseline.
Methodology: Achieves a 38x improvement in predicted regulatory activity (Enformer scores) through DDPO finetuning with Enformer as the reward model, and validates the gain by cross-task generalization to DRAKES.
Biology: Reduces sequence memorization, measured by BLAT alignment to the training data, from 5.3% (U-Net) to 1.7% (DiT) while maintaining realistic motif usage (JS distance ~0.21-0.22); the reduction is attributed to the transformer's global attention mechanism.
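The "JS distance" quoted above for motif usage is the Jensen-Shannon distance between two frequency distributions. As a minimal sketch of how such a number could be computed (the motif names and counts below are hypothetical placeholders, not values from the paper, and the paper's exact motif-scanning pipeline is not described here):

```python
import math

def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance: square root of the base-2 JS divergence.

    Inputs are dicts mapping a motif name to a non-negative count.
    The result lies in [0, 1]; 0 means identical usage profiles.
    """
    keys = set(p_counts) | set(q_counts)

    def normalize(counts):
        total = sum(counts.values())
        return {k: counts.get(k, 0) / total for k in keys}

    p, q = normalize(p_counts), normalize(q_counts)
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}  # mixture distribution

    def kl(a, b):
        # Terms with a[k] == 0 contribute nothing to the KL divergence.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))

# Hypothetical motif hit counts in training vs. generated sequences.
train_motifs = {"GATA1": 40, "SP1": 35, "CTCF": 25}
generated_motifs = {"GATA1": 30, "SP1": 30, "CTCF": 40}
print(round(js_distance(train_motifs, generated_motifs), 3))
```

With base-2 logarithms the distance is bounded by 1, so values around 0.21-0.22 indicate motif usage that is similar to, but not identical with, the reference distribution.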

Main Conclusions

The CNN encoder is critical for DiT performance: removing it raises validation loss by about 70% (from 0.023 to 0.038-0.039), regardless of positional embedding choice (RoPE or learned).
DDPO finetuning boosts median predicted in-situ activity by 38x (e.g., from ~0.05 to ~4.76 in K562), with over 75% of generated sequences exceeding the baseline median across all cell types.
Cross-validation against DRAKES shows the model captures 70% (3.86/5.6) of the independent predictor's signal, confirming generalization beyond the reward model (Enformer).
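The percentages above can be checked from the rounded figures quoted in this summary. The sketch below only redoes that arithmetic; the helper name is illustrative. Because the inputs are rounded, the results can differ slightly from the reported percentages (for example, 0.023 vs. 0.037 works out to roughly 38% rather than 39%).

```python
def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old

# DiT vs. U-Net validation loss (rounded values from the summary).
dit_loss, unet_loss = 0.023, 0.037
print(f"DiT vs. U-Net: {relative_change(dit_loss, unet_loss):+.1%}")

# Ablation: DiT without the CNN encoder (reported range 0.038-0.039).
for ablated_loss in (0.038, 0.039):
    print(f"No CNN encoder ({ablated_loss}): "
          f"{relative_change(ablated_loss, dit_loss):+.1%}")

# Share of the DRAKES predictor's signal captured (3.86 out of 5.6).
print(f"DRAKES signal captured: {3.86 / 5.6:.1%}")
```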

Research gap: Existing diffusion models for regulatory DNA design (e.g., DNA-Diffusion) rely on U-Net backbones with fixed receptive fields, which struggle to model long-distance DNA interactions and exhibit higher memorization rates. Efficient, controllable generation of novel, functional sequences remains a bottleneck.
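To illustrate what a "fixed receptive field" means in practice, the sketch below computes how many input positions a single output position of a stacked 1D convolution can see, using the standard recurrence for kernel size, stride, and dilation. The layer configuration is hypothetical and is not the DNA-Diffusion U-Net; it only shows that a shallow convolutional stack can cover far fewer than 200 bp, whereas a self-attention layer lets every position attend to all others.

```python
def receptive_field(layers):
    """Receptive field (in input positions) of stacked 1D convolutions.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input side to the output side.
    """
    rf, jump = 1, 1  # jump = spacing between adjacent outputs, in input units
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Hypothetical encoder path: 3-wide kernels with two stride-2 downsamplings.
hypothetical_stack = [(3, 1, 1), (3, 2, 1), (3, 1, 1), (3, 2, 1), (3, 1, 1)]
sequence_length = 200  # bp, as in the paper's generated sequences

rf = receptive_field(hypothetical_stack)
print(f"Convolutional receptive field: {rf} of {sequence_length} positions")
print(f"Self-attention span per layer: {sequence_length} positions")
```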

Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion (DaSilva et al., 2025) with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
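The memorization figure in the abstract is the fraction of generated sequences that align to training data. The paper uses BLAT for the alignment step; the sketch below swaps in a much cruder stand-in (a shared exact substring of at least `min_match` bases) purely to show how such a rate is tallied. The function name, the threshold, and the toy sequences are all assumptions, not the authors' settings.

```python
def memorization_rate(generated, training, min_match=20):
    """Fraction of generated sequences sharing an exact substring of
    length >= min_match with any training sequence.

    This is a simplified stand-in for an alignment tool such as BLAT,
    which also tolerates mismatches and gaps.
    """
    if not generated:
        return 0.0

    # Index every min_match-long substring of the training set.
    train_kmers = {
        seq[i:i + min_match]
        for seq in training
        for i in range(len(seq) - min_match + 1)
    }

    def is_memorized(seq):
        return any(
            seq[i:i + min_match] in train_kmers
            for i in range(len(seq) - min_match + 1)
        )

    return sum(is_memorized(seq) for seq in generated) / len(generated)

# Toy data: the first generated sequence copies 23 bases from training.
training_set = ["ACGTACGTAC" * 3]
generated_set = [training_set[0][5:28] + "TTTT", "G" * 30]
print(f"Memorization rate: {memorization_rate(generated_set, training_set):.1%}")
```

An exact-substring criterion like this will generally flag fewer sequences than a gapped aligner, so its output is not directly comparable to the 5.3% and 1.7% figures.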