Paper List
-
Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study
This paper addresses the critical challenge of numerical ill-conditioning and multicollinearity in library-based sparse regression methods (e.g., SIND...
-
Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry
This paper addresses the computational bottleneck in voxel-based neuroimaging analysis by providing a method that delivers exact cluster-size retrieva...
-
abx_amr_simulator: A simulation environment for antibiotic prescribing policy optimization under antimicrobial resistance
This paper addresses the critical challenge of quantitatively evaluating antibiotic prescribing policies under realistic uncertainty and partial obser...
-
PesTwin: a biology-informed Digital Twin for enabling precision farming
This paper addresses the critical bottleneck in precision agriculture: the inability to accurately forecast pest outbreaks in real-time, leading to su...
-
Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation
This paper addresses the core challenge of generating physically plausible 3D molecular structures by bridging the gap between autoregressive methods ...
-
Omics Data Discovery Agents
This paper addresses the core challenge of making published omics data computationally reusable by automating the extraction, quantification, and inte...
-
Single-cell directional sensing at ultra-low chemoattractant concentrations from extreme first-passage events
This work addresses the core challenge of how a cell can rapidly and accurately determine the direction of a chemoattractant source when the signal is...
-
SDSR: A Spectral Divide-and-Conquer Approach for Species Tree Reconstruction
This paper addresses the computational bottleneck in reconstructing species trees from thousands of species and multiple genes by introducing a scalab...
-
How to make the most of your masked language model for protein engineering
BigHat Biosciences
30-Second Read
IN SHORT: This paper addresses the critical bottleneck of efficiently sampling high-quality, diverse protein sequences from Masked Language Models (MLMs) for practical antibody engineering, where traditional mutation-centric methods are computationally expensive and often produce dysfunctional variants.
Key Innovations
- Methodology Proposes a novel sequence-centric stochastic beam search (SBS) method that reframes generation as a search problem, leveraging MLMs' efficiency in evaluating the pseudo-log-likelihood (PLL) of all 1-edit neighbors of a sequence, achieving a 20× speedup over mutation-centric methods.
- Methodology Introduces a flexible, gradient-free multi-objective optimization (MOO) framework compatible with the SBS sampler, enabling guidance by arbitrary black-box scoring functions (e.g., binding affinity, humanness, stability) without requiring differentiability or partially-masked sequence inputs.
- Biology Provides the first extensive head-to-head in vitro evaluation of MLM sampling algorithms and models in real antibody therapeutic campaigns, revealing that the choice of sampling algorithm is at least as impactful as the choice of model itself.
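The search loop described above can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: a hand-written distance-to-target score stands in for the MLM pseudo-log-likelihood, a four-letter alphabet replaces the amino acids, and candidate selection uses the Gumbel-top-k perturbation that underlies stochastic beam search.

```python
import math
import random

ALPHABET = "ACDE"  # toy stand-in for the 20 amino acids


def score(seq, target="ACDE"):
    # Toy stand-in for an MLM pseudo-log-likelihood:
    # higher (closer to 0) when seq is closer to the target.
    return -sum(a != b for a, b in zip(seq, target))


def one_edit_neighbors(seq):
    # Every sequence differing from seq at exactly one position.
    for i in range(len(seq)):
        for aa in ALPHABET:
            if aa != seq[i]:
                yield seq[:i] + aa + seq[i + 1:]


def stochastic_beam_search(start, beam_width=3, steps=4, temperature=1.0, seed=0):
    rng = random.Random(seed)
    beam = [start]
    for _ in range(steps):
        # Score the entire 1-edit neighborhood of every beam member.
        candidates = {n for s in beam for n in one_edit_neighbors(s)}
        # Gumbel-top-k trick: perturb each score with Gumbel noise,
        # then keep the k best; this samples beams without replacement.
        perturbed = [
            (score(c) / temperature - math.log(-math.log(rng.random())), c)
            for c in candidates
        ]
        perturbed.sort(reverse=True)
        beam = [c for _, c in perturbed[:beam_width]]
    return max(beam, key=score)
```

With a near-zero temperature the noise becomes negligible and the search reduces to greedy beam search, walking edit-by-edit toward the high-scoring sequence; higher temperatures trade score for diversity among the sampled variants.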
Main Conclusions
- The proposed stochastic beam search sampler significantly outperformed traditional Gibbs sampling in vitro, with AbLang2+SBS achieving higher success rates (e.g., a 100% success rate when combined with Smooth Tchebycheff Scalarization guidance).
- Model choice matters: ESM2-650M (trained on generic proteins) and AbLang2 (antibody-specific) performed best in silico and in vitro, while the sampling algorithm choice (SBS vs. Gibbs) had an equal or greater impact on outcome quality.
- Supervision is highly effective: Using a trained classifier for post-MLM ranking improved the success rate of AbLang2 outputs considerably, and MOO guidance (NDS/STS) during generation further enhanced performance and eliminated generation of very weak binders.
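Smooth Tchebycheff Scalarization, mentioned above as a guidance objective, collapses several objectives into a single scalar via a smoothed maximum. The function below is a minimal, hypothetical sketch of that idea (a log-sum-exp smoothing of the weighted Tchebycheff term); the argument names and defaults are illustrative, not taken from the paper.

```python
import math


def smooth_tchebycheff(objectives, weights, ideal, mu=0.1):
    # Smooth (log-sum-exp) approximation of the weighted Tchebycheff term:
    #   max_i w_i * (f_i - z_i)  ≈  mu * log(sum_i exp(w_i * (f_i - z_i) / mu))
    terms = [w * (f - z) / mu for f, w, z in zip(objectives, weights, ideal)]
    m = max(terms)  # subtract the max to stabilize the log-sum-exp
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))
```

As mu → 0 the value approaches the hard weighted-max term of classic Tchebycheff scalarization, so minimizing it drives down the worst-weighted objective, which is what makes it usable as a single black-box score inside a gradient-free sampler.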
Abstract: A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.