Paper List

Biophysics

A Unified Variational Principle for Branching Transport Networks: Wave Impedance, Viscous Flow, and Tissue Metabolism

2026-03-16

This paper solves the core problem of predicting the empirically observed branching exponent (α≈2.7) in mammalian arterial trees, which neither Murray...
Epidemiology

Household Bubbling Strategies for Epidemic Control and Social Connectivity

2026-03-16

This paper addresses the core challenge of designing household merging (social bubble) strategies that effectively control epidemic risk while maximiz...
Bioinformatics

Empowering Chemical Structures with Biological Insights for Scalable Phenotypic Virtual Screening

2026-03-16

This paper addresses the core challenge of bridging the gap between scalable chemical structure screening and biologically informative but resource-in...
Biophysics

A mechanical bifurcation constrains the evolution of cell sheet folding in the family Volvocaceae

2026-03-16

This paper addresses the core problem of why there is an evolutionary gap in species with intermediate cell numbers (e.g., 256 cells) in Volvocaceae, ...
Epidemiology

Bayesian Inference in Epidemic Modelling: A Beginner’s Guide Illustrated with the SIR Model

2026-03-16

This guide addresses the core challenge of estimating uncertain epidemiological parameters (like transmission and recovery rates) from noisy, real-wor...
Theoretical Biology

Geometric framework for biological evolution

2026-03-16

This paper addresses the fundamental challenge of developing a coordinate-independent, geometric description of evolutionary dynamics that bridges gen...
Mathematical Biology

A multiscale discrete-to-continuum framework for structured population models

2026-03-16

This paper addresses the core challenge of systematically deriving uniformly valid continuum approximations from discrete structured population models...
Bioinformatics

Whole slide and microscopy image analysis with QuPath and OMERO

2026-03-16

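Enables QuPath to analyze images stored on an OMERO server directly, without downloading the entire dataset, which overcomes the local storage limits of large-scale studies.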


Journal: arXiv preprint

Published: 2026-03-11

Bioinformatics, Computational Biology

How to make the most of your masked language model for protein engineering

BigHat Biosciences

Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott

30-Second Summary

IN SHORT: This paper addresses the critical bottleneck of efficiently sampling high-quality, diverse protein sequences from Masked Language Models (MLMs) for practical antibody engineering, where traditional mutation-centric methods are computationally expensive and often produce dysfunctional variants.

Key Innovations

Methodology: Proposes a novel sequence-centric stochastic beam search (SBS) method that reframes generation as a search problem, leveraging MLMs' efficiency in evaluating the pseudo-log-likelihood (PLL) of all 1-edit neighbors of a sequence, achieving a 20EL× speedup over mutation-centric methods.
Methodology: Introduces a flexible, gradient-free multi-objective optimization (MOO) framework compatible with the SBS sampler, enabling guidance by arbitrary black-box scoring functions (e.g., binding affinity, humanness, stability) without requiring differentiability or partially-masked sequence inputs.
Biology: Provides the first extensive head-to-head in vitro evaluation of MLM sampling algorithms and models in real antibody therapeutic campaigns, revealing that the choice of sampling algorithm is at least as impactful as the choice of model itself.
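
The sequence-centric search described above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_score` is a hypothetical stand-in for an MLM pseudo-log-likelihood, and the Gumbel-top-k trick is one common way to make a beam search stochastic, assumed here for illustration. All function names and parameters are invented for the example.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(seq, target="ACDE"):
    """Hypothetical stand-in for an MLM pseudo-log-likelihood (higher is better)."""
    return -float(sum(a != b for a, b in zip(seq, target)))


def one_edit_neighbors(seq):
    """All single-substitution variants of seq (19 per position)."""
    return [seq[:i] + aa + seq[i + 1:]
            for i in range(len(seq))
            for aa in ALPHABET if aa != seq[i]]


def gumbel_top_k(candidates, scores, k, temperature, rng):
    """Sample k candidates without replacement from softmax(score / temperature)."""
    keyed = []
    for cand, score in zip(candidates, scores):
        u = max(rng.random(), 1e-300)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        keyed.append((score / temperature + gumbel, cand))
    keyed.sort(reverse=True)
    return [cand for _, cand in keyed[:k]]


def stochastic_beam_search(start, score_fn, beam_width=4, n_steps=3,
                           temperature=1.0, seed=0):
    """Each step scores whole sequences in the pooled 1-edit neighborhood of the beam."""
    rng = random.Random(seed)
    beam = [start]
    for _ in range(n_steps):
        # Pool the neighbors of every beam member, keep the members themselves,
        # and sort so the result does not depend on set iteration order.
        pool = sorted({n for s in beam for n in one_edit_neighbors(s)} | set(beam))
        scores = [score_fn(s) for s in pool]
        beam = gumbel_top_k(pool, scores, beam_width, temperature, rng)
    return beam


if __name__ == "__main__":
    print(stochastic_beam_search("AAAA", toy_score))
```

Because `score_fn` only ever receives complete sequences, any black-box scorer can be swapped in for `toy_score` without needing gradients or masked inputs. Lowering `temperature` makes the search approach a deterministic beam search.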

Main Conclusions

The proposed stochastic beam search sampler significantly outperformed traditional Gibbs sampling in vitro, with AbLang2+SBS achieving higher success rates (e.g., a 100% success rate when combined with Smooth Tchebycheff Scalarization guidance).
Model choice matters: ESM2-650M (trained on generic proteins) and AbLang2 (antibody-specific) performed best in silico and in vitro, while the sampling algorithm choice (SBS vs. Gibbs) had an equal or greater impact on outcome quality.
Supervision is highly effective: Using a trained classifier for post-MLM ranking improved the success rate of AbLang2 outputs considerably, and MOO guidance (NDS/STS) during generation further enhanced performance and eliminated generation of very weak binders.
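
The two guidance schemes named above can be sketched as follows. The summary does not give formulas, so this example makes two assumptions: NDS is read as non-dominated sorting (only the first Pareto front is computed), and STS uses the commonly cited log-sum-exp form of the smooth Tchebycheff scalarization. All objectives are treated as quantities to minimize, and the function names, weights, and ideal point are illustrative.

```python
import math


def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Indices of the points on the first Pareto front (minimization)."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]


def smooth_tchebycheff(objectives, weights, ideal, mu=0.1):
    """mu * log(sum_i exp(w_i * (f_i - z_i) / mu)).

    A smooth upper bound on max_i w_i * (f_i - z_i); it tightens as mu -> 0.
    """
    terms = [w * (f - z) / mu for f, w, z in zip(objectives, weights, ideal)]
    m = max(terms)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))


if __name__ == "__main__":
    candidates = [(1, 5), (2, 2), (5, 1), (4, 4)]
    print(non_dominated(candidates))
    print(smooth_tchebycheff((2, 2), (0.5, 0.5), (1, 1)))
```

Both functions only need objective values for complete candidates, which is what makes them usable with black-box scorers inside a search loop.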

Research gap: Despite the proliferation of protein language models, there is a significant lack of systematic research and benchmarking on how to best sample from them to generate functional protein variants for practical optimization tasks. Existing mutation-centric sampling methods are computationally costly (O(EL³)), struggle with non-differentiable scoring functions, and often produce low-likelihood sequences.

Abstract: A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.
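
For readers unfamiliar with the pseudo-perplexity mentioned in the abstract, here is a minimal sketch using the standard definition: mask each position in turn, sum the log-probability of the true residue to get the pseudo-log-likelihood, then exponentiate the negative per-residue average. `toy_masked_probs` is a hypothetical stand-in for a real masked language model, and this naive version uses one model call per position rather than the efficient neighborhood evaluation the paper describes.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_masked_probs(seq, pos):
    """Hypothetical stand-in for an MLM: residue distribution at a masked position.

    Puts extra probability mass on the residues adjacent to the mask.
    """
    weights = {aa: 1.0 for aa in ALPHABET}
    for j in (pos - 1, pos + 1):
        if 0 <= j < len(seq):
            weights[seq[j]] += 5.0
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def pseudo_log_likelihood(seq, masked_probs=toy_masked_probs):
    """Sum over positions of log p(true residue | rest of the sequence)."""
    return sum(math.log(masked_probs(seq, i)[seq[i]]) for i in range(len(seq)))


def pseudo_perplexity(seq, masked_probs=toy_masked_probs):
    """exp(-PLL / L); lower means the model finds the sequence more plausible."""
    return math.exp(-pseudo_log_likelihood(seq, masked_probs) / len(seq))


if __name__ == "__main__":
    for s in ("AAAA", "ACDE"):
        print(s, pseudo_perplexity(s))
```

Under a uniform model every residue has probability 1/20, so the pseudo-perplexity equals the alphabet size, which is a useful sanity check when plugging in a real model.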