Journal: arXiv Preprint

Publication date: 2026-03-11

Bioinformatics, Computational Biology

How to make the most of your masked language model for protein engineering

BigHat Biosciences

Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott

30-Second Read

IN SHORT: This paper addresses the critical bottleneck of efficiently sampling high-quality, diverse protein sequences from Masked Language Models (MLMs) for practical antibody engineering, where traditional mutation-centric methods are computationally expensive and often produce dysfunctional variants.

Key Innovations

Methodology: Proposes a novel sequence-centric stochastic beam search (SBS) method that reframes generation as a search problem, leveraging MLMs' efficiency in evaluating the pseudo-log-likelihood (PLL) of all 1-edit neighbors of a sequence, achieving a 20EL× speedup over mutation-centric methods.
Methodology: Introduces a flexible, gradient-free multi-objective optimization (MOO) framework compatible with the SBS sampler, enabling guidance by arbitrary black-box scoring functions (e.g., binding affinity, humanness, stability) without requiring differentiability or partially-masked sequence inputs.
Biology: Provides the first extensive head-to-head in vitro evaluation of MLM sampling algorithms and models in real antibody therapeutic campaigns, revealing that the choice of sampling algorithm is at least as impactful as the choice of model itself.
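The sequence-centric search described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_score` is a hypothetical stand-in for an MLM pseudo-log-likelihood, the Gumbel-top-k trick is one common way to sample a beam without replacement (this summary does not state which variant the paper uses), and names such as `one_edit_neighbors`, `beam_width`, and `temperature` are illustrative assumptions.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_edit_neighbors(seq):
    """Yield every sequence that differs from `seq` by one substitution."""
    for i, old in enumerate(seq):
        for aa in ALPHABET:
            if aa != old:
                yield seq[:i] + aa + seq[i + 1:]


def toy_score(seq, target="ACDEFG"):
    """Hypothetical scorer standing in for an MLM pseudo-log-likelihood."""
    return float(sum(a == b for a, b in zip(seq, target)))


def stochastic_beam_search(seed, score_fn, n_edits, beam_width,
                           temperature=1.0, rng=None):
    """Keep `beam_width` sequences per round, sampled in proportion to score."""
    rng = rng or random.Random(0)
    beam = [seed]
    for _ in range(n_edits):
        # Pool the 1-edit neighborhoods of all beam members, dropping duplicates.
        candidates = sorted({n for s in beam for n in one_edit_neighbors(s)})
        keyed = []
        for cand in candidates:
            # Gumbel-top-k: add Gumbel noise to the scaled score, then take the
            # k largest keys. This samples k items without replacement.
            u = max(rng.random(), 1e-12)  # guard against log(0)
            gumbel = -math.log(-math.log(u))
            keyed.append((score_fn(cand) / temperature + gumbel, cand))
        keyed.sort(reverse=True)
        beam = [cand for _, cand in keyed[:beam_width]]
    return beam


if __name__ == "__main__":
    for s in stochastic_beam_search("AAAAAA", toy_score, n_edits=3, beam_width=4):
        print(s, toy_score(s))
```

Because the search only ever asks for a score of a complete sequence, any scorer with the same call shape can be swapped in for `toy_score`; the sketch does not reproduce the batching that makes MLM neighborhood evaluation efficient.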

Main Conclusions

The proposed stochastic beam search sampler significantly outperformed traditional Gibbs sampling in vitro, with AbLang2+SBS achieving higher success rates (e.g., a 100% success rate when combined with Smooth Tchebycheff Scalarization guidance).
Model choice matters: ESM2-650M (trained on generic proteins) and AbLang2 (antibody-specific) performed best in silico and in vitro, while the sampling algorithm choice (SBS vs. Gibbs) had an equal or greater impact on outcome quality.
Supervision is highly effective: Using a trained classifier for post-MLM ranking improved the success rate of AbLang2 outputs considerably, and MOO guidance (NDS/STS) during generation further enhanced performance and eliminated generation of very weak binders.
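The two guidance schemes named above can be sketched for black-box scores as follows. This is an illustration under assumptions, not the paper's code: NDS is taken to mean non-dominated sorting (only the first Pareto front is computed here), the Smooth Tchebycheff formula is the commonly cited log-sum-exp form, all objectives are treated as "higher is better", and the candidate scores, `ideal` point, `weights`, and `mu` are made-up values.

```python
import math


def smooth_tchebycheff(scores, ideal, weights, mu=0.1):
    """Smooth maximum of weighted gaps to the ideal point; lower is better."""
    terms = [w * (z - s) / mu for s, z, w in zip(scores, ideal, weights)]
    m = max(terms)  # subtract the max for a numerically stable log-sum-exp
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))


def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (affinity, humanness) scores for four candidate sequences.
candidates = {
    "a": (0.9, 0.2),
    "b": (0.6, 0.6),
    "c": (0.2, 0.9),
    "d": (0.5, 0.5),
}

if __name__ == "__main__":
    ideal, weights = (1.0, 1.0), (0.5, 0.5)
    ranked = sorted(
        candidates,
        key=lambda k: smooth_tchebycheff(candidates[k], ideal, weights),
    )
    print("scalarized ranking:", ranked)
    print("first Pareto front:", pareto_front(list(candidates.values())))
```

Neither function needs gradients or masked inputs: both consume only the final score vector of each complete candidate, which is what makes them compatible with a search-based sampler.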

Research gap: Despite the proliferation of protein language models, there is a significant lack of systematic research and benchmarking on how to best sample from them to generate functional protein variants for practical optimization tasks. Existing mutation-centric sampling methods are computationally costly (O(EL³)), struggle with non-differentiable scoring functions, and often produce low-likelihood sequences.

Abstract: A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.
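For readers unfamiliar with the quantities mentioned above, the standard definitions are given below as a reference. The symbols are assumptions of this note: $x$ is a sequence of length $L$, $x_{\setminus i}$ is the sequence with position $i$ masked, and $p_\theta$ is the MLM's predicted distribution. The summary does not state whether the paper uses exactly this normalization.

```latex
\mathrm{PLL}(x) = \sum_{i=1}^{L} \log p_\theta\left(x_i \mid x_{\setminus i}\right),
\qquad
\mathrm{PPPL}(x) = \exp\left(-\tfrac{1}{L}\,\mathrm{PLL}(x)\right)
```

A higher pseudo-log-likelihood (PLL) corresponds to a lower pseudo-perplexity (PPPL), so ranking neighbors by either quantity gives the same order for sequences of equal length.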