Paper List

Bioinformatics

GOPHER: Optimization-based Phenotype Randomization for Genome-Wide Association Studies with Differential Privacy

2025-12-03

This paper addresses the core challenge of balancing rigorous privacy protection with data utility when releasing full GWAS summary statistics, overco...
Computer Vision

Real-time Cricket Sorting By Sex: A low-cost embedded solution using YOLOv8 and Raspberry Pi

2025-12-03

This paper addresses the critical bottleneck in industrial insect farming: the lack of automated, real-time sex sorting systems for Acheta domesticus ...
Machine Learning

Training Dynamics of Learning 3D-Rotational Equivariance

2025-12-02

This work addresses the core dilemma of whether to use computationally expensive equivariant architectures or faster symmetry-agnostic models with dat...
Bioinformatics

Fast and Accurate Node-Age Estimation Under Fossil Calibration Uncertainty Using the Adjusted Pairwise Likelihood

2025-12-02

This paper addresses the dual challenge of computational inefficiency and sensitivity to fossil calibration errors in Bayesian divergence time estimat...
Bioinformatics

Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training

2025-12-02

This paper addresses the core challenge of accurately predicting protein fitness with only a handful of experimental observations, where data collecti...
Bioinformatics

scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing

2025-12-02

This paper addresses the critical gap of fragmented and non-standardized benchmarking in single-cell RNA-seq clustering, which hinders objective compa...
Bioinformatics

Simulation and inference methods for non-Markovian stochastic biochemical reaction networks

2025-12-02

This paper addresses the computational bottleneck of simulating and performing Bayesian inference for non-Markovian biochemical systems with history-d...
Bioinformatics

Assessment of Simulation-based Inference Methods for Stochastic Compartmental Models

2025-12-02

This paper addresses the core challenge of performing accurate Bayesian parameter inference for stochastic epidemic models when the likelihood functio...


Journal: arXiv Preprint

Published: 2026-03-11

Systems Biology | Scientific Machine Learning

Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study

Northwestern University | NSF-Simons National Institute for Theory and Mathematics in Biology

Yuxiang Feng, Niall M. Mangan, Manu Jayadharan

30-Second Read

IN SHORT: This paper addresses the critical challenge of numerical ill-conditioning and multicollinearity in library-based sparse regression methods (e.g., SINDy), which leads to unstable and inaccurate recovery of governing equations from biological time-series data.

Key Innovations

Methodology: Quantitatively demonstrates that severe ill-conditioning (condition numbers up to 10^18) arises even with simple 2-3 term combinations in polynomial libraries, fundamentally limiting sparse identification methods.
Methodology: Shows that orthogonal polynomial bases (e.g., Legendre, Chebyshev) fail to improve conditioning when data distributions deviate from their theoretical weight functions, sometimes performing worse than monomials.
Methodology: Proposes and validates that aligning the data sampling distribution with the orthogonal basis's weight function can mitigate ill-conditioning and improve model recovery accuracy.

Main Conclusions

Ill-conditioning is pervasive in polynomial libraries for biological systems: condition numbers reach O(10^5) for Lotka-Volterra and O(10^18) for chemical reaction network models, leading to systematic model misidentification.
Orthogonal polynomial bases are not a universal solution; they can worsen conditioning when data distributions (e.g., from constrained biological trajectories) deviate from the basis's required weight function.
Distribution-aligned sampling is a key enabler: when data are sampled according to the orthogonal basis's weight function, conditioning improves significantly, enabling more accurate equation recovery.
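
The interplay between basis choice and sampling distribution can be illustrated with a small numerical sketch. It is not the paper's experiment: the one-dimensional state, the polynomial degree, the sample size, and the two sampling intervals are all assumptions chosen for illustration. It compares the condition number of a monomial library and a Legendre library when samples are uniform on [-1, 1] (the Legendre weight function) and when they are confined to a narrow interval, as a constrained trajectory might be.

```python
import numpy as np
from numpy.polynomial import legendre


def monomial_library(x, degree):
    # Columns 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)


def legendre_library(x, degree):
    # Columns P_0(x), P_1(x), ..., P_degree(x)
    return legendre.legvander(x, degree)


def normalized_cond(theta):
    # Scale each column to unit 2-norm so the comparison reflects
    # collinearity between columns rather than differences in scale.
    theta = theta / np.linalg.norm(theta, axis=0)
    return np.linalg.cond(theta)


rng = np.random.default_rng(0)
degree = 5      # assumed library degree
n = 2000        # assumed number of samples

# Samples aligned with the Legendre weight function (uniform on [-1, 1]).
x_aligned = rng.uniform(-1.0, 1.0, n)
# Samples confined to a narrow sub-interval (assumed, for illustration).
x_restricted = rng.uniform(0.8, 1.0, n)

cond_mono_aligned = normalized_cond(monomial_library(x_aligned, degree))
cond_leg_aligned = normalized_cond(legendre_library(x_aligned, degree))
cond_mono_restricted = normalized_cond(monomial_library(x_restricted, degree))
cond_leg_restricted = normalized_cond(legendre_library(x_restricted, degree))

print(f"aligned sampling:    monomial {cond_mono_aligned:.3g}, Legendre {cond_leg_aligned:.3g}")
print(f"restricted sampling: monomial {cond_mono_restricted:.3g}, Legendre {cond_leg_restricted:.3g}")
```

Under aligned sampling the Legendre columns are close to orthogonal, so their condition number stays near one. Under restricted sampling both libraries become strongly collinear, and the orthogonal basis no longer offers a guaranteed advantage.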

Research Gap: While identifiability issues (e.g., sloppy parameters) are well-studied in systems biology, a systematic analysis of numerical ill-conditioning and its impact on data-driven model discovery using sparse regression libraries has been lacking, particularly regarding the interplay between basis choice and data distribution.

Abstract: Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems. Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models.

Relevance to Life Sciences: Numerical ill-conditioning is especially consequential in the model discovery for biological systems, where nonlinear interactions are often represented using nonlinear functions such as polynomials, and where multiscale dynamics, constrained state trajectories, and limited sampling due to experimental limitations can further amplify multicollinearity. We demonstrate these effects across benchmark models relevant to metabolic networks, regulatory networks, and population dynamics. Our results show that poor conditioning can impair the recovery of biologically meaningful governing equations, while sampling strategies matched to the candidate basis can improve identification accuracy. These results imply that a broader range of dynamic sampling is needed in most biological experiments to produce data sets that are suitable for data-driven model discovery with current methods.

Mathematical Content: This paper studies sparse regression-based equation discovery in the presence of multicollinearity and numerical ill-conditioning. We analyze the conditioning of candidate libraries, especially monomial and orthogonal polynomial bases, using condition numbers and model recovery under realistic sampling conditions with publicly available experimental data. We compare how basis choice and sampling distribution affect regression stability, sparsity, and the accuracy of recovered dynamical models.
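
The sensitivity of library regression to measurement noise can also be sketched numerically. The example below is an assumption-laden toy, not the paper's benchmark: it uses logistic growth, dx/dt = x - x^2, a cubic monomial library, plain least squares in place of a sparse regression method, and Gaussian noise added directly to the derivative. It compares the coefficient error when the state is sampled over a broad range with the error when the state stays close to the carrying capacity.

```python
import numpy as np

# Assumed ground truth: dx/dt = x - x^2, written in the library [1, x, x^2, x^3].
TRUE_COEF = np.array([0.0, 1.0, -1.0, 0.0])


def coefficient_error(x_low, x_high, noise_std, rng, n=200):
    # Sample the state uniformly in [x_low, x_high] (assumed sampling scheme).
    x = rng.uniform(x_low, x_high, n)
    # Cubic monomial library evaluated on the samples.
    theta = np.vander(x, TRUE_COEF.size, increasing=True)
    # Noisy derivative measurements.
    dxdt = theta @ TRUE_COEF + noise_std * rng.standard_normal(n)
    # Ordinary least squares; a sparse method would add thresholding or a penalty.
    coef, *_ = np.linalg.lstsq(theta, dxdt, rcond=None)
    return np.linalg.norm(coef - TRUE_COEF)


rng = np.random.default_rng(1)
trials = 50
noise_std = 1e-3

# Median error over repeated noise realizations for each sampling range.
err_broad = np.median(
    [coefficient_error(0.05, 1.0, noise_std, rng) for _ in range(trials)]
)
err_restricted = np.median(
    [coefficient_error(0.9, 1.0, noise_std, rng) for _ in range(trials)]
)

print(f"median coefficient error, broad sampling:      {err_broad:.3g}")
print(f"median coefficient error, restricted sampling: {err_restricted:.3g}")
```

With the same noise level, the restricted range yields much larger coefficient errors, because the library columns are nearly collinear there and small perturbations in the data are amplified in the fit.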