Paper List
-
A Unified Variational Principle for Branching Transport Networks: Wave Impedance, Viscous Flow, and Tissue Metabolism
This paper solves the core problem of predicting the empirically observed branching exponent (α≈2.7) in mammalian arterial trees, which neither Murray...
-
Household Bubbling Strategies for Epidemic Control and Social Connectivity
This paper addresses the core challenge of designing household merging (social bubble) strategies that effectively control epidemic risk while maximiz...
-
Empowering Chemical Structures with Biological Insights for Scalable Phenotypic Virtual Screening
This paper addresses the core challenge of bridging the gap between scalable chemical structure screening and biologically informative but resource-in...
-
A mechanical bifurcation constrains the evolution of cell sheet folding in the family Volvocaceae
This paper addresses the core problem of why there is an evolutionary gap in species with intermediate cell numbers (e.g., 256 cells) in Volvocaceae, ...
-
Bayesian Inference in Epidemic Modelling: A Beginner’s Guide Illustrated with the SIR Model
This guide addresses the core challenge of estimating uncertain epidemiological parameters (like transmission and recovery rates) from noisy, real-wor...
-
Geometric framework for biological evolution
This paper addresses the fundamental challenge of developing a coordinate-independent, geometric description of evolutionary dynamics that bridges gen...
-
A multiscale discrete-to-continuum framework for structured population models
This paper addresses the core challenge of systematically deriving uniformly valid continuum approximations from discrete structured population models...
-
Whole slide and microscopy image analysis with QuPath and OMERO
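Enables QuPath to analyze images stored on OMERO servers directly, without downloading entire datasets, overcoming the local storage limits of large-scale studies.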
Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study
Northwestern University | NSF-Simons National Institute for Theory and Mathematics in Biology
30-Second Summary
IN SHORT: This paper addresses the critical challenge of numerical ill-conditioning and multicollinearity in library-based sparse regression methods (e.g., SINDy), which leads to unstable and inaccurate recovery of governing equations from biological time-series data.
Key Innovations
- Methodology: Quantitatively demonstrates that severe ill-conditioning (condition numbers up to 10^18) arises even with simple 2-3 term combinations in polynomial libraries, fundamentally limiting sparse identification methods.
- Methodology: Shows that orthogonal polynomial bases (e.g., Legendre, Chebyshev) fail to improve conditioning when data distributions deviate from their theoretical weight functions, sometimes performing worse than monomials.
- Methodology: Proposes and validates that aligning the data sampling distribution with the orthogonal basis's weight function can mitigate ill-conditioning and improve model recovery accuracy.
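As a rough illustration of the first point, the sketch below computes condition numbers for a small monomial library evaluated on a narrowly confined state variable. The interval [1.0, 1.2], the sample count, and the column normalization are assumptions chosen for illustration; they are not taken from the paper, and the helper name `column_normalized_cond` is hypothetical.

```python
import numpy as np
from itertools import combinations


def column_normalized_cond(theta):
    """2-norm condition number after scaling each column to unit length."""
    theta = theta / np.linalg.norm(theta, axis=0)
    return np.linalg.cond(theta)


rng = np.random.default_rng(0)

# Assumed constrained trajectory: the state only visits [1.0, 1.2].
x = rng.uniform(1.0, 1.2, size=200)

# Monomial candidate library with columns 1, x, x^2, x^3.
names = ["1", "x", "x^2", "x^3"]
library = np.vander(x, N=4, increasing=True)

# Condition number of every two-term combination of candidate functions.
pair_conds = {
    (names[i], names[j]): column_normalized_cond(library[:, [i, j]])
    for i, j in combinations(range(len(names)), 2)
}

# Condition number of the full four-term library.
full_cond = column_normalized_cond(library)

for pair, value in pair_conds.items():
    print(pair, f"{value:.3e}")
print("full library", f"{full_cond:.3e}")
```

Because the columns are nearly parallel on such a narrow range, even two-term subsets come out far from orthogonal, and the full library is worse still. The magnitudes here are much smaller than the values reported for the paper's benchmark models, which involve larger libraries and different data.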
Main Conclusions
- Ill-conditioning is pervasive in polynomial libraries for biological systems: condition numbers reach O(10^5) for Lotka-Volterra and O(10^18) for chemical reaction network models, leading to systematic model misidentification.
- Orthogonal polynomial bases are not a universal solution; they can worsen conditioning when data distributions (e.g., from constrained biological trajectories) deviate from the basis's required weight function.
- Distribution-aligned sampling is a key enabler: when data are sampled according to the orthogonal basis's weight function, conditioning improves significantly, enabling more accurate equation recovery.
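The effect of distribution-aligned sampling can be sketched with a Legendre basis, whose weight function is uniform on [-1, 1]. The example compares samples drawn from that uniform distribution with samples clustered in a narrow sub-interval. The degree, sample sizes, and the sub-interval [0.6, 0.9] are illustrative assumptions, not the paper's experimental setup, and `legendre_cond` is a hypothetical helper.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_cond(samples, degree):
    """Condition number of a column-normalized Legendre design matrix."""
    design = legendre.legvander(samples, degree)
    design = design / np.linalg.norm(design, axis=0)
    return np.linalg.cond(design)


rng = np.random.default_rng(1)
degree = 5

# Aligned sampling: uniform on [-1, 1], matching the Legendre weight w(x) = 1.
aligned = rng.uniform(-1.0, 1.0, size=2000)

# Misaligned sampling: an assumed constrained trajectory confined to [0.6, 0.9].
clustered = rng.uniform(0.6, 0.9, size=2000)

cond_aligned = legendre_cond(aligned, degree)
cond_clustered = legendre_cond(clustered, degree)

print(f"aligned sampling:   {cond_aligned:.3e}")
print(f"clustered sampling: {cond_clustered:.3e}")
```

With aligned samples the empirical Gram matrix is close to diagonal, so the condition number stays near one; with clustered samples the orthogonality of the basis no longer carries over to the data and the condition number grows by orders of magnitude.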
Abstract: Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems. Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models.
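The abstract's remark that measurement noise can lead to widely different recovered models can be illustrated with ordinary least squares on a two-term library. This is a minimal sketch, not the paper's procedure: the model dx/dt = x - 0.5 x^2, the noise level, the sampling ranges, and the helper `coefficient_spread` are all assumptions made for the example, and plain least squares stands in for sparse regression.

```python
import numpy as np


def coefficient_spread(x, true_coefs, noise_std, trials, rng):
    """Std. dev. of least-squares coefficients across noise realizations."""
    library = np.column_stack([x, x**2])   # candidate terms x and x^2
    clean = library @ true_coefs           # noise-free derivative values
    fits = []
    for _ in range(trials):
        noisy = clean + rng.normal(0.0, noise_std, size=x.shape)
        coefs, *_ = np.linalg.lstsq(library, noisy, rcond=None)
        fits.append(coefs)
    return np.std(np.array(fits), axis=0)


rng = np.random.default_rng(2)
true_coefs = np.array([1.0, -0.5])   # assumed model: dx/dt = x - 0.5 x^2

narrow = np.linspace(1.9, 2.0, 100)  # restricted sampling near one state value
wide = np.linspace(0.1, 2.0, 100)    # sampling across a broad dynamic range

spread_narrow = coefficient_spread(narrow, true_coefs, 0.01, 200, rng)
spread_wide = coefficient_spread(wide, true_coefs, 0.01, 200, rng)

print("coefficient spread, narrow sampling:", spread_narrow)
print("coefficient spread, wide sampling:  ", spread_wide)
```

The same noise level produces a noticeably larger scatter in the fitted coefficients when the data cover only a narrow range, because the columns x and x^2 are then nearly collinear.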
Relevance to Life Sciences: Numerical ill-conditioning is especially consequential in model discovery for biological systems, where nonlinear interactions are often represented using nonlinear functions such as polynomials, and where multiscale dynamics, constrained state trajectories, and limited sampling due to experimental limitations can further amplify multicollinearity. We demonstrate these effects across benchmark models relevant to metabolic networks, regulatory networks, and population dynamics. Our results show that poor conditioning can impair the recovery of biologically meaningful governing equations, while sampling strategies matched to the candidate basis can improve identification accuracy. These results imply that a broader range of dynamic sampling is needed in most biological experiments to produce data sets that are suitable for data-driven model discovery with current methods.
Mathematical Content: This paper studies sparse regression-based equation discovery in the presence of multicollinearity and numerical ill-conditioning. We analyze the conditioning of candidate libraries, especially monomial and orthogonal polynomial bases, using condition numbers and model recovery under realistic sampling conditions with publicly available experimental data. We compare how basis choice and sampling distribution affect regression stability, sparsity, and the accuracy of recovered dynamical models.