Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study
Northwestern University | NSF-Simons National Institute for Theory and Mathematics in Biology
30-Second Summary
IN SHORT: This paper addresses numerical ill-conditioning and multicollinearity in library-based sparse regression methods (e.g., SINDy), which lead to unstable and inaccurate recovery of governing equations from biological time-series data.
Key Innovations
- Methodology: Quantitatively demonstrates that severe ill-conditioning (condition numbers up to 10^18) arises in polynomial libraries even for combinations of just two or three terms, fundamentally limiting sparse identification methods.
- Methodology: Shows that orthogonal polynomial bases (e.g., Legendre, Chebyshev) fail to improve conditioning when the data distribution deviates from their theoretical weight functions, and sometimes perform worse than monomials.
- Methodology: Proposes aligning the data sampling distribution with the orthogonal basis's weight function, and validates that doing so can mitigate ill-conditioning and improve model recovery accuracy.
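As a rough illustration of the first point, the sketch below builds a one-variable monomial library on a narrowly sampled state and computes its condition number (the ratio of the largest to the smallest singular value), both for the full library and for every two-column sub-library. The sampling range, sample count, and polynomial degree are illustrative assumptions, not the paper's benchmark settings, and the printed values are not meant to reproduce the reported ones.

```python
import numpy as np
from itertools import combinations

def monomial_library(x, degree):
    """Return the matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)

# Assumed scenario: the state only visits a narrow band, as on a
# constrained trajectory. The range [2.0, 2.5] is purely illustrative.
x = rng.uniform(2.0, 2.5, size=200)
theta = monomial_library(x, degree=5)

# Condition number of the full library (2-norm: sigma_max / sigma_min).
cond_full = np.linalg.cond(theta)

# Condition number of every two-term sub-library.
pair_conds = {
    (i, j): np.linalg.cond(theta[:, [i, j]])
    for i, j in combinations(range(theta.shape[1]), 2)
}
worst_pair = max(pair_conds, key=pair_conds.get)

print(f"full library cond: {cond_full:.3e}")
print(f"worst pair x^{worst_pair[0]}, x^{worst_pair[1]}: "
      f"cond {pair_conds[worst_pair]:.3e}")
```

Even in this toy setting the columns are nearly parallel because every monomial grows monotonically over the sampled band, which is the mechanism behind the multicollinearity described above.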
Main Conclusions
- Ill-conditioning is pervasive in polynomial libraries for biological systems: condition numbers reach O(10^5) for Lotka-Volterra and O(10^18) for chemical reaction network models, leading to systematic model misidentification.
- Orthogonal polynomial bases are not a universal solution; they can worsen conditioning when data distributions (e.g., from constrained biological trajectories) deviate from the basis's required weight function.
- Distribution-aligned sampling mitigates the problem: when data are sampled according to the orthogonal basis's weight function, conditioning improves and equation recovery becomes more accurate.
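The last two conclusions can be sketched numerically. Legendre polynomials are orthogonal under the uniform weight on [-1, 1], so the example below compares the conditioning of a Legendre library when samples are drawn uniformly from [-1, 1] (aligned) with the case where samples are confined to a sub-interval (misaligned). The sub-interval [0.6, 0.9], the degree, and the sample count are assumptions chosen for illustration; this is not the paper's experimental setup.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)
degree = 5
n = 5000

# Aligned case: uniform samples on [-1, 1] match the Legendre weight function.
x_aligned = rng.uniform(-1.0, 1.0, size=n)

# Misaligned case (assumed): samples confined to a sub-interval, standing in
# for a trajectory that explores only part of the domain.
x_restricted = rng.uniform(0.6, 0.9, size=n)

def conds(x):
    """Condition numbers of the Legendre and monomial libraries at x."""
    leg = np.linalg.cond(legendre.legvander(x, degree))
    mono = np.linalg.cond(np.vander(x, degree + 1, increasing=True))
    return leg, mono

leg_aligned, mono_aligned = conds(x_aligned)
leg_restricted, mono_restricted = conds(x_restricted)

print(f"aligned:    Legendre {leg_aligned:.2e}, monomial {mono_aligned:.2e}")
print(f"restricted: Legendre {leg_restricted:.2e}, monomial {mono_restricted:.2e}")
```

Because the Legendre columns are orthogonal but not normalized, the aligned condition number is small rather than exactly 1. Once the samples are restricted, the orthogonality no longer holds on the data and the condition number grows by several orders of magnitude.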
Abstract: Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems. Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models.
Relevance to Life Sciences
Numerical ill-conditioning is especially consequential in model discovery for biological systems, where nonlinear interactions are often represented by nonlinear functions such as polynomials, and where multiscale dynamics, constrained state trajectories, and experimentally limited sampling can further amplify multicollinearity. We demonstrate these effects across benchmark models relevant to metabolic networks, regulatory networks, and population dynamics. Our results show that poor conditioning can impair the recovery of biologically meaningful governing equations, while sampling strategies matched to the candidate basis can improve identification accuracy. These results imply that a broader range of dynamic sampling is needed in most biological experiments to produce data sets that are suitable for data-driven model discovery with current methods.
Mathematical Content
This paper studies sparse regression-based equation discovery in the presence of multicollinearity and numerical ill-conditioning. We analyze the conditioning of candidate libraries, especially monomial and orthogonal polynomial bases, using condition numbers and model recovery under realistic sampling conditions with publicly available experimental data. We compare how basis choice and sampling distribution affect regression stability, sparsity, and the accuracy of recovered dynamical models.
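For readers unfamiliar with library-based sparse regression, the following is a minimal sketch of sequentially thresholded least squares, the thresholding scheme commonly associated with SINDy. It is not claimed to be the solver used in the paper. The test system (logistic growth), the threshold, the noise level, and the function name `stlsq` are all assumptions made for illustration, and the derivative is computed analytically rather than estimated from a time series.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares for one state variable."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit using only the surviving library columns.
        xi[active] = np.linalg.lstsq(theta[:, active], dxdt, rcond=None)[0]
    return xi

rng = np.random.default_rng(2)

# Assumed test system: logistic growth dx/dt = r*x - (r/K)*x^2.
r, K = 1.0, 2.0
x = rng.uniform(0.1, 2.0, size=200)          # broadly sampled states
dxdt = r * x - (r / K) * x**2
dxdt += 1e-3 * rng.standard_normal(x.size)   # small measurement noise

theta = np.vander(x, 4, increasing=True)     # columns 1, x, x^2, x^3
xi = stlsq(theta, dxdt)
print(xi)  # coefficients for 1, x, x^2, x^3
```

With broad sampling and low noise, the constant and cubic terms are pruned and the remaining coefficients stay close to r and -r/K. Narrowing the sampled range of `x` is one simple way to explore how a poorly conditioned library destabilizes this recovery.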