Paper List
-
GOPHER: Optimization-based Phenotype Randomization for Genome-Wide Association Studies with Differential Privacy
This paper addresses the core challenge of balancing rigorous privacy protection with data utility when releasing full GWAS summary statistics, overco...
-
Real-time Cricket Sorting By Sex: A low-cost embedded solution using YOLOv8 and Raspberry Pi
This paper addresses the critical bottleneck in industrial insect farming: the lack of automated, real-time sex sorting systems for Acheta domesticus ...
-
Training Dynamics of Learning 3D-Rotational Equivariance
This work addresses the core dilemma of whether to use computationally expensive equivariant architectures or faster symmetry-agnostic models with dat...
-
Fast and Accurate Node-Age Estimation Under Fossil Calibration Uncertainty Using the Adjusted Pairwise Likelihood
This paper addresses the dual challenge of computational inefficiency and sensitivity to fossil calibration errors in Bayesian divergence time estimat...
-
Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training
This paper addresses the core challenge of accurately predicting protein fitness with only a handful of experimental observations, where data collecti...
-
scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing
This paper addresses the critical gap of fragmented and non-standardized benchmarking in single-cell RNA-seq clustering, which hinders objective compa...
-
Simulation and inference methods for non-Markovian stochastic biochemical reaction networks
This paper addresses the computational bottleneck of simulating and performing Bayesian inference for non-Markovian biochemical systems with history-d...
-
Assessment of Simulation-based Inference Methods for Stochastic Compartmental Models
This paper addresses the core challenge of performing accurate Bayesian parameter inference for stochastic epidemic models when the likelihood functio...
Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study
Northwestern University | NSF-Simons National Institute for Theory and Mathematics in Biology
30-Second Read
IN SHORT: This paper addresses the critical challenge of numerical ill-conditioning and multicollinearity in library-based sparse regression methods (e.g., SINDy), which leads to unstable and inaccurate recovery of governing equations from biological time-series data.
Key Innovations
- Methodology: Quantitatively demonstrates that severe ill-conditioning (condition numbers up to 10^18) arises even with simple 2-3 term combinations in polynomial libraries, fundamentally limiting sparse identification methods.
- Methodology: Shows that orthogonal polynomial bases (e.g., Legendre, Chebyshev) fail to improve conditioning when data distributions deviate from their theoretical weight functions, sometimes performing worse than monomials.
- Methodology: Proposes and validates that aligning the data sampling distribution with the orthogonal basis's weight function can mitigate ill-conditioning and improve model recovery accuracy.
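As a rough illustration of how a two- or three-term monomial combination can become nearly collinear, the sketch below compares a narrow sampling window with a wide one. The sampling windows, the sample count, and the helper name `small_library_diagnostics` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def small_library_diagnostics(x):
    """Return (corr(x, x^2), cond([1, x, x^2])) for the samples x."""
    library = np.column_stack([np.ones_like(x), x, x**2])
    correlation = np.corrcoef(x, x**2)[0, 1]
    condition_number = np.linalg.cond(library)  # 2-norm condition number
    return correlation, condition_number

# Hypothetical sampling windows: a constrained trajectory vs. a wide sweep.
x_narrow = np.linspace(1.0, 1.2, 200)
x_wide = np.linspace(-1.0, 1.0, 200)

corr_narrow, cond_narrow = small_library_diagnostics(x_narrow)
corr_wide, cond_wide = small_library_diagnostics(x_wide)

print(f"narrow window: corr(x, x^2) = {corr_narrow:.4f}, cond = {cond_narrow:.1e}")
print(f"wide window:   corr(x, x^2) = {corr_wide:.4f}, cond = {cond_wide:.1e}")
```

On the narrow window, x and x^2 are almost perfectly correlated, so even this three-column library is poorly conditioned; on the symmetric wide window the two terms are uncorrelated.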
Main Conclusions
- Ill-conditioning is pervasive in polynomial libraries for biological systems: condition numbers reach O(10^5) for Lotka-Volterra and O(10^18) for chemical reaction network models, leading to systematic model misidentification.
- Orthogonal polynomial bases are not a universal solution; they can worsen conditioning when data distributions (e.g., from constrained biological trajectories) deviate from the basis's required weight function.
- Distribution-aligned sampling is a key enabler: when data are sampled according to the orthogonal basis's weight function, conditioning improves significantly, enabling more accurate equation recovery.
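The effect of sampling distribution on basis conditioning can be sketched with NumPy's Vandermonde helpers. Legendre polynomials are orthogonal with respect to a uniform weight on [-1, 1], so uniform samples on that interval count as "aligned" here. The degree, sample sizes, seed, and the misaligned interval [0.6, 0.9] are assumptions for illustration only, and the numbers will not match the paper's benchmarks.

```python
import numpy as np
from numpy.polynomial.legendre import legvander
from numpy.polynomial.polynomial import polyvander

def library_conditions(x, degree=6):
    """Condition numbers of monomial and Legendre libraries evaluated at x."""
    return {
        "monomial": np.linalg.cond(polyvander(x, degree)),
        "legendre": np.linalg.cond(legvander(x, degree)),
    }

rng = np.random.default_rng(0)

# Aligned: uniform on [-1, 1], matching the Legendre weight function.
x_aligned = rng.uniform(-1.0, 1.0, size=2000)
# Misaligned (hypothetical): samples confined to a narrow sub-interval.
x_misaligned = rng.uniform(0.6, 0.9, size=2000)

aligned = library_conditions(x_aligned)
misaligned = library_conditions(x_misaligned)

for label, result in [("aligned", aligned), ("misaligned", misaligned)]:
    print(f"{label:>10}: monomial = {result['monomial']:.2e}, "
          f"legendre = {result['legendre']:.2e}")
```

With aligned samples the Legendre library is well conditioned, whereas on the confined samples its condition number grows by several orders of magnitude. Which basis comes out worse in the misaligned case depends on the data, so the script prints both rather than assuming an ordering.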
Abstract: Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems. Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models.
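The sensitivity to measurement noise described in the abstract can be illustrated with a small Monte Carlo sketch. It uses ordinary least squares rather than a sparse solver, and the logistic-type model dx/dt = x - x^2, the noise level, the sampling windows, and the helper name `coefficient_spread` are all assumptions made for this example.

```python
import numpy as np

def coefficient_spread(x, n_trials=200, noise_std=1e-3, seed=0):
    """Largest standard deviation of fitted coefficients across noisy refits."""
    rng = np.random.default_rng(seed)
    library = np.column_stack([np.ones_like(x), x, x**2])
    true_coeffs = np.array([0.0, 1.0, -1.0])  # hypothetical model: dx/dt = x - x^2
    clean = library @ true_coeffs
    fits = []
    for _ in range(n_trials):
        noisy = clean + noise_std * rng.standard_normal(x.size)
        coeffs = np.linalg.lstsq(library, noisy, rcond=None)[0]
        fits.append(coeffs)
    return np.std(np.array(fits), axis=0).max()

# Same model and noise level; only the sampled range of x differs.
spread_narrow = coefficient_spread(np.linspace(0.45, 0.55, 100))
spread_wide = coefficient_spread(np.linspace(0.0, 1.0, 100))

print(f"coefficient spread, narrow window: {spread_narrow:.2e}")
print(f"coefficient spread, wide window:   {spread_wide:.2e}")
```

The same noise level produces a much larger scatter in the fitted coefficients when the state is sampled over a narrow window, which is the mechanism by which poor conditioning can yield widely different recovered models.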
Relevance to Life Sciences
Numerical ill-conditioning is especially consequential in model discovery for biological systems, where nonlinear interactions are often represented using nonlinear functions such as polynomials, and where multiscale dynamics, constrained state trajectories, and limited sampling due to experimental limitations can further amplify multicollinearity. We demonstrate these effects across benchmark models relevant to metabolic networks, regulatory networks, and population dynamics. Our results show that poor conditioning can impair the recovery of biologically meaningful governing equations, while sampling strategies matched to the candidate basis can improve identification accuracy. These results imply that a broader range of dynamic sampling is needed in most biological experiments to produce data sets that are suitable for data-driven model discovery with current methods.
Mathematical Content
This paper studies sparse regression-based equation discovery in the presence of multicollinearity and numerical ill-conditioning. We analyze the conditioning of candidate libraries, especially monomial and orthogonal polynomial bases, using condition numbers and model recovery under realistic sampling conditions with publicly available experimental data. We compare how basis choice and sampling distribution affect regression stability, sparsity, and the accuracy of recovered dynamical models.
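For readers unfamiliar with sparse regression-based equation discovery, the following is a minimal sketch of sequentially thresholded least squares, the solver commonly associated with SINDy. Whether the paper uses this exact solver is not stated here; the threshold, the noise-free logistic test model, and the function name `stlsq` are assumptions for illustration.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares for a single state variable."""
    coeffs = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(coeffs) < threshold
        coeffs[small] = 0.0          # prune terms with small coefficients
        active = ~small
        if not active.any():
            break
        # Refit using only the remaining candidate terms.
        coeffs[active] = np.linalg.lstsq(theta[:, active], dxdt, rcond=None)[0]
    return coeffs

# Hypothetical noise-free test: dx/dt = x - x^2 with library [1, x, x^2, x^3].
x = np.linspace(0.05, 0.95, 50)
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
dxdt = x - x**2

recovered = stlsq(theta, dxdt)
print(recovered)
```

On clean, well-spread data the two true terms are recovered and the spurious ones are pruned. The thresholding step depends on the least-squares coefficients being stable, which is why ill-conditioned libraries can cause the wrong terms to be kept or discarded.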