Paper List

Game Theory

Evolutionarily Stable Stackelberg Equilibrium

2026-03-19

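Bridges the gap between the Stackelberg leadership model and evolutionary stability by requiring follower strategies to be robust to mutant invasion.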
Computational Neuroscience

Recovering Sparse Neural Connectivity from Partial Measurements: A Covariance-Based Approach with Granger-Causality Refinement

2026-03-19

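Reconstructs full neural connectivity from partial recordings by accumulating covariance statistics across multiple experimental sessions.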
Bioinformatics

Atomic Trajectory Modeling with State Space Models for Biomolecular Dynamics

2026-03-18

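ATMOS bridges the gap between computationally expensive MD simulations and time-limited deep generative models by providing an efficient SSM-based framework for generating atomic-level trajectories of biomolecules.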
Theoretical Ecology

Slow evolution towards generalism in a model of variable dietary range

2026-03-18

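Resolves the paradox of speciation under indirect competition by showing that demographic noise, rather than deterministic dynamics, drives pattern formation and the evolution of generalist diets.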
Bioinformatics

Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search

2026-03-18

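Addresses hallucination in radiology report generation by grounding impression drafts in retrieved historical cases, with explicit citations and confidence-based abstention.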
Reinforcement Learning

Unified Policy–Value Decomposition for Rapid Adaptation

2026-03-18

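Enables zero-shot adaptation to novel tasks by sharing a low-dimensional goal embedding between the policy and value functions through a bilinear decomposition.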
Bioinformatics

Mathematical Modeling of Cancer–Bacterial Therapy: Analysis and Numerical Simulation via Physics-Informed Neural Networks

2026-03-18

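Provides a rigorous, mesh-free PINN framework for simulating and analyzing the complex, spatially heterogeneous interactions in bacterial cancer therapy.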
Bioinformatics

Sample-Efficient Adaptation of Drug-Response Models to Patient Tumors under Strong Biological Domain Shift

2026-03-17

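Enables effective prediction of patient drug response from minimal clinical data by learning transferable representations from unlabeled molecular profiles.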

Journal: arXiv preprint

Published: 2026-03-13

Human-Computer Interaction | Artificial Intelligence

Developing the PsyCogMetrics™ AI Lab to Evaluate Large Language Models and Advance Cognitive Science

Marywood University | The University of Scranton | University of North Carolina Wilmington | California State University Dominguez Hills

Zhiye Jin, Yibai Li, K. D. Joshi, Xuefei (Nancy) Deng, Xiaobing (Emily) Li

30-Second Read

IN SHORT: This paper addresses the critical gap between sophisticated LLM evaluation needs and the lack of accessible, scientifically rigorous platforms that integrate psychometric and cognitive science methodologies for non-technical stakeholders.

Key Innovations

Methodology: Introduces the first cloud-based platform applying Classical Test Theory (CTT) and psychometric validity principles (Cronbach's α > .70, AVE > .50) to systematically evaluate LLMs as cognitive entities rather than mere tools.
Methodology: Implements a three-cycle Action Design Science framework (Relevance-Rigor-Design) with nested Build–Intervene–Evaluate loops, bridging Popperian falsifiability, Cognitive Load Theory, and stakeholder requirements into a unified evaluation system.
Biology: Validates that modern LLMs (GPT-4, LLaMA-3) satisfy core psychometric validity criteria (convergent, discriminant, predictive, and external validity) and outperform earlier models (GPT-3.5, LLaMA-2) across these dimensions.
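
The two thresholds cited above (Cronbach's α > .70, AVE > .50) are standard psychometric statistics. The following is a minimal sketch of how they are conventionally computed, not the platform's implementation: the function names, the score matrix, and the loadings are hypothetical, and treating each row as one respondent (for an LLM, perhaps one independent run) is an assumption made only for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of rows, one per respondent (hypothetically, one per
    LLM run), each row holding that respondent's k item scores.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) across respondents.
    item_var_sum = sum(variance(col) for col in zip(*scores))
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def average_variance_extracted(loadings):
    """AVE: the mean of the squared standardized factor loadings."""
    return sum(l * l for l in loadings) / len(loadings)


# Hypothetical data, for illustration only.
example_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1]]
example_loadings = [0.8, 0.7, 0.9]

alpha = cronbach_alpha(example_scores)
ave = average_variance_extracted(example_loadings)
print(f"alpha > .70: {alpha > 0.70}, AVE > .50: {ave > 0.50}")
```

The loadings passed to `average_variance_extracted` would normally come from a fitted factor model; here they are supplied by hand.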

Key Findings

The PsyCogMetrics™ AI Lab successfully operationalizes psychometric principles with demonstrated reliability metrics (Cronbach's α > .70) and validity frameworks (convergent/discriminant validity) for LLM evaluation.
The platform addresses three critical pain points: it mitigates benchmark saturation through dynamic evaluation, reduces data contamination via reproducible workflows, and expands coverage through cognitive science methodologies.
Design validation shows GPT-4 and LLaMA-3 satisfy psychometric validity criteria and outperform earlier models, with GPT-4 reaching six-year-old human parity on Theory of Mind vignettes (Strachan et al., 2024).

Research Gap: Current LLM evaluation suffers from benchmark saturation (new models achieve near-ceiling scores without real capability improvements), data contamination (test sets leak into training), lack of coverage for emerging capabilities, and developer-oriented tools that exclude psychology and cognitive science experts who lack programming infrastructure.

Abstract: This study presents the development of the PsyCogMetrics™ AI Lab (https://psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build–Intervene–Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.