Paper List

Epidemiology

The Effective Reproduction Number in the Kermack-McKendrick model with age of infection and reinfection

2025-12-05

This paper addresses the challenge of accurately estimating the time-varying effective reproduction number ℛ(t) in epidemics by incorporating two crit...

Computational Neuroscience

Covering Relations in the Poset of Combinatorial Neural Codes

2025-12-04

This work addresses the core challenge of navigating the complex poset structure of neural codes to systematically test the conjecture linking convex ...

Physical Chemistry

Collective adsorption of pheromones at the water-air interface

2025-12-04

This paper addresses the core challenge of understanding how amphiphilic pheromones, previously assumed to be transported in the gas phase, can be sta...

Bioinformatics

pHapCompass: Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase

2025-12-04

This paper addresses the core challenge of accurately assembling polyploid haplotypes from sequencing data, where read assignment ambiguity and an exp...

Computational Neuroscience

Setting up for failure: automatic discovery of the neural mechanisms of cognitive errors

2025-12-04

This paper addresses the core challenge of automating the discovery of biologically plausible recurrent neural network (RNN) dynamics that can replica...

Cognitive Neuroscience

Influence of Object Affordance on Action Language Understanding: Evidence from Dynamic Causal Modeling Analysis

2025-12-04

This study addresses the core challenge of moving beyond correlational evidence to establish the *causal direction* and *temporal dynamics* of how obj...

Neuroscience

Revealing stimulus-dependent dynamics through statistical complexity

2025-12-04

This paper addresses the core challenge of detecting stimulus-specific patterns in neural population dynamics that remain hidden to traditional variab...

Biophysics

Exactly Solvable Population Model with Square-Root Growth Noise and Cell-Size Regulation

2025-12-04

This paper addresses the fundamental gap in understanding how microscopic growth fluctuations, specifically those with size-dependent (square-root) no...

Journal: arXiv Preprint

Publication date: 2026-03-13

Human-Computer Interaction | Artificial Intelligence

Developing the PsyCogMetrics™ AI Lab to Evaluate Large Language Models and Advance Cognitive Science

Marywood University | The University of Scranton | University of North Carolina Wilmington | California State University Dominguez Hills

Zhiye Jin, Yibai Li, K. D. Joshi, Xuefei (Nancy) Deng, Xiaobing (Emily) Li

30-Second Read

IN SHORT: This paper addresses the critical gap between sophisticated LLM evaluation needs and the lack of accessible, scientifically rigorous platforms that integrate psychometric and cognitive science methodologies for non-technical stakeholders.

Key Innovations

Methodology: Introduces the first cloud-based platform applying Classical Test Theory (CTT) and psychometric validity principles (Cronbach's α > .70, AVE > .50) to systematically evaluate LLMs as cognitive entities rather than mere tools.
Methodology: Implements a three-cycle Action Design Science framework (Relevance-Rigor-Design) with nested Build–Intervene–Evaluate loops, bridging Popperian falsifiability, Cognitive Load Theory, and stakeholder requirements into a unified evaluation system.
Biology: Validates that modern LLMs (GPT-4, LLaMA-3) satisfy core psychometric validity criteria (convergent, discriminant, predictive, and external validity) and outperform earlier models (GPT-3.5, LLaMA-2) across these dimensions.
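
The two thresholds quoted above (Cronbach's α > .70 and AVE > .50) follow from standard psychometric formulas. The sketch below shows how they could be computed. It is illustrative only and is not the platform's implementation: the function names, the demo scores, and the demo loadings are all hypothetical, and it assumes each row holds one administration's scores on the items of a single construct.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows, each row holding k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    columns = list(zip(*scores))                      # one tuple per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized factor loadings of one construct."""
    return sum(l * l for l in loadings) / len(loadings)


if __name__ == "__main__":
    # Hypothetical data: 5 administrations x 3 items, and 3 made-up loadings.
    demo_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [1, 2, 2]]
    demo_loadings = [0.82, 0.76, 0.71]

    alpha = cronbach_alpha(demo_scores)
    ave = average_variance_extracted(demo_loadings)
    print(f"alpha = {alpha:.3f}, passes .70 threshold: {alpha > 0.70}")
    print(f"AVE   = {ave:.3f}, passes .50 threshold: {ave > 0.50}")
```

The loadings passed to `average_variance_extracted` would normally come from a fitted factor model, which this sketch does not cover.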

Main Conclusions

The PsyCogMetrics™ AI Lab successfully operationalizes psychometric principles with demonstrated reliability metrics (Cronbach's α > .70) and validity frameworks (convergent/discriminant validity) for LLM evaluation.
The platform addresses three critical pain points: it mitigates benchmark saturation through dynamic evaluation, reduces data contamination via reproducible workflows, and expands coverage through cognitive science methodologies.
Design validation shows GPT-4 and LLaMA-3 satisfy psychometric validity criteria and outperform earlier models, with GPT-4 reaching six-year-old human parity on Theory of Mind vignettes (Strachan et al., 2024).
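
The summary does not say which test the platform uses for discriminant validity. One widely used option is the Fornell-Larcker criterion, which requires the square root of each construct's AVE to exceed that construct's correlation with every other construct. The sketch below assumes that criterion. The function name, construct labels, and numbers are hypothetical.

```python
from math import sqrt


def fornell_larcker_ok(ave, correlations):
    """Return True if every construct pair meets the Fornell-Larcker criterion.

    ave:          dict mapping construct name -> AVE
    correlations: dict mapping (construct_a, construct_b) -> correlation
    """
    for (a, b), r in correlations.items():
        # The weaker of the two constructs sets the bar for the pair.
        if abs(r) >= min(sqrt(ave[a]), sqrt(ave[b])):
            return False
    return True


if __name__ == "__main__":
    # Hypothetical constructs and values, for illustration only.
    ave = {"empathy": 0.64, "reasoning": 0.49}
    correlations = {("empathy", "reasoning"): 0.50}
    print("discriminant validity holds:", fornell_larcker_ok(ave, correlations))
```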

Research gap: Current LLM evaluation suffers from benchmark saturation (new models achieve near-ceiling scores without real capability improvements), data contamination (test sets leak into training), lack of coverage for emerging capabilities, and developer-oriented tools that exclude psychology and cognitive science experts who lack programming infrastructure.

Abstract: This study presents the development of the PsyCogMetrics™ AI Lab (https://psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build–Intervene–Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.