Paper List

Computational Neuroscience

Translating Measures onto Mechanisms: The Cognitive Relevance of Higher-Order Information

2025-12-02

This review addresses the core challenge of translating abstract higher-order information theory metrics (e.g., synergy, redundancy) into defensible, ...

Artificial Intelligence

Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

2025-12-02

This paper addresses the critical gap in understanding whether LLMs spontaneously develop human-like Bayesian strategies for processing uncertain info...

Bioinformatics

Vessel Network Topology in Molecular Communication: Insights from Experiments and Theory

2025-12-02

This work addresses the critical lack of experimentally validated channel models for molecular communication within complex vessel networks, which is ...

Biophysics

Modulation of DNA rheology by a transcription factor that forms aging microgels

2025-12-02

This work addresses the fundamental question of how the transcription factor NANOG, essential for embryonic stem cell pluripotency, physically regulat...

Systems Biology

Imperfect molecular detection renormalizes apparent kinetic rates in stochastic gene regulatory networks

2025-12-02

This paper addresses the core challenge of distinguishing genuine stochastic dynamics of gene regulatory networks from artifacts introduced by imperfe...

Bioinformatics

PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer

2025-12-02

This paper addresses the dual challenge of achieving computational efficiency without sacrificing accuracy in whole-transcriptome single-cell represen...

Mathematical Biology

Beyond Bayesian Inference: The Correlation Integral Likelihood Framework and Gradient Flow Methods for Deterministic Sampling

2025-12-02

This paper addresses the core challenge of calibrating complex biological models (e.g., PDEs, agent-based models) with incomplete, noisy, or heterogen...

Bioinformatics

Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing

2025-12-02

This paper addresses the core challenge of detecting viral variants in wastewater sequencing data without reference genomes or labeled annotations, ov...

Journal: arXiv Preprint

Publication date: 2026-03-13

Human-Computer Interaction | Artificial Intelligence

Developing the PsyCogMetrics™ AI Lab to Evaluate Large Language Models and Advance Cognitive Science

Marywood University | The University of Scranton | University of North Carolina Wilmington | California State University Dominguez Hills

Zhiye Jin, Yibai Li, K. D. Joshi, Xuefei (Nancy) Deng, Xiaobing (Emily) Li

30-Second Read

IN SHORT: This paper addresses the critical gap between sophisticated LLM evaluation needs and the lack of accessible, scientifically rigorous platforms that integrate psychometric and cognitive science methodologies for non-technical stakeholders.

Key Innovations

Methodology: Introduces the first cloud-based platform applying Classical Test Theory (CTT) and psychometric validity principles (Cronbach's α > .70, AVE > .50) to systematically evaluate LLMs as cognitive entities rather than mere tools.
Methodology: Implements a three-cycle Action Design Science framework (Relevance-Rigor-Design) with nested Build–Intervene–Evaluate loops, bridging Popperian falsifiability, Cognitive Load Theory, and stakeholder requirements into a unified evaluation system.
Biology: Validates that modern LLMs (GPT-4, LLaMA-3) satisfy core psychometric validity criteria (convergent, discriminant, predictive, and external validity) and outperform earlier models (GPT-3.5, LLaMA-2) across these dimensions.
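
The two thresholds cited above (Cronbach's α > .70, AVE > .50) can be illustrated with a minimal sketch. It uses the standard textbook definitions of Cronbach's α and average variance extracted (AVE); it is not the platform's own code, and the function names and sample numbers are hypothetical, chosen only for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of item scores.

    scores: list of rows, one row per respondent (or per LLM run),
    each row holding the k item scores of a single scale.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total score has zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized factor loadings."""
    return sum(l * l for l in loadings) / len(loadings)


# Hypothetical data: 4 respondents answering a 3-item scale.
sample_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]
# Hypothetical standardized loadings from a factor model.
sample_loadings = [0.8, 0.7, 0.9]

alpha = cronbach_alpha(sample_scores)
ave = average_variance_extracted(sample_loadings)
print(f"alpha={alpha:.3f} passes={alpha > 0.70}")
print(f"AVE={ave:.3f} passes={ave > 0.50}")
```

In this sketch α measures internal consistency of the items on one scale, while AVE summarizes how much item variance a latent construct explains. The cutoffs applied are the ones quoted in the summary.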

Main Conclusions

The PsyCogMetrics™ AI Lab successfully operationalizes psychometric principles with demonstrated reliability metrics (Cronbach's α > .70) and validity frameworks (convergent/discriminant validity) for LLM evaluation.
The platform addresses three critical pain points: it mitigates benchmark saturation through dynamic evaluation, reduces data contamination via reproducible workflows, and expands coverage through cognitive science methodologies.
Design validation shows that GPT-4 and LLaMA-3 satisfy psychometric validity criteria and outperform earlier models, with GPT-4 reaching parity with six-year-old children on Theory of Mind vignettes (Strachan et al., 2024).

Research Gap: Current LLM evaluation suffers from four problems: benchmark saturation (new models achieve near-ceiling scores without real capability improvements), data contamination (test sets leak into training data), poor coverage of emerging capabilities, and developer-oriented tools that exclude psychology and cognitive science experts who lack programming infrastructure.

Abstract: This study presents the development of the PsyCogMetrics™ AI Lab (https://psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build–Intervene–Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.