Paper List

Health Informatics

An AI Implementation Science Study to Improve Trustworthy Data in a Large Healthcare System

2025-12-01

This paper addresses the critical gap between theoretical AI research and real-world clinical implementation by providing a practical framework for as...
Bioinformatics

The BEAT-CF Causal Model: A model for guiding the design of trials and observational analyses of cystic fibrosis exacerbations

2025-12

This paper addresses the critical gap in cystic fibrosis exacerbation management by providing a formal causal framework that integrates expert knowled...
Bioinformatics

Hierarchical Molecular Language Models (HMLMs)

2025-11-30

This paper addresses the core challenge of accurately modeling context-dependent signaling, pathway cross-talk, and temporal dynamics across multiple ...
Computational Neuroscience

Stability analysis of action potential generation using Markov models of voltage‑gated sodium channel isoforms

2025-11-30

This work addresses the challenge of systematically characterizing how the high-dimensional parameter space of Markov models for different sodium chan...
Network Science

Approximate Bayesian Inference on Mechanisms of Network Growth and Evolution

2025-11-30

This paper addresses the core challenge of inferring the relative contributions of multiple, simultaneous generative mechanisms in network formation w...
Bioinformatics

EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants

2025-11-29

This paper addresses the core challenge of jointly predicting enzyme kinetic parameters (Kcat and Km) by modeling dynamic enzyme-substrate interaction...
Biophysics

Tissue stress measurements with Bayesian Inversion Stress Microscopy

2025-11-29

This paper addresses the core challenge of measuring absolute, tissue-scale mechanical stress without making assumptions about tissue rheology, which ...
Bioinformatics

DeepFRI Demystified: Interpretability vs. Accuracy in AI Protein Function Prediction

2025-11-29

This study addresses the critical gap between high predictive accuracy and biological interpretability in DeepFRI, revealing that the model often prio...


Journal: arXiv preprint

Published: 2026-03-13

Human-Computer Interaction, Artificial Intelligence

Developing the PsyCogMetrics™ AI Lab to Evaluate Large Language Models and Advance Cognitive Science

Marywood University | The University of Scranton | University of North Carolina Wilmington | California State University Dominguez Hills

Zhiye Jin, Yibai Li, K. D. Joshi, Xuefei (Nancy) Deng, Xiaobing (Emily) Li

30-Second Read

IN SHORT: This paper addresses the gap between the need for sophisticated LLM evaluation and the absence of accessible, scientifically rigorous platforms that bring psychometric and cognitive-science methods to non-technical stakeholders.

Key Innovations

Methodology: Introduces the first cloud-based platform applying Classical Test Theory (CTT) and psychometric validity principles (Cronbach's α > .70, AVE > .50) to systematically evaluate LLMs as cognitive entities rather than mere tools.
Methodology: Implements a three-cycle Action Design Science framework (Relevance-Rigor-Design) with nested Build–Intervene–Evaluate loops, combining Popperian falsifiability, Cognitive Load Theory, and stakeholder requirements in a unified evaluation system.
Biology: Validates that modern LLMs (GPT-4, LLaMA-3) satisfy core psychometric validity criteria, including convergent, discriminant, predictive, and external validity, and outperform earlier models (GPT-3.5, LLaMA-2) across these dimensions.
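
The reliability threshold cited above (Cronbach's α > .70) can be illustrated with a minimal sketch. It uses the standard textbook formula for α and is not the platform's own implementation; the function name `cronbach_alpha` and the sample scores (each row a hypothetical LLM run answering a three-item scale) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one per respondent (here, hypothetically, one per
    LLM run), each a list of numeric item scores of equal length.
    """
    if len(scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least two items and equal-length rows")
    # Sample variance of each item (column) and of the per-row total score.
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five runs, three Likert-style items (not from the paper).
runs = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 1],
]
alpha = cronbach_alpha(runs)
print(f"alpha = {alpha:.3f}; meets the .70 threshold: {alpha > 0.70}")
```

Items that move together across runs push α toward 1, while unrelated items pull it toward 0, which is why the .70 cut-off is read as evidence of internal consistency.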

Main Conclusions

The PsyCogMetrics™ AI Lab operationalizes psychometric principles for LLM evaluation, with demonstrated reliability metrics (Cronbach's α > .70) and validity frameworks (convergent and discriminant validity).
The platform addresses three critical pain points: it mitigates benchmark saturation through dynamic evaluation, reduces data contamination via reproducible workflows, and expands coverage through cognitive-science methodologies.
Design validation shows that GPT-4 and LLaMA-3 satisfy psychometric validity criteria and outperform earlier models, with GPT-4 reaching parity with six-year-old children on Theory of Mind vignettes (Strachan et al., 2024).
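
As a rough illustration of the convergent and discriminant validity checks mentioned above, the sketch below computes Average Variance Extracted (AVE) from standardized factor loadings and applies the Fornell-Larcker criterion. The summary does not say which discriminant test the platform uses, so Fornell-Larcker is an assumption here, as are the function names and the loading values.

```python
from math import sqrt


def average_variance_extracted(loadings):
    """AVE: the mean of the squared standardized loadings of one construct."""
    if not loadings:
        raise ValueError("need at least one loading")
    return sum(l * l for l in loadings) / len(loadings)


def fornell_larcker_ok(ave_a, ave_b, corr_ab):
    """Fornell-Larcker criterion (assumed here, not confirmed by the source):
    the square root of each construct's AVE must exceed the absolute
    correlation between the two constructs."""
    return min(sqrt(ave_a), sqrt(ave_b)) > abs(corr_ab)


# Hypothetical standardized loadings for two constructs (not from the paper).
ave_x = average_variance_extracted([0.8, 0.7, 0.9])
ave_y = average_variance_extracted([0.6, 0.75, 0.7])

print(f"AVE X = {ave_x:.3f}; convergent validity (> .50): {ave_x > 0.50}")
print(f"AVE Y = {ave_y:.3f}; convergent validity (> .50): {ave_y > 0.50}")
print("discriminant validity at r = .50:", fornell_larcker_ok(ave_x, ave_y, 0.50))
```

A construct can pass the discriminant check while still falling short of the AVE > .50 convergent threshold, so the two criteria need to be reported separately.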

Research gap: Current LLM evaluation suffers from benchmark saturation (new models reach near-ceiling scores without real capability gains), data contamination (test sets leak into training data), poor coverage of emerging capabilities, and developer-oriented tools that exclude psychology and cognitive-science experts who lack programming infrastructure.

Abstract: This study presents the development of the PsyCogMetrics™ AI Lab (https://psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build–Intervene–Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.