Paper List
-
STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings
This paper addresses the core challenge of generalizing protein function prediction to unseen or newly introduced Gene Ontology (GO) terms by overcomi...
-
Incorporating indel channels into average-case analysis of seed-chain-extend
This paper addresses the core pain point of bridging the theoretical gap for the widely used seed-chain-extend heuristic by providing the first rigoro...
-
Competition, stability, and functionality in excitatory-inhibitory neural circuits
This paper addresses the core challenge of extending interpretable energy-based frameworks to biologically realistic asymmetric neural networks, where...
-
Enhancing Clinical Note Generation with ICD-10, Clinical Ontology Knowledge Graphs, and Chain-of-Thought Prompting Using GPT-4
This paper addresses the core challenge of generating accurate and clinically relevant patient notes from sparse inputs (ICD codes and basic demograph...
-
Learning From Limited Data and Feedback for Cell Culture Process Monitoring: A Comparative Study
This paper addresses the core challenge of developing accurate real-time bioprocess monitoring soft sensors under severe data constraints: limited his...
-
Cell-cell communication inference and analysis: biological mechanisms, computational approaches, and future opportunities
This review addresses the critical need for a systematic framework to navigate the rapidly expanding landscape of computational methods for inferring ...
-
Generating a Contact Matrix for Aged Care Settings in Australia: an agent-based model study
This study addresses the critical gap in understanding heterogeneous contact patterns within aged care facilities, where existing population-level con...
-
Emergent Spatiotemporal Dynamics in Large-Scale Brain Networks with Next Generation Neural Mass Models
This work addresses the core challenge of understanding how complex, brain-wide spatiotemporal patterns emerge from the interaction of biophysically d...
-
Developing the PsyCogMetrics™ AI Lab to Evaluate Large Language Models and Advance Cognitive Science
Marywood University | The University of Scranton | University of North Carolina Wilmington | California State University Dominguez Hills
30-Second Quick Read
IN SHORT: This paper addresses the critical gap between sophisticated LLM evaluation needs and the lack of accessible, scientifically rigorous platforms that integrate psychometric and cognitive science methodologies for non-technical stakeholders.
Core Innovations
- Methodology: Introduces the first cloud-based platform applying Classical Test Theory (CTT) and psychometric validity principles (Cronbach's α > .70, AVE > .50) to systematically evaluate LLMs as cognitive entities rather than mere tools.
- Methodology: Implements a three-cycle Action Design Science framework (Relevance-Rigor-Design) with nested Build–Intervene–Evaluate loops, bridging Popperian falsifiability, Cognitive Load Theory, and stakeholder requirements into a unified evaluation system.
- Biology: Validates that modern LLMs (GPT-4, LLaMA-3) satisfy core psychometric validity criteria, including convergent, discriminant, predictive, and external validity, and outperform earlier models (GPT-3.5, LLaMA-2) across these dimensions.
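The reliability threshold cited above (Cronbach's α > .70) comes from Classical Test Theory. As a minimal sketch, not the paper's implementation, α can be computed from a matrix of item scores; the data below is purely hypothetical, standing in for repeated LLM evaluation runs scored on several test items:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
    """
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 evaluation runs x 4 test items (e.g. Likert-style ratings)
scores = np.array([
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
    [4, 4, 4, 4],
])
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.3f}")  # values above .70 would meet the paper's reliability criterion
```

Internally consistent items (ones that rise and fall together across runs, as here) push α toward 1, which is why the .70 cutoff is treated as evidence that an evaluation scale is measuring a coherent construct.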
Key Conclusions
- The PsyCogMetrics™ AI Lab successfully operationalizes psychometric principles with demonstrated reliability metrics (Cronbach's α > .70) and validity frameworks (convergent/discriminant validity) for LLM evaluation.
- The platform addresses three critical pain points: mitigates benchmark saturation through dynamic evaluation, reduces data contamination via reproducible workflows, and expands coverage through cognitive science methodologies.
- Design validation shows that GPT-4 and LLaMA-3 satisfy psychometric validity criteria and outperform earlier models, with GPT-4 reaching parity with six-year-old human performance on Theory of Mind vignettes (Strachan et al., 2024).
Abstract: This study presents the development of the PsyCogMetrics™ AI Lab (https://psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build–Intervene–Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.