Paper List
-
Mapping of Lesion Images to Somatic Mutations
This paper addresses the critical bottleneck of delayed genetic analysis in cancer diagnosis by predicting a patient's full somatic mutation profile d...
-
Reinventing Clinical Dialogue: Agentic Paradigms for LLM‑Enabled Healthcare Communication
This paper addresses the core challenge of transforming reactive, stateless LLMs into autonomous, reliable clinical dialogue agents capable of longitu...
-
Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization
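Bridges protein representation learning and combinatorial optimization by mapping sequences to a binary latent space for QUBO-based fitness optimization.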
-
Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement
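Demonstrates that model-free reinforcement learning can effectively steer fish schools using virtual visual stimuli, overcoming the lack of an accurate behavioral model.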
Developing the PsyCogMetrics™ AI Lab to Evaluate Large Language Models and Advance Cognitive Science
Marywood University | The University of Scranton | University of North Carolina Wilmington | California State University Dominguez Hills
30-Second Read
IN SHORT: This paper addresses the gap between the need for sophisticated LLM evaluation and the scarcity of accessible, scientifically rigorous platforms that bring psychometric and cognitive science methodologies to non-technical stakeholders.
Key Innovations
- Methodology: Introduces the first cloud-based platform applying Classical Test Theory (CTT) and psychometric validity principles (Cronbach's α > .70, AVE > .50) to systematically evaluate LLMs as cognitive entities rather than mere tools.
- Methodology: Implements a three-cycle Action Design Science framework (Relevance-Rigor-Design) with nested Build–Intervene–Evaluate loops, bridging Popperian falsifiability, Cognitive Load Theory, and stakeholder requirements into a unified evaluation system.
- Biology: Validates that modern LLMs (GPT-4, LLaMA-3) satisfy core psychometric validity criteria, including convergent, discriminant, predictive, and external validity, and outperform earlier models (GPT-3.5, LLaMA-2) across these dimensions.
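The two thresholds cited in this list (Cronbach's α > .70 for reliability, AVE > .50 for convergent validity) can be illustrated with a minimal sketch in plain Python. This is not the platform's code: the function names, the response matrix, and the factor loadings below are hypothetical. It only applies the standard textbook formulas, α = k/(k−1) · (1 − Σ item variances / variance of total scores) and AVE = mean of squared standardized loadings.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondent rows, each holding k item scores."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized factor loadings of one construct."""
    return sum(l * l for l in loadings) / len(loadings)


def meets_thresholds(alpha, ave, alpha_min=0.70, ave_min=0.50):
    """Check the cut-offs quoted in the summary (alpha > .70, AVE > .50)."""
    return alpha > alpha_min and ave > ave_min


# Hypothetical data: 5 respondents (e.g. repeated LLM runs) x 3 Likert items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
# Hypothetical standardized loadings from a fitted measurement model.
loadings = [0.8, 0.7, 0.9]

alpha = cronbach_alpha(responses)
ave = average_variance_extracted(loadings)
print(f"alpha={alpha:.3f}, AVE={ave:.3f}, passes={meets_thresholds(alpha, ave)}")
```

In practice the loadings would come from a confirmatory factor analysis rather than being typed in by hand; they are hard-coded here only to keep the sketch self-contained.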
Key Conclusions
- The PsyCogMetrics™ AI Lab successfully operationalizes psychometric principles with demonstrated reliability metrics (Cronbach's α > .70) and validity frameworks (convergent/discriminant validity) for LLM evaluation.
- The platform addresses three pain points: it mitigates benchmark saturation through dynamic evaluation, reduces data contamination via reproducible workflows, and expands coverage through cognitive science methodologies.
- Design validation shows GPT-4 and LLaMA-3 satisfy psychometric validity criteria and outperform earlier models, with GPT-4 reaching six-year-old human parity on Theory of Mind vignettes (Strachan et al., 2024).
Abstract: This study presents the development of the PsyCogMetrics™ AI Lab (https://psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build–Intervene–Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.