
Use statistical methods to re-evaluate the performance of massive LLMs


fivehills/Reevaluating-LLM-performance


About

This project employs multifaceted statistical methods to re-evaluate the performance of LLMs.

R code:

  1. "anova.test.R" is used to do anova tests and Tukey tests;
  2. "GAMMs_tests.R" for implementing GAMM tests;
  3. "Plot_GAMMs. R" for plotting partial effects;
  4. "tsne_cluster.R" for tsne testing and plots.

Cite:

@article{sun2024comprehensive,
  title={Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach},
  author={Sun, Kun and Wang, Rong and S{\o}gaard, Anders},
  journal={arXiv preprint arXiv:2403.15250},
  year={2024}
}

Abstract

Amidst the rapid evolution of LLMs, evaluation has become increasingly important for understanding and advancing these models. Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs. However, the extent and nature of these impacts remain subjects of debate, because most assessments have been restricted to a limited number of models and data points. Clarifying the effects of these factors on performance scores is more effectively achieved through a statistical lens. Our study undertakes a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. Leveraging an expansive dataset of evaluation results produced under a uniform evaluation framework, we introduce a comprehensive statistical methodology that includes ANOVA, Tukey HSD tests, GAMMs, and clustering techniques, offering a robust and transparent approach to deciphering LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and about the influence of training types and architectures in LLMs. These findings offer new perspectives on the intrinsic nature and developmental trajectories of LLMs, supporting a more nuanced approach to AI development. By providing straightforward and reliable methods for scrutinizing and reassessing LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potential, and it supports informed engagement with AI technologies on the path to AGI.
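
To make the clustering step concrete, here is a minimal R sketch of t-SNE followed by k-means on a matrix of per-benchmark scores. The Rtsne package, the synthetic score_mat, and the k-means step are illustrative assumptions; tsne_cluster.R in this repository implements the actual pipeline.

## Sketch of the t-SNE + clustering step (cf. tsne_cluster.R).
## The synthetic score matrix stands in for real evaluation results.
library(Rtsne)

set.seed(42)
score_mat <- matrix(rnorm(60 * 10), nrow = 60,      # 60 models x 10 benchmarks
                    dimnames = list(paste0("model", 1:60),
                                    paste0("bench", 1:10)))

emb <- Rtsne(score_mat, dims = 2, perplexity = 15)$Y  # 2-D embedding of models
cl  <- kmeans(emb, centers = 3)$cluster               # group models in embedding space

plot(emb, col = cl, pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2",
     main = "Models clustered by benchmark profile")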
