Understanding LLM benchmarks these days
TL;DR version
- AGIEval: A benchmark built from official human exams (e.g., college entrance, law school admission, and math competition tests) that gauges models’ reasoning and problem-solving on human-centric tasks.
- MMLU: A multiple-choice benchmark spanning 57 subjects, from elementary math to law, that measures how much knowledge a model can recall and apply.
- BigBench Hard: A curated subset of the hardest BIG-Bench tasks, targeting intricate, multi-step questions that demand reasoning and inference.
- ANLI: An adversarially collected natural language inference benchmark that tests whether a model can judge how two sentences relate (entailment, contradiction, or neutrality).
- HellaSwag: A commonsense completion benchmark in which the model must pick the most plausible continuation of a short everyday scenario.
- ARC Challenge: The hard split of the AI2 Reasoning Challenge: grade-school science questions that resist surface-level matching and require genuine reasoning.
- ARC Easy: The easier split of the same science question set, answerable with fundamental understanding and recall.
- BoolQ: A yes/no question-answering benchmark in which each question is paired with a short passage the model must read to answer.
- CommonsenseQA: An evaluation focusing on AI models’ ability to answer questions that necessitate common sense and everyday knowledge.
- MedQA: A benchmark of questions drawn from medical board exams, demanding understanding of medical terminology and clinical concepts.
- OpenBookQA: An open-book science benchmark of elementary-level questions that require combining a small set of provided science facts with broad common knowledge.
- PIQA: A physical commonsense benchmark asking how to accomplish everyday tasks with ordinary objects, testing intuition about how the physical world works.
- SocialIQA: A benchmark of social commonsense reasoning about people’s motivations, emotional reactions, and the likely consequences of everyday actions.
- TruthfulQA: A benchmark measuring whether a model answers truthfully when questions are designed to elicit common misconceptions and falsehoods.
- WinoGrande: A large-scale Winograd-schema-style benchmark in which the model must resolve an ambiguous pronoun in a sentence, a task that requires commonsense reasoning.
- TriviaQA: A benchmark evaluating AI models’ ability to answer trivia questions across a wide range of topics.
- GSM8K Chain of Thought: A benchmark of grade-school math word problems solved with chain-of-thought prompting: the model reasons step by step and is scored on its final numeric answer (a scoring sketch follows this list).
- HumanEval: A code-generation benchmark of hand-written Python programming problems; a model’s solutions are scored by running them against unit tests (see the evaluation sketch after this list).
- MBPP: Mostly Basic Python Problems, another code-generation benchmark of short, entry-level Python tasks checked against test cases.
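How GSM8K-style scoring works, in rough terms: the model writes out its reasoning, but only the final number is graded. The sketch below is a minimal illustration, assuming the common convention that reference solutions end with a "#### <answer>" marker; the example problem and outputs are made up, and real evaluation pipelines handle more edge cases.

```python
import re

def extract_final_number(text: str):
    """Pull the final number out of a chain-of-thought answer.

    Prefers the '#### 42'-style marker used in GSM8K reference
    solutions, falling back to the last number in the text.
    """
    marked = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(r"-?[\d,]+(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    """Exact-match scoring on the extracted final number only."""
    return extract_final_number(model_output) == extract_final_number(reference_answer)

# Hypothetical example: the model reasons step by step, then only the
# final number it produces is compared against the reference.
cot_output = (
    "Each pack has 6 eggs and there are 3 packs, so 6 * 3 = 18 eggs. "
    "After eating 4, 18 - 4 = 14 remain. #### 14"
)
print(is_correct(cot_output, "#### 14"))  # True
```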
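For HumanEval and MBPP, the core idea is the same: the model writes code, and that code is executed against unit tests. The official harnesses sandbox execution and report pass@k over multiple samples; the sketch below shows only the basic pass/fail step for a single sample, with a made-up problem and tests, and deliberately skips sandboxing, so it should not be run on untrusted model output.

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run a generated solution against its unit tests.

    WARNING: exec() on model output is unsafe; real harnesses isolate
    this step (separate process, timeouts, restricted imports).
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the function
        exec(test_code, namespace)        # run the assertions
        return True
    except Exception:
        return False

# Hypothetical problem in the HumanEval/MBPP style: a model-written
# solution plus the hidden unit tests that decide pass or fail.
candidate = """
def add(a, b):
    return a + b
"""
tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

# pass@1: the fraction of problems whose single sampled solution passes.
results = [passes_tests(candidate, tests)]
print(sum(results) / len(results))  # 1.0
```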
APE version
1. AGIEval: A test to see how well computers do on exams made for people, like college entrance and law school tests.
2. MMLU: A test to check if computers know facts about lots of different school subjects and can pick the right answers.
3. BigBench Hard: A hard test to see if computers can answer really tricky questions that need a lot of thinking.
4. ANLI: A test to see if computers understand how sentences are related to each other.
5. HellaSwag: A test to see if computers can guess the most sensible way a short story or scene continues.
6. ARC Challenge: A test to see if computers can answer hard science questions that need real thinking.
7. ARC Easy: A test to see if computers can answer easy science questions by remembering basic things.
8. BoolQ: A test to see if computers can read a short passage and answer yes-or-no questions about it.
9. CommonsenseQA: A test to see if computers know basic things about everyday life.
10. MedQA: A test to see if computers know about medical stuff and can answer questions about it.
11. OpenBookQA: A test to see if computers can answer science questions by putting given facts together with things they already know.
12. PIQA: A test to see if computers understand how everyday objects work and how to do simple physical tasks.
13. SocialIQA: A test to see if computers understand how people feel and act in everyday situations.
14. TruthfulQA: A test to see if computers tell the truth instead of repeating popular myths.
15. WinoGrande: A test to see if computers can figure out who or what a word like "it" or "they" refers to in a sentence.
16. TriviaQA: A test to see if computers can answer fun questions about many different things.
17. GSM8K Chain of Thought: A test to see if computers can solve grade-school math problems by working through them step by step.
18. HumanEval: A test to see if computers can write small Python programs that actually work.
19. MBPP: A test to see if computers can write simple Python programs that pass basic checks.