Understanding LLM benchmarks these days

Thongchan Thananate
4 min read · May 16, 2024


TL;DR version

  1. AGIEval: A benchmark built from standardized human exams (SAT, LSAT, math competitions, bar exams, and similar tests) that gauges a model’s general reasoning and problem-solving ability.
  2. MMLU: Massive Multitask Language Understanding, a multiple-choice benchmark spanning 57 subjects from elementary mathematics to law and medicine, measuring breadth of knowledge and reasoning.
  3. BigBench Hard: The hardest subset of the BIG-Bench suite, containing intricate, multi-step questions where models generally need chain-of-thought reasoning to succeed.
  4. ANLI: Adversarial NLI, a natural language inference benchmark whose examples were written adversarially against strong models; it tests whether a model can tell if one sentence entails, contradicts, or is neutral toward another.
  5. HellaSwag: A commonsense benchmark in which the model must choose the most plausible continuation of a short scenario, with distractor endings filtered to fool models but not humans.
  6. ARC Challenge: The harder split of the AI2 Reasoning Challenge, grade-school science questions that resist simple retrieval and demand genuine reasoning.
  7. ARC Easy: The easier split of the same science-question set, answerable with basic knowledge and recall.
  8. BoolQ: Yes/no questions drawn from real search queries, each paired with a Wikipedia passage the model must read to answer.
  9. CommonsenseQA: An evaluation focusing on AI models’ ability to answer questions that necessitate common sense and everyday knowledge.
  10. MedQA: An assessment evaluating AI models’ proficiency in addressing medical-related questions, demanding understanding of medical terminology and concepts.
  11. OpenBookQA: Elementary-level science questions paired with a small “open book” of science facts; answering requires combining those facts with broader common knowledge.
  12. PIQA: Physical Interaction QA, a benchmark of everyday physical commonsense where the model picks the more sensible of two ways to accomplish a mundane goal.
  13. SocialIQA: A benchmark of social commonsense reasoning, asking about people’s motivations, emotional reactions, and what is likely to happen next in everyday situations.
  14. TruthfulQA: A test built around questions that commonly elicit misconceptions, measuring whether a model answers truthfully rather than repeating popular falsehoods.
  15. WinoGrande: A large-scale set of Winograd-schema-style fill-in-the-blank problems that require commonsense reasoning to resolve an ambiguous pronoun or referent.
  16. TriviaQA: A benchmark evaluating AI models’ ability to answer trivia questions across a wide range of topics.
  17. GSM8K Chain of Thought: Grade-school math word problems that require multi-step arithmetic reasoning; models are typically evaluated with chain-of-thought prompting so they write out their steps before giving a final answer (a small sketch of this setup follows the list).
  18. HumanEval: A code-generation benchmark in which the model completes Python functions from their docstrings; a completion counts as correct only if it passes the problem’s unit tests (pass@k).
  19. MBPP: Mostly Basic Python Problems, roughly a thousand entry-level Python programming tasks, scored like HumanEval by running the generated code against test cases (a minimal scoring sketch also follows the list).
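
To make item 17 concrete, here is a minimal sketch of what a chain-of-thought run over GSM8K might look like, assuming the Hugging Face datasets package for loading the data. The generate_text function is a hypothetical stand-in for whatever model API you actually call, and the answer-extraction step is deliberately simplified.

```python
# A minimal sketch of chain-of-thought evaluation on GSM8K.
# Assumes the Hugging Face `datasets` package is installed; `generate_text`
# below is a hypothetical placeholder for whatever model API you use.
import re
from datasets import load_dataset


def generate_text(prompt: str) -> str:
    """Hypothetical placeholder: swap in your model's completion call."""
    raise NotImplementedError("plug in your LLM API here")


def build_cot_prompt(few_shot, question):
    """Prepend worked examples so the model writes out its reasoning."""
    parts = [f"Q: {ex['question']}\nA: {ex['answer']}\n" for ex in few_shot]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n".join(parts)


def extract_final_number(text):
    """GSM8K gold answers end with '#### <number>'; model output is free
    text, so fall back to the last number that appears in it."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None


ds = load_dataset("gsm8k", "main")
few_shot = list(ds["train"].select(range(4)))   # 4 worked examples

correct = 0
sample = ds["test"].select(range(50))           # small slice for illustration
for row in sample:
    prompt = build_cot_prompt(few_shot, row["question"])
    output = generate_text(prompt)              # model writes steps + answer
    gold = row["answer"].split("####")[-1].strip().replace(",", "")
    if extract_final_number(output) == gold:
        correct += 1

print(f"accuracy: {correct / len(sample):.2%}")
```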

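Items 18 and 19 are scored by functional correctness rather than by comparing text: the generated program is executed against the benchmark’s unit tests. The sketch below shows that idea in its simplest form; the problem and tests are made up for illustration, and real harnesses run this step in a sandbox because executing untrusted model output directly is unsafe.

```python
# Simplified sketch of HumanEval/MBPP-style scoring: run the generated
# code against the benchmark's test cases and count it as correct only
# if every assertion passes. Real harnesses sandbox this step.

def passes_tests(candidate_code: str, test_code: str) -> bool:
    namespace = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the benchmark's assertions
        return True
    except Exception:
        return False

# Illustrative problem in the MBPP style (not an actual benchmark item).
candidate = """
def add_two(a, b):
    return a + b
"""
tests = """
assert add_two(2, 3) == 5
assert add_two(-1, 1) == 0
"""

print(passes_tests(candidate, tests))  # True if the completion is correct
```
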
APE version

1. AGI Eval: A test made from real human exams (like the SAT or bar exam) to see how well computers can reason and solve problems.
2. MMLU: A big multiple-choice quiz covering lots of school and professional subjects, to see how much computers know and understand.
3. BigBench Hard: A hard test to see if computers can answer really tricky questions that need a lot of thinking.
4. ANLI: A test to see if computers understand how sentences are related to each other.
5. HellaSwag: A test to see if computers can pick the ending of a short everyday scenario that actually makes sense.
6. ARC Challenge: A test of hard grade-school science questions that need real thinking, not just memory.
7. ARC Easy: A test of easier science questions that mostly need remembering basic facts.
8. BoolQ: A test to see if computers can read a short passage and answer a yes-or-no question about it.
9. CommonsenseQA: A test to see if computers know basic things about everyday life.
10. MedQA: A test to see if computers know about medical stuff and can answer questions about it.
11. OpenBookQA: A test of simple science questions where the computer gets a small “book” of facts and has to combine them with what it already knows.
12. PIQA: A test to see if computers understand everyday physical stuff, like which of two ways to do a chore actually works.
13. SocialIQA: A test to see if computers understand social situations, like why people do things and how they will feel.
14. TruthfulQA: A test to see if computers tell the truth instead of repeating popular myths.
15. WinoGrande: A test to see if computers can figure out who or what a tricky sentence is really talking about.
16. TriviaQA: A test to see if computers can answer fun questions about many different things.
17. GSM8K Chain of Thought: A test to see if computers can solve grade-school math word problems by working through the steps one at a time.
18. HumanEval: A test to see if computers can write small Python programs that actually pass the tests.
19. MBPP: A test with lots of simple Python programming problems, to see if the computer’s code really works.
