Understanding LLM benchmarks these days

Thongchan Thananate
4 min read · May 16, 2024


TL;DR version

  1. AGI Eval: A benchmark built from human standardized exams (college entrance, law school admission, math competitions, and similar tests) that measures a model's general reasoning and problem-solving ability.
  2. MMLU: A multiple-choice exam spanning 57 subjects, from STEM and law to the humanities, measuring the breadth and depth of a model's knowledge.
  3. BigBench Hard: A suite of the most difficult BIG-Bench tasks, selected because they demand multi-step reasoning and typically benefit from chain-of-thought prompting.
  4. ANLI: An adversarially collected natural language inference benchmark in which the model must decide whether a hypothesis is entailed by, contradicts, or is neutral toward a premise.
  5. HellaSwag: A commonsense sentence-completion benchmark in which the model picks the most plausible continuation of an everyday scenario from four candidates.
  6. ARC Challenge: The hard subset of the AI2 Reasoning Challenge, grade-school science questions that resist simple retrieval and require genuine reasoning.
  7. ARC Easy: The easier subset of the same science-question benchmark, answerable with basic factual knowledge and recall.
  8. BoolQ: A reading-comprehension benchmark of naturally occurring yes/no questions, each answered against a short Wikipedia passage.
  9. CommonsenseQA: An evaluation of multiple-choice questions that necessitate common sense and everyday knowledge.
  10. MedQA: An assessment of medical question answering drawn from professional board exams (such as the USMLE), demanding command of medical terminology and clinical concepts.
  11. OpenBookQA: Elementary-level science questions answered by combining a small "open book" of science facts with broader common knowledge.
  12. PIQA: A physical-commonsense benchmark in which the model chooses which of two procedures sensibly accomplishes an everyday goal.
  13. SocialIQA: An evaluation of social commonsense, asking about people's motivations, emotional reactions, and what is likely to happen next in social situations.
  14. TruthfulQA: A test of whether a model answers truthfully on questions deliberately designed to elicit popular misconceptions and false beliefs.
  15. WinoGrande: A large-scale Winograd-schema benchmark of pronoun-resolution problems requiring commonsense reasoning to decide what an ambiguous pronoun refers to.
  16. TriviaQA: A benchmark evaluating AI models' ability to answer trivia questions across a wide range of topics.
  17. GSM8K Chain of Thought: Grade-school math word problems requiring multi-step arithmetic, typically evaluated with chain-of-thought prompting and scored on the final numeric answer.
  18. HumanEval: A code-generation benchmark of 164 hand-written Python programming problems in which completions are judged by whether they pass unit tests.
  19. MBPP: A benchmark of roughly 1,000 entry-level Python programming tasks, each checked against test cases (a sketch of how these benchmarks are typically scored follows this list).
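
Most of the multiple-choice benchmarks above (HellaSwag, the two ARC sets, PIQA, BoolQ, WinoGrande, and so on) are scored the same way in practice: the model's log-likelihood of each candidate answer given the question or context is computed, and the highest-scoring candidate is taken as the prediction. Below is a minimal sketch of that scoring loop, assuming a Hugging Face causal model and the public HellaSwag dataset on the Hub; the model name and dataset ID are illustrative, and real harnesses such as lm-evaluation-harness handle tokenization boundaries and normalization more carefully.

```python
# Minimal sketch of how the multiple-choice benchmarks are usually scored:
# for each candidate answer, sum the model's log-probabilities of the answer
# tokens given the context, then pick the highest-scoring candidate.
# "gpt2" and the "hellaswag" dataset ID are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Row p of log_probs predicts the token at position p + 1, so the rows
    # covering the continuation tokens start at ctx_len - 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, ctx_len:]
    picked = log_probs[ctx_len - 1 :].gather(1, cont_ids.unsqueeze(1))
    return picked.sum().item()

# One HellaSwag validation item: four candidate endings, one is correct.
item = load_dataset("hellaswag", split="validation")[0]
scores = [continuation_logprob(item["ctx"], " " + ending) for ending in item["endings"]]
prediction = max(range(len(scores)), key=scores.__getitem__)
print("predicted ending:", prediction, "| gold label:", item["label"])
```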

APE version

1. AGI Eval: A test to see if computers can pass hard exams made for people, like college entrance and law exams.
2. MMLU: A test to see if computers know lots of school subjects by answering multiple-choice questions.
3. BigBench Hard: A hard test to see if computers can answer really tricky questions that need a lot of step-by-step thinking.
4. ANLI: A test to see if computers understand how sentences are related to each other (does one sentence agree with, contradict, or say nothing about the other).
5. HellaSwag: A test to see if computers can pick the ending of a short everyday story that makes the most sense.
6. ARC Challenge: A test to see if computers can answer hard science questions that need real thinking.
7. ARC Easy: A test to see if computers can answer easy science questions by remembering basic things.
8. BoolQ: A test to see if computers can read a short passage and answer a yes-or-no question about it.
9. CommonsenseQA: A test to see if computers know basic things about everyday life.
10. MedQA: A test to see if computers know about medicine and can answer doctor-exam questions.
11. OpenBookQA: A test to see if computers can answer science questions by mixing a few given facts with things they already know.
12. PIQA: A test to see if computers understand how everyday objects and actions work in the physical world.
13. SocialIQA: A test to see if computers understand why people do things and how they feel.
14. TruthfulQA: A test to see if computers tell the truth instead of repeating popular but wrong ideas.
15. WinoGrande: A test to see if computers can figure out who or what a word like "it" or "he" points to in a sentence.
16. TriviaQA: A test to see if computers can answer fun trivia questions about many different things.
17. GSM8K Chain of Thought: A test to see if computers can solve grade-school math word problems by thinking step by step (a peek at one of these problems follows this list).
18. HumanEval: A test to see if computers can write small Python programs that pass the given checks.
19. MBPP: A test to see if computers can write simple Python programs for basic everyday coding tasks.
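
Generative benchmarks like GSM8K work differently from the multiple-choice ones: the model writes out its reasoning in free text, and only the final answer is compared against the reference. Here is a small peek at one GSM8K problem and a hedged sketch of that final-answer check, assuming the common Hugging Face dataset ID "gsm8k"; the model output shown is a made-up placeholder.

```python
# A peek at one GSM8K item and the usual way answers are checked: the model
# writes out step-by-step reasoning, and only the final number after "####"
# is compared with the reference. The "gsm8k" dataset ID is the common
# Hugging Face one; the model output below is a made-up placeholder.
from datasets import load_dataset

problem = load_dataset("gsm8k", "main", split="test")[0]
print(problem["question"])  # a grade-school word problem in plain English
print(problem["answer"])    # a worked solution that ends with "#### <number>"

def final_answer(solution: str) -> str:
    """Pull the final numeric answer out of a GSM8K-style solution string."""
    return solution.split("####")[-1].strip().replace(",", "")

reference = final_answer(problem["answer"])
model_output = "...step-by-step reasoning from the model... #### 42"  # hypothetical
print("correct" if final_answer(model_output) == reference else "wrong")
```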
