Which AI Model Should You Use? (Check Benchmarks)

by Rutkat, April 19th, 2025

Too Long; Didn't Read

Learn the most common benchmark scores for AI model accuracy, then choose a model that fits your needs.


There are more AI models than what you see in the news and on social media. There are hundreds, including open-source models, private ones, and the tech giants’ own: Gemini, Claude, OpenAI, Grok, DeepSeek. What is a model, really? Is it just a black box of data? Almost! You can think of it as a zip file of the internet with a bit of C++ code that communicates with the zip file. I credit this analogy to Andrej Karpathy; I’m not sure the idea originated with him, but he is a genuine industry expert.


An AI model is a neural network trained on a massive set of data to recognize specific patterns. Now is the time to take advantage of these models and to choose wisely, whether for business, personal assistance, or creative work. This guide is not about “model training”; it is geared toward individuals new to the field of AI who want a better understanding and to leverage the technology. The goal is to build with AI rather than be overrun by it, so after reading this guide you should understand the general concepts, how models are used, and how their accuracy is measured.


This guide covers the following topics, so you can skip to any section; but if you’re a beginner, read the entire article:

  1. Category of Models
  2. Corresponding Tasks of Models
  3. Naming Convention of Models
  4. Accuracy Performance of Models
  5. Benchmark References


If you’re a beginner, or have only just heard of the popular tools, take note that there isn’t one multi-use-case model that does everything you ask of it. From the interface, it may appear that you are just typing to a chatbot, but a lot more is being executed in the background. Business analysts, product managers, and engineers adopting AI can identify their objective and select from a category of AI models.


Here are 4 categories of models among many:

  • Natural Language Processing (general)
  • Generative (Image, Video, Audio, Text, Code)
  • Discriminative (Computer Vision, Text Analysis)
  • Reinforcement Learning


While most models specialize in one category, others are multi-modal, with differing levels of accuracy across tasks. Every model has been trained on specific data and can therefore perform specific tasks related to the data it was trained on. Here’s a list of common tasks each category of model can do:

Natural Language Processing

Enables computers to interpret, understand, and generate natural human language using tokenization and statistical models. Chatbots are the classic example, and the most common one is ChatGPT, whose “GPT” stands for “generative pre-trained transformer”. Most modern language models are, in fact, pre-trained transformers.
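
To make tokenization concrete, here is a minimal sketch that assumes the Hugging Face transformers package is installed and uses GPT-2’s tokenizer as an arbitrary example. It prints the same sentence first as subword tokens, then as the integer IDs a model actually consumes:

```python
# Minimal tokenization demo (assumes: pip install transformers)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2's BPE tokenizer

text = "Which AI model should you use?"
print(tokenizer.tokenize(text))  # subword tokens, e.g. ['Which', 'ĠAI', ...]
print(tokenizer.encode(text))    # the integer IDs the model consumes
```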

Generative (Image, Video, Audio, Text, Code)

A well-known generative architecture is the Generative Adversarial Network (GAN), which pits two sub-models, a generator and a discriminator, against each other. Realistic imagery, audio, text, and code can be produced based on the vast amounts of data these models were trained on. Stable Diffusion, the most popular approach for generating images and video, uses a related but different technique: a diffusion model that iteratively denoises random noise into an image.
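
To see the generator/discriminator tug-of-war in miniature, here is a toy GAN sketch that assumes PyTorch is installed. The generator learns to emit numbers resembling samples from a normal distribution centered at 4; the discriminator learns to tell real samples from fakes:

```python
# Toy 1-D GAN (assumes: pip install torch); illustrative, not production code.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(32, 1) * 1.25 + 4.0  # "real" data: N(4, 1.25)
    fake = G(torch.randn(32, 8))             # generated data
    # 1) Train the discriminator to separate real from fake.
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 2) Train the generator to fool the discriminator.
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward ~4.0
```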

Discriminative (Computer Vision, Text Analysis)

These use algorithms designed to learn the boundaries between classes of data for decision-making. Examples include sentiment analysis, optical character recognition (OCR), and image classification.
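
For a hands-on example of a discriminative task, the sketch below assumes the transformers package and uses its pipeline helper, which downloads a small default sentiment model on first run:

```python
# Sentiment analysis, a classic discriminative task (assumes: pip install transformers)
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model on first run
print(classifier("This model's benchmark scores are impressive."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```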

Reinforcement Learning

Uses trial-and-error methods and reward signals, sometimes with human feedback, to produce goal-oriented outcomes, such as robotics, game playing, and autonomous driving.
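
Here is a minimal tabular Q-learning sketch of that trial-and-error loop, using a made-up five-state corridor environment in which repeatedly moving right earns the reward:

```python
# Tabular Q-learning in a hypothetical 5-state corridor; standard library only.
import random

n_states, n_actions = 5, 2  # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    # Hypothetical environment: reaching the rightmost state pays 1.0.
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    for _ in range(20):
        # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state, reward = step(state, action)
        # The Q-learning update rule.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # "move right" (action 1) should score higher in every state
```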

Naming Convention of Models

Now that you understand the types of models and their tasks, the next step is to gauge model quality and performance. That begins with the model’s name, so let’s break down model naming. There is no single official convention for naming AI models; the most popular ones simply use a name followed by a version number, such as ChatGPT #, Claude #, Grok #, Gemini #.


However, the smaller open-source and task-specific models will have longer names. You can see this on huggingface.co, where a model ID contains the organization name, the model name, the version, the parameter count, and other qualifiers such as a release date.


Let’s elaborate with examples:


mistralai/Mistral-Small-3.1-24B-Instruct-2503

  1. Mistralai is the organization
  2. Mistral-Small is the model name
  3. 3.1 is the version number
  4. 24B is the parameter count: 24 billion trained parameters
  5. Instruct means the model is fine-tuned to follow instructions
  6. 2503 is the release date in YYMM form (March 2025)


google/gemma-3-27b

  1. Google is the organization
  2. Gemma is the model name
  3. 3 is the version number
  4. 27B is the parameter size in billions
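
Putting the two examples together, here is an illustrative parser for such model IDs. The field order is a heuristic inferred from the examples above, not an official specification:

```python
# Heuristic parser for Hugging Face-style model IDs; illustrative only.
import re

def parse_model_id(model_id: str) -> dict:
    org, name = model_id.split("/", 1)
    info = {"organization": org}
    for part in name.split("-"):
        if re.fullmatch(r"\d+(\.\d+)?[bB]", part):
            info["parameters"] = part.upper()      # e.g. 24B, 27B
        elif re.fullmatch(r"\d+(\.\d+)?", part):
            info.setdefault("version", part)       # first bare number = version
        elif part.lower() == "instruct":
            info["instruction_tuned"] = True
    return info

print(parse_model_id("mistralai/Mistral-Small-3.1-24B-Instruct-2503"))
# {'organization': 'mistralai', 'version': '3.1', 'parameters': '24B', 'instruction_tuned': True}
print(parse_model_id("google/gemma-3-27b"))
# {'organization': 'google', 'version': '3', 'parameters': '27B'}
```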


An additional detail, which you will see and need to know, is the quantization precision in bits. The higher the bit width, the more computer RAM and storage are required to run the model. Precision is written as a bit width such as 4, 6, 8, or 16 (16-bit weights are usually floating point, while lower widths are usually integer formats). You will also encounter format names such as GPTQ, NF4, and GGML (now largely superseded by GGUF), which indicate quantization schemes suited to specific hardware configurations.
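
The RAM impact is easy to estimate: the memory needed for the weights alone is roughly the parameter count multiplied by bits per weight, divided by 8 to get bytes. A quick sketch (activations, KV cache, and framework overhead add more on top):

```python
# Back-of-the-envelope weight-memory estimate: params x bits / 8 = bytes.
def weight_memory_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # billions of params x bits / 8 = gigabytes

for bits in (16, 8, 4):
    print(f"24B model at {bits}-bit: ~{weight_memory_gb(24, bits):.0f} GB")
# 16-bit: ~48 GB, 8-bit: ~24 GB, 4-bit: ~12 GB
```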

Accuracy Performance of Models

If you’ve seen news headlines about a new model release, do not immediately trust the claimed results. The AI race is so competitive right now that companies cook up performance numbers for marketing hype. How many people will test the models themselves instead of trusting the press release? Not many at all, so don’t fall for the “hallucinated figures”. References: https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/ and https://lmarena.ai/?leaderboard


The real way to determine model quality is to check benchmark scores and leaderboards. Several tests are semi-standardized, or arguably fully standardized, but in reality we are testing “black boxes” with an enormous number of variables. The best measure is to check the AI’s answers against facts and other scientific sources.


Leaderboard websites show sortable rankings with vote counts and confidence-interval scores, usually expressed as percentages. The common benchmarks are tests that prompt the AI model with questions and measure the answers. They include: AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, WinoGrande, GSM8K, and HumanEval.
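
Arenas such as lmarena.ai actually use Elo-style ratings computed from head-to-head votes, but the basic idea of a vote-based score with a confidence interval can be sketched with a simple normal approximation (the vote counts below are made up):

```python
# Win rate with a 95% confidence interval (normal approximation); toy numbers.
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96):
    p = wins / total
    margin = z * math.sqrt(p * (1 - p) / total)  # standard error times z-score
    return p, max(0.0, p - margin), min(1.0, p + margin)

p, lo, hi = win_rate_ci(wins=560, total=1000)
print(f"win rate {p:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
# win rate 56.0%, 95% CI [52.9%, 59.1%]
```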


Here are brief descriptions of the benchmarking methods:


AI2 Reasoning Challenge (ARC) – 7,787 grade-school, multiple-choice science questions

HellaSwag – commonsense reasoning exercises through sentence completion

MMLU – Massive Multitask Language Understanding: knowledge and problem solving across 57 subjects

TruthfulQA – assesses truthfulness with questions designed to elicit common falsehoods, rewarding informative answers over evasions like “I’m not sure”

WinoGrande – a Winograd-schema challenge built from pairs of near-identical sentences that differ by a trigger word

GSM8K – roughly 8,500 grade-school math word problems

HumanEval – measures the ability to generate correct Python code across 164 challenges
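
Mechanically, most of these benchmarks reduce to the same loop: prompt the model, compare its answer to an answer key, and report accuracy. A minimal sketch, where ask_model is a hypothetical stand-in for whatever model API you use:

```python
# Skeleton of a multiple-choice benchmark run; ask_model is hypothetical.
questions = [
    {"prompt": "Which gas do plants absorb? A) O2 B) CO2", "answer": "B"},
    {"prompt": "What is 7 x 8? A) 54 B) 56", "answer": "B"},
]

def ask_model(prompt: str) -> str:
    # Replace with a real call (OpenAI, Anthropic, a local model, ...).
    return "B"

correct = sum(ask_model(q["prompt"]).strip().startswith(q["answer"]) for q in questions)
print(f"accuracy: {correct / len(questions):.0%}")  # accuracy: 100%
```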


Leaderboard websites for reference:

https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison

https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

https://scale.com/leaderboard

https://artificialanalysis.ai/leaderboards/models

https://epoch.ai/data/notable-ai-models

https://openlm.ai/chatbot-arena/

https://lmarena.ai/?leaderboard


