How Good is an LLM Judge?
We followed in the footsteps of a top AI YouTuber
One of my favorite AI YouTubers is Matt Berman. Matt has posted >250 videos and amassed >240k subscribers in about a year. Impressive! Every time a new model comes out, Matt tests it by asking a few questions that are easy for people but hard for LLMs. Here are a few examples:
How many words are in your response to this prompt?
Give me 10 sentences that end in the word apple.
John and Mark are in a room with a ball, a basket and a box. John puts the ball in the box then leaves for work. While John is away, Mark puts the ball in the basket and then leaves for school. They both come back later and they don't know what happened. Where do they think the ball is?
The first question is difficult for models because they predict one word at a time (technically, one token at a time). When they start writing, they have no idea how many words the finished response will contain.
The second question also tests a model’s ability to plan. Most models can easily write a single sentence that ends with the target word, but lose the thread after 2-3 sentences.
The third requires a bit of empathy. You need to get into the head of the two characters and think through what they separately know. Easy for you and me. Not so for a stochastic parrot.
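To make the first point concrete, here is a toy sketch of token-by-token generation. It is not a real language model, just an illustration of the constraint: each word is committed to before the next one is chosen, so the total word count is only known once the response is finished.

```python
import random

# Toy stand-in for a language model's next-token step: a real LLM samples
# from a learned probability distribution, but the constraint is the same.
VOCAB = ["the", "answer", "has", "some", "words", "in", "it", "<eos>"]

def generate(max_words: int = 20) -> list[str]:
    words: list[str] = []
    while len(words) < max_words:
        next_word = random.choice(VOCAB)  # commit to one word at a time
        if next_word == "<eos>":          # length is only known once the
            break                         # end-of-sequence token is emitted
        words.append(next_word)
    return words

response = generate()
# Only after generation finishes do we know the count the question asks about.
print(len(response), " ".join(response))
```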
Matt’s questions make for a good LLM benchmark:
They are reasonably objective: there’s a clear right answer.
They are hard enough to challenge SOTA models.
They are less likely to suffer from contamination than well-known benchmarks like MMLU or BBH. Contamination is when a benchmark's questions and answers end up in a model's training data, a bit like cheating on a test.
With Matt’s permission, we automated his benchmark so that we could apply it to any new model as it comes out; a rough sketch of such a harness appears below. We started with six of the top models on the LMSys Chatbot Arena leaderboard.
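To make the automation concrete, here is a minimal harness sketch in Python. The model names and the ask_model helper are placeholders for illustration, not RagMetrics code; the point is simply that the benchmark reduces to a loop over questions and models whose answers are collected for grading.

```python
# Minimal benchmark harness sketch. ask_model() is a placeholder for
# whichever LLM client you actually use (OpenAI, Anthropic, a local model, ...).
QUESTIONS = [
    "How many words are in your response to this prompt?",
    "Give me 10 sentences that end in the word apple.",
    # ... the remaining Berman questions (see Appendix 1)
]
MODELS = ["generator-model-a", "generator-model-b"]  # hypothetical identifiers

def ask_model(model: str, question: str) -> str:
    """Send one question to one model and return its answer (stub)."""
    raise NotImplementedError("wire up your LLM client here")

def run_benchmark() -> dict[str, list[tuple[str, str]]]:
    """Collect (question, answer) pairs for every model, ready for grading."""
    results: dict[str, list[tuple[str, str]]] = {}
    for model in MODELS:
        results[model] = [(q, ask_model(model, q)) for q in QUESTIONS]
    return results
```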
Here are our results: OpenAI’s GPT4 got 82% of the questions right, taking the lead. Llama3 and Mixtral tied for second. Google’s Gemini Pro 1.5 was surprisingly weak, getting only half the questions right. GPT3.5 came in last. The chart also shows 95% confidence intervals for each generator model.
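For readers who want to reproduce error bars like these, a standard choice for a pass/fail benchmark with only eleven questions is a binomial interval such as the Wilson score interval, sketched below. This is an illustration of the statistics, not necessarily the exact method behind the chart.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass/fail accuracy estimate."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - half, center + half)

# Example: 9 of 11 questions correct is roughly the 82% headline figure;
# with so few questions the interval is necessarily wide.
print(wilson_interval(9, 11))
```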
We thought this was a good first step, but humans still need to grade the answers, which would limit our scalability. What if we ask LLMs to grade themselves?
We tried five different judge models against the six generator models above. We compared their grades against the human grades. As a last step, we tried the same LLM judges within the ragmetrics.ai framework:
LLM judges achieve pretty good agreement with human judges out of the box. For example, GPT4’s grades match human grades 86.4% of the time, slightly higher than the 80% agreement rate found in a prior study. The results get even better with RagMetrics, our platform for model evaluation.
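For reference, an LLM judge in its simplest form is just another model call: the question, a reference answer, and the candidate answer go into a grading prompt, and agreement with human graders is the fraction of answers where the two verdicts match. The prompt wording and helpers below are an illustrative sketch, not the RagMetrics implementation.

```python
from typing import Callable

# Illustrative grading prompt; real judge prompts are usually more detailed.
JUDGE_PROMPT = """You are grading an answer to a test question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_grade(call_model: Callable[[str, str], str], judge_model: str,
                question: str, reference: str, candidate: str) -> bool:
    """Ask a judge model for a pass/fail grade; call_model(model, prompt) returns text."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    return call_model(judge_model, prompt).strip().upper().startswith("CORRECT")

def agreement_rate(llm_grades: list[bool], human_grades: list[bool]) -> float:
    """Share of answers where the LLM judge's grade matches the human grade."""
    matches = sum(a == b for a, b in zip(llm_grades, human_grades))
    return matches / len(human_grades)
```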
If you are building an LLM application and would like to automate your evaluation, reach out to us at ragmetrics.ai for expert assistance.
Appendix 1: Question List
Here are the questions included in the Berman benchmark:
If we lay five shirts out in the sun and it takes 4 hours to dry, how long would 20 shirts take to dry? Explain your reasoning step by step.
Jane is faster than Joe. Joe is faster than Sam. Is Sam faster than Jane? Explain your reasoning step by step.
4 + 4 equals.
25 - 4 * 2 + 3 equals.
How many words are in your response to this prompt?
There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left?
Create JSON for the following: There are three people, two males. One is named Mark. Another is named Joe and a third person who's a woman named Sam. The woman is aged 30 and the two men are both 19.
Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on the table. Someone then takes the cup and puts it inside the microwave. Where's the marble now?
John and Mark are in a room with a ball, a basket and a box. John puts the ball in the box then leaves for work. While John is away, Mark puts the ball in the basket and then leaves for school. They both come back later and they don't know what happened. Where do they think the ball is?
Give me 10 sentences that end in the word apple.
It takes one person 5 hours to dig a 10-ft hole in the ground. How long would it take 50 people to dig a single 10-ft hole?
Appendix 2: Example Question/Answer/Grade
Here is an example of how an LLM generator model answers a question, and how an LLM judge grades that answer:
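(The record below is a hypothetical illustration of that format; the answer, grades, and rationale are invented for this sketch, not taken from an actual run.)

```python
# Hypothetical question/answer/grade record, invented for illustration only.
example_record = {
    "question": (
        "Jane is faster than Joe. Joe is faster than Sam. "
        "Is Sam faster than Jane? Explain your reasoning step by step."
    ),
    "generator_answer": (
        "Jane is faster than Joe, and Joe is faster than Sam, so by "
        "transitivity Jane is faster than Sam. Therefore Sam is not faster than Jane."
    ),
    "human_grade": "CORRECT",
    "judge_grade": "CORRECT",
    "judge_rationale": "The answer applies transitivity correctly and reaches the right conclusion.",
}
```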
Do you want more information about How to Use and Implement LLM Judges? Contact us at [email protected]