Evaluating RAG Using LLM-as-a-Judge: Independent Evaluation, Automatic Insights Collection

The entire approach is open-sourced and is accessible as a GitHub repository.

Evaluation?

In my latest article, “Chatting with Large PDFs,” I discussed the technical aspects of implementing a RAG for your PDFs to ensure the model has all the necessary knowledge to answer questions about those PDFs, regardless of their size. 

From what our experts could tell, RAG performed well; it retrieved the necessary context and generated visually appealing answers, referencing and quoting specific paragraphs of the PDFs.

But wait, how well? How would we know which part of the RAG pipeline is struggling? How do we iterate over RAG improvements? How can we determine if one RAG iteration is statistically superior?

Evaluation!

I daresay it's impossible to develop RAG further without answering all the questions above. And only one thing can help with that; its name is "Evaluation." Nowadays, it's relatively easy to implement RAG for yourself; perhaps you could even pick a pre-baked MCP for that or develop one using some fancy-schmancy agentic libraries everyone talks about. Whichever approach you choose, once your RAG is up and running, face it: there are no shortcuts around Evaluation.

Yes, I’ve heard of Ragas, a RAG evaluation framework, but I’ve also heard of its low quality and unheard-of greed for tokens. One Reddit user reported that Ragas consumed $500 worth of tokens on a single pass, although he didn’t share how many documents, questions, etc., his eval dataset consisted of; $500 is a lot, nonetheless. Spoiler alert: the approach presented in the article costs approximately $3, assuming the dataset consists of 10 PDFs, each with 10 pages and 19 questions per file. The cost could also be adjusted on demand by choosing different LLMs.

With all that said, chances are there's a need to develop Evaluation from scratch.

Evaluation of RAG is always case-centric: align on what's expected from RAG, then compare "expected" to "real." Subtract one from the other, and you get a metric.

What is our case at COXIT? Imagine a person, not really a computer savant, who was previously shown how to use AI chats and is eager to try them on daily tasks. Now, that person uploaded a PDF: a small one, a large one, or a really large one; RAG consumed that PDF, digested it, vectorized it, and indexed it in roughly a minute, depending on document size. Here comes a question from the user: a composite one, with numerous technical details about processes, measurements, technologies, and manufacturers, all in one single query.
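If you're curious what that ingestion step looks like in practice, here is a minimal sketch, assuming OpenAI embeddings and naive fixed-size chunking; the chunk sizes, helper names, and in-memory index are illustrative assumptions, not COXIT's actual pipeline (that one is covered in the previous article).

```python
# Minimal ingestion sketch: chunk a PDF's text, embed the chunks, keep them in memory.
# Assumptions: the OpenAI Python SDK; PDF text extraction happens elsewhere.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def index_pdf(pdf_text: str) -> list[dict]:
    """Embed every chunk and return a list of {chunk, embedding} records."""
    chunks = chunk_text(pdf_text)
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [{"chunk": c, "embedding": e.embedding} for c, e in zip(chunks, response.data)]
```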

As it's our case, not somebody else's, we can assume that an LLM of our choice, given the whole document as context, will be able to produce a perfect response: the one that the user expects and will be happy with. We assume that because we know the average file in the dataset is 10-30 pages long, which is not large enough to cause the model to lose track of the context. That perfect response, in other words the "Golden" response, will be our target. It answers all the questions in the user's query and explicitly states that the remaining questions cannot be answered, as they aren't mentioned in the document. Once again: subtract the Golden Response from the RAG response, and out comes a metric.

Evaluation. How?

That "subtraction" produces a target metric, and it can be computed in many ways; LLM-as-a-Judge (explored in "Judging LLM-as-a-Judge" and "A Survey on LLM-as-a-Judge") is among them, and it's the one I chose to evaluate RAG in our case.

LLM-as-a-Judge is a fully automated evaluation technique that uses LLMs to evaluate other LLMs' responses. Its two primary approaches are Comparative Evaluation (Golden Answer vs. Predicted Answer) and Independent Evaluation, where Golden and Predicted Answers are each evaluated on their own. In this article, I will focus on Independent Evaluation; however, both independent and comparative evaluations are necessary for a well-balanced LLM-as-a-Judge application.

Figure 1: LLM-as-a-Judge variations

Golden Answers, Predicted Answers, where do they all come from? 

Both of them are derived from questions. And questions are the tricky part of this story. You can have them beforehand, as we did, or you can generate a set of questions using an LLM. As with Comparative and Independent Evaluation, both real-world, user-composed and machine-generated questions are needed for a fair and balanced evaluation. However, once again, machine-generated questions won't be covered or further mentioned in this article, as they would require significant additional time.

For the combination of (Golden Answers, user-composed questions) within a single document that fits comfortably into the model's context, golden answers are easy to obtain. Simply feed the model the document, embed the question into a prompt of your choice, and you will receive a golden answer for each question in your dataset. Token spending is significant in this case; however, you only need to generate golden answers once per dataset, so it isn't a big deal.

Figure 2: Example conversation with Golden Answer generation
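In code, that conversation boils down to a single call with the whole document in the context. A minimal sketch, assuming the OpenAI Python SDK; the model name, the way `pdf_text` is obtained, and the prompt wording are placeholders, not the exact prompt we use:

```python
# Golden Answer generation sketch: the whole document goes into the context,
# so no retrieval is involved. Model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

GOLDEN_SYSTEM_PROMPT = (
    "You are given the full text of a PDF. Answer the user's question using only "
    "that text. If something cannot be answered from the document, say so explicitly."
)

def generate_golden_answer(pdf_text: str, question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": GOLDEN_SYSTEM_PROMPT},
            {"role": "user", "content": f"Document:\n{pdf_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```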

Interestingly enough, for Predicted Answers (the RAG answers) things are more straightforward; we just let the Chat chat with itself, emulating what happens when a real person sends a question in a chat. Along with the system and user message (our question), the chat receives a list of tools it can use to retrieve context. Limited by chat_turns_left, the chat can retrieve context freely until it is either ready to answer the user's question or out of turns. Once out of turns, we simply forbid the chat from calling any more tools. Rinse and repeat for each file and each question.

Figure 3: Example conversation with Predicted Answer generation
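A minimal sketch of that loop, assuming the OpenAI tool-calling API; the search_document tool, the retrieve callback, and the default chat_turns_left value are illustrative assumptions rather than our exact setup:

```python
# Predicted Answer generation sketch: the chat may call a retrieval tool until it
# either answers or runs out of turns; then tool calls are forbidden.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_document",
        "description": "Retrieve relevant chunks of the indexed PDF for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def generate_predicted_answer(question: str, retrieve, chat_turns_left: int = 5,
                              model: str = "gpt-4o") -> str:
    messages = [
        {"role": "system", "content": "Answer the user's question about the uploaded PDF."},
        {"role": "user", "content": question},
    ]
    while True:
        # Once the turn budget is spent, forbid any further tool calls.
        tool_choice = "auto" if chat_turns_left > 0 else "none"
        response = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS, tool_choice=tool_choice,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # the chat is ready to answer
        messages.append(message)
        for call in message.tool_calls:
            query = json.loads(call.function.arguments)["query"]
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": retrieve(query),  # `retrieve` is the RAG retrieval function
            })
        chat_turns_left -= 1
```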

Ultimately, we end up with a (Golden Answer, Predicted Answer) pair for each combination of (Question, PDF) in the dataset. Those are ready to be evaluated!
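For bookkeeping, one possible shape for those records; the field names are purely illustrative:

```python
# One possible shape for an evaluation record; field names are illustrative.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    pdf_name: str
    question: str
    golden_answer: str
    predicted_answer: str
```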

Evaluation. Evaluation

In a previous section, I mentioned that although both Comparative and Independent evaluations are suggested for well-rounded LLM-as-a-Judge evaluation, we will only cover Independent evaluation in this article.

Remember when I mentioned earlier that in our use case users provide not only simple questions but also queries containing multiple questions? For evaluation, we evaluate the individual questions split out of such a query. Let's call those Split Questions: derivatives of user queries.
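However the split is actually done in practice (the article doesn't prescribe a method), one possible LLM-based sketch looks like this; the prompt and model name are assumptions:

```python
# Split Questions sketch: ask the model to break a composite user query into
# standalone questions, returned one per line. Prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()

def split_query(user_query: str, model: str = "gpt-4o-mini") -> list[str]:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Split the user's query into standalone questions, one per line. Output nothing else."},
            {"role": "user", "content": user_query},
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]
```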

For the Independent Evaluation of each (Golden Answer, Predicted Answer) pair, each answer is evaluated independently: an LLM call with a specialized system prompt receives the question text and the answer text and is asked to assign 4 boolean values: is_question_answered, requires_additional_information, is_speculative, is_confident. I believe that asking the model to assign a numeric value on a scale of 0 to 10 is useless; boolean values are much more honest.

Since we operate on Split Questions in evaluation, for each user query in the initial evaluation dataset we receive N evaluation results: N arrays of 4 boolean values. Those 4 boolean values are then collapsed into a single target boolean, comprehensive_answer, which is true when the question is answered, doesn't require additional information, is not speculative, and is confident.

The resulting arrays of boolean comprehensive_answer values must have exactly the same dimensions for the Golden and Predicted Answers.

Figure 4: System prompt to evaluate Golden / Predicted Answer
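In code, one judge call per answer might look like this; the system prompt below is a stand-in for the real one shown in Figure 4, and JSON mode is an assumption about the output format:

```python
# Independent Evaluation sketch: one judge call per answer, returning the 4 booleans,
# then collapsing them into comprehensive_answer.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "You are an impartial judge. Given a question and an answer, return a JSON object "
    "with boolean fields: is_question_answered, requires_additional_information, "
    "is_speculative, is_confident."
)

def judge_answer(question: str, answer: str, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    labels = json.loads(response.choices[0].message.content)
    labels["comprehensive_answer"] = (
        labels["is_question_answered"]
        and not labels["requires_additional_information"]
        and not labels["is_speculative"]
        and labels["is_confident"]
    )
    return labels
```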

Evaluation. Metrics

Even though we're past Golden and Predicted Answer generation and the LLM-as-a-Judge evaluation, it's not enough to call it a day. The next step is to collect metrics and decide whether they are statistically significant. In our case, I chose the following metrics (a sketch of how to compute them follows the list):

  • Accuracy: Percentage of correct predictions overall
  • Precision: Percentage of true positives among all positive predictions
  • Recall: Percentage of true positives captured from all actual positives
  • F1: Harmonic mean of precision and recall
  • Cohen's Kappa: Agreement between predictions and gold standard beyond random chance
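All five can be computed with scikit-learn, treating the Golden comprehensive_answer flags as the gold standard and the Predicted ones as the predictions:

```python
# Metrics sketch: Golden comprehensive_answer flags act as the gold standard,
# Predicted comprehensive_answer flags act as the predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

def compute_metrics(golden_flags: list[bool], predicted_flags: list[bool]) -> dict:
    return {
        "accuracy": accuracy_score(golden_flags, predicted_flags),
        "precision": precision_score(golden_flags, predicted_flags, zero_division=0),
        "recall": recall_score(golden_flags, predicted_flags, zero_division=0),
        "f1": f1_score(golden_flags, predicted_flags, zero_division=0),
        "cohens_kappa": cohen_kappa_score(golden_flags, predicted_flags),
    }
```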

Specifically, because we didn't have a lot of questions per file (19 user queries that were later transformed into ~67 Split Questions), I needed to add confidence intervals for each metric. Confidence intervals are even more critical when looking at metrics at the file or question level instead of the overall level. They are fairly easy to interpret: a 95% interval of, say, 0.2 to 0.6 means we can be 95% confident that the metric's true value lies in that range. The lower bound of the interval also shows the minimum performance we can count on from an approach: if the lower bound for accuracy is 0.7, the approach almost never produces results with accuracy below 0.7.
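The article doesn't prescribe a specific CI method; a simple and common choice is bootstrap resampling, sketched below (reusing the metric functions above):

```python
# Bootstrap confidence interval sketch: resample (golden, predicted) pairs with
# replacement and take the 2.5th/97.5th percentiles of the recomputed metric.
import numpy as np

def bootstrap_ci(golden_flags, predicted_flags, metric_fn,
                 n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    golden = np.asarray(golden_flags)
    predicted = np.asarray(predicted_flags)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(golden), size=len(golden))
        scores.append(metric_fn(golden[idx], predicted[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```

For example, `bootstrap_ci(golden_flags, predicted_flags, accuracy_score)` would return the 95% interval for accuracy.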

Confidence intervals are truly indispensable when comparing one experimental RAG approach to another. They allow us to compare not just one value to another but the distributions of those values. In other words, if the accuracy of experiment 1 is greater than the accuracy of experiment 2, that alone wouldn't suffice to call experiment 1 more successful: if the CIs overlap significantly, e.g., e1: [0.5, 0.6] and e2: [0.52, 0.65], it becomes clear that the difference in results is caused by random variation rather than an actual difference in performance.

In the same way, confidence intervals tell us when a metric's results are more random than reliable: when the span between the lower and upper bounds of the interval is too wide. For accuracy, a span of 0.4 would be enough to "ignore" the results of an experiment.
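Taken together, the two rules above boil down to a couple of trivial checks; the 0.4 span threshold comes from the text, while the overlap check is a rule of thumb rather than a formal significance test:

```python
# Decision-rule sketch: treat overlapping intervals as "no clear winner" and
# overly wide intervals (span > 0.4 for accuracy) as "too noisy to trust".
def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def too_noisy(ci: tuple[float, float], max_span: float = 0.4) -> bool:
    return (ci[1] - ci[0]) > max_span

# Example from the text: e1 [0.5, 0.6] and e2 [0.52, 0.65] overlap, so the
# difference is likely random variation rather than a real improvement.
assert intervals_overlap((0.5, 0.6), (0.52, 0.65))
```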

At this point, with metrics collected as the main "deliverable" of an experiment, it's worth mentioning that across all previous steps involving LLM calls, we were constantly collecting logs of chat messages and responses. Beyond "just" being a debugging aid, I found logs to have a greater use. To extract something valuable from logs, an engineer needs to dedicate some time. But what if that could be automated? In short, yes, it can and should be.
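A minimal sketch of that log collection, assuming plain JSONL files; all field names are illustrative:

```python
# Log-collection sketch: append every chat message and response to a JSONL file
# so later steps (including the Insights stage below) can re-read them.
import json
import time

def log_llm_call(log_path: str, step: str, pdf_name: str,
                 messages: list, response_text: str) -> None:
    record = {
        "timestamp": time.time(),
        "step": step,               # e.g. "golden", "predicted", "judge"
        "pdf_name": pdf_name,
        "messages": messages,
        "response": response_text,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```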

Evaluation. Insights

With metrics collected and logs sitting in their pile after the experiment was conducted, it’s all peaceful there. Someone should go check those logs, but machines should work, and humans should rest.

I tried feeding my well-structured logs, specifically the logs of the LLM-as-a-Judge evaluator (the one that, as you recall, took question and answer text and assigned those 4 binary labels), grouped by file, into a single model call along with the metrics collected in previous steps and some extra logs. It turns out I don't need to dig into the depths of the logs unless I see something unexpected in the Insights output. Without human intervention, without me having to check the logs, the model already does quite a good job of producing insights into the experiment. So, when my evaluation run ends, instead of going to the metrics and then to the logs, I look at the Insights summary produced by the LLM and only then check the logs for the specific things or cases the model has pointed out.

Those file-level insights are then once again summarized: collected into human-readable markdown and put into a single prompt to deliver high-level, human-comprehensible notes from the LLM.
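Sketched out, both steps look roughly like this; the prompts are illustrative stand-ins, not the ones we actually run:

```python
# Insights sketch: one call per file over its judge logs plus metrics, then one
# final call summarizing all file-level insights into a markdown report.
import json
from openai import OpenAI

client = OpenAI()

def file_level_insights(file_name: str, judge_logs: list[dict], metrics: dict,
                        model: str = "gpt-4o") -> str:
    prompt = (
        f"File: {file_name}\n"
        f"Metrics: {json.dumps(metrics)}\n"
        f"Judge logs: {json.dumps(judge_logs)}\n\n"
        "Point out patterns, failure cases, and anything unexpected."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": "You analyze RAG evaluation logs."},
                  {"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def overall_report(per_file_insights: list[str], model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system",
                   "content": "Summarize these per-file insights into a short markdown report."},
                  {"role": "user", "content": "\n\n---\n\n".join(per_file_insights)}],
    )
    return response.choices[0].message.content
```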

Figure 5: Example Analysis Report, Model Generated

Conclusion and Future Thoughts

You just witnessed my specifically tailored implementation of LLM-as-a-Judge evaluation. Even though it already delivers positive value, there are improvements and advancements worth considering:

  • Introduce Comparative Evaluation alongside Independent Evaluation; that way, the model gets to compare Golden and Predicted results directly, which could deliver more robust metrics and insights.
  • Introduce machine-generated, human-answered questions, which would bring more welcome diversity to the question set.
  • Handle the cases when a question simply cannot be answered; currently, Golden Answers are confident here, as the model has seen the whole document, while Predicted Answers are far less confident, since the chat believes it failed to retrieve the correct information.

Once that quirk is addressed and Comparative Evaluation is implemented, the approach becomes solid and will bring positive business value.

Related articles


Chatting with Large PDFs (100–500 Pages): Using RAG with OpenAI Embeddings (Local vs. API)


Experiments with different LMMs


Prompt Engineering through Structured Instructions and Advanced Techniques