Tackling the Explainability Gap in RAG Hallucination Evals


This post was cowritten by me and was originally published on Dynamo AI’s blog.

As many of our enterprise customers move from PoC to production LLM deployment, we find that enterprises need to demonstrate robust reliability testing of their AI systems. The tendency for LLMs to “hallucinate” incorrect or inconsistent outputs remains a major challenge for enterprises at this stage.

In a recent example, Air Canada’s chatbot hallucinated information about refunds and discounts, leading to significant confusion and complaints. Moreover, for highly-regulated enterprises such as financial institutions, regulators like the Consumer Financial Protection Bureau have highlighted that “deficient chatbots” can lead to a “risk of noncompliance with federal consumer financial laws.”

Specifically, the CFPB states that a chatbot “providing inaccurate information regarding a consumer financial product or service, for example, could be catastrophic. It could lead to the assessment of inappropriate fees, which in turn could lead to worse outcomes such as default, resulting in the customer selecting an inferior option or consumer financial product, or other harms.”

While retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding outputs in retrieved passages, enterprises deploying RAG still typically see high degrees of hallucinations during their testing. To safely deploy LLMs, enterprises are beginning to widely integrate routine hallucination evaluators to measure and trace the root causes of hallucinations in their RAG pipelines.

While open-source LLM evaluators have played an important role in the evolution of this space, we find that regulated enterprises that are moving LLMs into real production environments require an enterprise-grade solution that includes more explainable metrics and alignment with regulatory standards for comprehensive red-teaming. For example, most of our customers who have experimented with open-source LLM evaluators are still left with key unresolved questions such as:

  1. Without an interpretable hallucination risk score, what is an acceptable “threshold score” for deploying LLMs into production?
  2. If my AI system is not meeting a satisfactory hallucination risk score, what actionable steps can I take to mitigate hallucinations?
  3. How can I explain the testing I’ve performed to regulators and meaningfully explain residual risk that may exist?

In this post, we’ll explore the challenges enterprises face in tackling RAG hallucinations, the limitations of existing tools, and introduce Dynamo AI’s comprehensive solution for measuring and tracking these issues.

Limitation of existing tools

While many tools exist for evaluating the degree of hallucination for RAG applications, major limitations include the following:

Dynamo AI’s RAG hallucination evaluation

Dynamo AI provides a comprehensive RAG evaluation solution that assesses model performance across multiple metrics:

  1. Retrieval relevance: Represents the relevance of the documents retrieved from the vector database using the embedding model for each query.
  2. Faithfulness: Evaluates whether the generated response is consistent with the retrieved documents.
  3. Response relevance: Determines if the generated response adequately addresses the given query.

Dynamo AI leverages purpose-built models for each evaluation task, ensuring cost-efficiency and enabling in-depth analysis. Further, the platform offers actionable insights by identifying topic clusters where the RAG pipeline underperforms and categorizing errors by issue type for in-depth analysis. To demonstrate our solution, we ran our RAG hallucination tests against the MultiDoc2Dial dataset and compared the results with RAGAS for reference.

Accurate and interpretable performance metrics

In a head-to-head comparison, DynamoEval’s RAG hallucination suite outperformed RAGAS in a classification task of identifying good/bad context/responses given a query. We measured accuracy and area under the receiver operating characteristic (AUROC) across the following metrics: Retrieval Relevance, Faithfulness, and Response Relevance. The following improvements in performance have been achieved through additional prompt optimizations and the use of performant task-specific models.

DynamoEval, unlike RAGAS, returns both the relevance/faithfulness scores and binary labels (good/bad). Test results with only the scores tend to be more ambiguous due to the difficulties associated with drawing a clear threshold demarcating good and bad.

The receiver operating characteristic (ROC) curves and the resulting AUROC values shown below demonstrate that the relevance/faithfulness scores from DyamoEval are more accurate in diagnosing Retrieval Relevance, Faithfulness, and Response Relevance.

Easier interpretation of Response Relevance test results

Easy interpretation of Retrieval Relevance test results

Investigate sources of error using topic level clustering

DynamoEval does not stop at generating classification labels and scores for each metrics, but further clusters the input queries based on different topics to provide additional insights for sources of errors and improvements. Analyzing hallucination metrics at a topic-level enables targeted data augmentation and model fine-tuning to address weak areas.

The results explored below are based on the aforementioned test between DynamoEval and RAGAS, wherein we constructed a binary classification dataset from Multidoc2dial, evaluated RAGAS and DynamoEval using accuracy and AUROC, and compared their performance. We also analyzed individual topics for their RAG metrics to dive deeper into specific areas of performance within the RAG pipeline.

For the “student scholarship” topic, Retrieval Relevance is low at 0% (0% of tested queries in this topic retrieved the correct document chunk). This suggests that there may be opportunities for improvements in the retrieval mechanism. One possible reason for the low Retrieval Relevance score could be that the vector database used in the test lacks sufficient information on student scholarships, which could be improved through the injection of additional scholarship-topic related documents to the vector database.

Another possible reason for the low Retrieval Relevance score could be that the embedding model used as part of the retriever is not performant enough to identify the correct scholarship-topic related documents, in which case additional fine-tuning of the embedding model may be necessary.

Faithfulness is also relatively low for the “disability eligibility” topic at 9%, indicating that the generator model struggles to produce information consistent with the retrieved documents, even if they are relevant. Augmenting the training data with more ground-truth, question-context-answer pairs related to disabilities could help fine-tune the generator to be more faithful.

Using the labels from the above section, we can drill deeper into our topic-specific metrics to find out whether any poor-performance metric was related to either a generator or retriever related problem. The analysis looks at combinations of Retrieval Relevance, Faithfulness, and Response Relevance to pinpoint issues.

For example, if Retrieval Relevance and Response Relevance are both high but Faithfulness is low, it may suggest that the generator is not leveraging the retrieved information properly; or if Retrieval Relevance is low but Faithfulness and Response Relevance are high, the retriever may be the source of the problem (see the example below).

In conclusion, Dynamo AI’s evaluation suite for RAG addresses two major limitations in existing tools:

  1. A lack of interpretable metrics, which is addressed via more intuitive and accurate set of classification labels and scores
  2. A lack of fine-grained, detailed analysis of the errors for actionable improvements, which is addressed with topic-level clustering and error type analysis

Comparison methodology with RAGAS

Running a DynamoEval RAG hallucination test from the SDK

Dynamo AI provides an easily configurable SDK method to set up and run the RAG hallucination test by specifying the following parameters:

How Dynamo AI can help

At Dynamo AI, we are committed to helping organizations measure and mitigate RAG hallucination effectively. Our comprehensive RAG evaluation offering provides deep insights into model performance, enabling teams to identify and address weaknesses in their RAG pipelines.

We also offer a range of AI privacy and security solutions to help you build trustworthy and responsible AI systems. To learn more about how Dynamo AI can help you evaluate and improve your RAG models, or to explore our AI privacy and security offerings, please request a demo.

· ai, llms, rag, hallucination, eval