Hallucination has become a catch-all term for when a model generates responses that are incorrect or fabricated. Being able to measure hallucination is a first step in managing it.
As recent results show, RAG is quite the equaliser: the disparity in model performance is much lower when RAG is introduced, as opposed to when it is absent.
What I like about this approach is its focus on the different tasks LLMs may be used for, ranging from chat to summarisation and more.
This is a practical benchmark, useful for Enterprise Generative AI teams that need to cater to variability in task types. For instance, a model that works well for chat might not be great at text summarisation.
The study also points to the power of context, arguing that hallucination benchmarks need to take context into consideration. Retrieval augmented generation (RAG) has been popularised as an avenue for providing LLMs with a contextual reference at inference time.
Granted, there is nuance with regard to the quality of the context, but measuring variability in LLM performance across RAG versus non-RAG tasks is critical.
Retrieval-Augmented Generation (RAG) refers to a hybrid approach that combines retrieval-based information with the generative knowledge of the LLM.
In the context of large language models (LLMs) like GPT-3, retrieval-augmented generation typically involves a two-step process:

1. Retrieval: relevant documents or passages are fetched from an external knowledge source based on the user's query.
2. Generation: the retrieved context is passed to the LLM together with the query, and the model produces a response grounded in that context.
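To make the two steps concrete, here is a minimal sketch in Python; `embed`, `vector_store.search`, and `generate` are hypothetical placeholders for whichever embedding model, vector index, and LLM endpoint a team actually uses, not a specific library's API:

```python
def retrieve(query: str, vector_store, embed, top_k: int = 3) -> list[str]:
    """Step 1: fetch the passages most similar to the query."""
    query_vector = embed(query)
    return vector_store.search(query_vector, top_k=top_k)


def answer(query: str, vector_store, embed, generate) -> str:
    """Step 2: generate an answer grounded in the retrieved context."""
    context = "\n\n".join(retrieve(query, vector_store, embed))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```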
The Short Context RAG task seeks to identify the most effective model for understanding contexts of up to 5k tokens.
Its primary goal is to detect any loss of information or reasoning capability within these contexts.
Similar to referencing select pages in a book, this setup is especially suitable for tasks that demand domain-specific knowledge.
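As an illustration of what those "select pages" look like in practice, a corpus is typically split into chunks small enough that a handful of retrieved passages fit inside a ~5k-token window. The sketch below approximates tokens with whitespace-separated words, which is a simplifying assumption; a real pipeline would count with the model's own tokenizer:

```python
def chunk_document(text: str, max_tokens: int = 500) -> list[str]:
    """Split a document into ~max_tokens-word chunks for short-context RAG.
    Words stand in for tokens here as a rough approximation."""
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

A 5k-token context then holds roughly ten such chunks plus the question itself.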
The Medium Context RAG task aims to determine the most effective model for comprehending long contexts spanning 5k to 25k tokens.
It focuses on identifying any loss of information and reasoning ability within these extensive contexts. Additionally, we experiment with a prompting technique known as Chain-of-Note to improve performance, since it has worked well for short contexts.
This task is akin to doing RAG on a few book chapters.
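Chain-of-Note has the model write a short reading note on each retrieved passage before committing to an answer, so irrelevant or conflicting passages can be discounted. Here is a sketch of what such a prompt might look like; the wording is illustrative, not the benchmark's actual template:

```python
# Illustrative Chain-of-Note style prompt: note each passage's relevance
# first, then answer using only the relevant ones.
CHAIN_OF_NOTE_TEMPLATE = """\
You are given {n} retrieved passages and a question.

{passages}

Question: {question}

First, write a brief note for each passage, stating whether it is
relevant to the question and what information it contributes.
Then, using only the relevant passages, give the final answer.

Notes:
"""


def build_prompt(passages: list[str], question: str) -> str:
    numbered = "\n\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(passages)
    )
    return CHAIN_OF_NOTE_TEMPLATE.format(
        n=len(passages), passages=numbered, question=question
    )
```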
This heatmap shows the model's ability to recall information in different parts of the context.
The x-axis represents the length of the context during the experiment, and the y-axis represents the location of the information. Green indicates successful recall, while red indicates failure.
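One common way to produce such a heatmap is a needle-in-a-haystack test: plant a known fact at varying depths inside filler contexts of varying lengths and check whether the model can retrieve it. A rough sketch follows, where `query_model` is a hypothetical call to the model under evaluation and context length is counted in words for simplicity:

```python
FILLER = "The sky was clear and the day was uneventful. "
NEEDLE = "The secret code is 7421."
QUESTION = "What is the secret code?"


def recall_grid(query_model, lengths, depths):
    """Return {(context_length, depth): recalled?} for each heatmap cell."""
    filler_words = FILLER.split()
    grid = {}
    for length in lengths:                 # x-axis: context length in words
        base = (filler_words * (length // len(filler_words) + 1))[:length]
        for depth in depths:               # y-axis: relative needle position
            ctx = list(base)
            ctx.insert(int(depth * len(ctx)), NEEDLE)
            response = query_model(" ".join(ctx) + "\n\n" + QUESTION)
            grid[(length, depth)] = "7421" in response  # green vs red cell
    return grid
```

Each cell of the returned grid corresponds to one square of the heatmap: a successful lookup renders green, a failed one red.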
Find the report here.