The SELF-RAG framework trains a single arbitrary language model to adaptively retrieve passages on demand, and to generate and reflect on both the retrieved passages and its own generations using special tokens, called reflection tokens.
Reflection tokens are categorised into retrieval tokens and critique tokens: the former signal the need for retrieval, while the latter let the model self-evaluate the quality of its own generations.
Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behaviour to diverse task requirements.
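To make the taxonomy concrete, here is a minimal sketch of the reflection-token vocabulary as plain data. The token names and values approximate those in the SELF-RAG paper; the dictionary structure itself is purely illustrative:

```python
# Reflection-token vocabulary, approximating the SELF-RAG paper's taxonomy.
REFLECTION_TOKENS = {
    # Retrieval token: emitted mid-generation to signal whether passages
    # should be fetched before continuing.
    "Retrieve": ["yes", "no", "continue"],
    # Critique tokens: emitted to self-evaluate a generated segment.
    "IsRel": ["relevant", "irrelevant"],                                # passage relevance
    "IsSup": ["fully supported", "partially supported", "no support"],  # grounding
    "IsUse": ["5", "4", "3", "2", "1"],                                 # overall usefulness
}
```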
The study shows that SELF-RAG significantly outperforms both standalone LLMs and standard RAG approaches.
Considering the image below… SELF-RAG learns to retrieve, critique, and generate text passages to enhance overall generation quality, factuality, and verifiability.
SELF-RAG does introduce more overhead in terms of inference. As the image above shows, the self-reflective approach to RAG adds several points of inference to the standard flow.
A first inference step is performed, after which three inference steps run in parallel, one for each retrieved passage. The three candidate results are then compared and a winner is selected to continue the RAG generation, as sketched below.
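A minimal sketch of that generate-in-parallel-then-select step, assuming `generate` and `critique` are callables wrapping your own LM interface (neither name comes from the paper):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

def selfrag_step(
    generate: Callable[[str, str], str],  # (query, passage) -> candidate text
    critique: Callable[[str], float],     # candidate text -> critique score
    query: str,
    passages: Sequence[str],
) -> str:
    """Generate one candidate continuation per retrieved passage in
    parallel, then keep the candidate with the best critique score."""
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda p: generate(query, p), passages))
    # In SELF-RAG the score would combine the probabilities of favourable
    # IsRel / IsSup / IsUse critique tokens; here it is an opaque callable.
    return max(candidates, key=critique)
```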
As can also be seen in the image above, out-of-domain queries are recognised as such; those requests are not serviced via retrieval but are sent directly to LLM inference.
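That routing decision can be sketched as a simple gate. Here `retrieve_prob` is a hypothetical helper exposing the probability the model assigns to its "retrieve" reflection token, and `llm` and `rag_answer` stand in for whatever callables a given stack provides:

```python
def answer(query, *, llm, rag_answer, retrieve_prob, threshold=0.5):
    """Route a query either straight to the LLM or through retrieval."""
    if retrieve_prob(query) < threshold:
        # Out-of-domain (or self-contained) query: skip retrieval and
        # answer via direct LLM inference.
        return llm(query)
    # In-domain query: take the full retrieve-critique-generate path.
    return rag_answer(query)
```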
Considering the image below, the question needs to be asked…
With the complexity being introduced to the RAG process, are we not reaching a point where an agent-based RAG approach will work best? This is an approach LlamaIndex refers to as Agentic RAG.
There have been studies where intent-based routing is used to triage user input for the correct treatment within a generative AI framework. Intents are merely pre-defined use-case classes, as the sketch below illustrates.
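A minimal sketch of such intent-based triage, assuming `classify` and the handler functions are hypothetical stand-ins for whatever classifier and pipelines a given stack provides:

```python
def route(query, classify, handlers, default):
    """Triage user input into a pre-defined use-case class (an intent)
    and dispatch it to that intent's treatment. `classify` might be a
    small fine-tuned classifier or a few-shot LLM call returning a label."""
    intent = classify(query)
    return handlers.get(intent, default)(query)

# Hypothetical wiring: each intent gets its own treatment.
# route(user_input, classify=intent_model,
#       handlers={"product_question": rag_pipeline,   # grounded in the KB
#                 "chitchat": plain_llm,              # no retrieval needed
#                 "account_action": agent_workflow},  # tool-using agent
#       default=plain_llm)
```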