There are a number of considerations to take into account:
Note that this Microsoft study used unsupervised fine-tuning, meaning the training data was not annotated.
The results from the current events tasks clearly demonstrate a significant advantage for RAG over fine-tuning. While fine-tuning did enhance results compared to the base model in many instances, it fell short of competing with the effectiveness of the RAG approach.
Several factors likely contribute to this discrepancy.
Firstly, RAG not only enriches the model with knowledge but also incorporates context that is pertinent to the question, a capability lacking in fine-tuning.
Secondly, fine-tuning may adversely affect other aspects of the model due to a phenomenon known as catastrophic forgetting.
Thirdly, it’s conceivable that unsupervised fine-tuned models could benefit from additional alignment through supervised fine-tuning, as illustrated by the markedly improved performance of Orca2 over the base Llama2.
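To make the first point concrete, here is a minimal sketch of how RAG places question-relevant context directly in the prompt, which a fine-tuned model never receives at inference time. It is not the study's pipeline: the word-overlap retriever and the helper names (`overlap_score`, `retrieve`, `build_rag_prompt`) are illustrative assumptions, and a real system would typically use embedding-based retrieval.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, passage):
    # Hypothetical relevance score: number of tokens shared by question and passage.
    shared = Counter(tokenize(question)) & Counter(tokenize(passage))
    return sum(shared.values())


def retrieve(question, corpus, k=2):
    # Return the k passages that share the most tokens with the question.
    ranked = sorted(corpus, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, corpus, k=2):
    # Prepend the retrieved passages so the model sees pertinent context.
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up corpus, for illustration only.
corpus = [
    "The launch of the Atlas rover was delayed to March.",
    "Bananas are rich in potassium.",
    "The Atlas rover carries a new drill.",
]
question = "When was the Atlas rover launch delayed?"
print(build_rag_prompt(question, corpus))
```

The prompt that reaches the model contains only the two rover passages; the unrelated passage is filtered out before generation.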
The graph above shows the relative accuracy gain for each knowledge-injection method. What stands out is how much the gains differ between models.
Also notable is that combining fine-tuning with RAG does not always outperform RAG or fine-tuning on its own.
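As a rough illustration of what a relative accuracy gain measures, the sketch below assumes it is the improvement over the base model divided by the base model's accuracy. The study may define the metric differently, and the accuracy values here are made-up placeholders, not results from the paper.

```python
def relative_gain(base_accuracy, method_accuracy):
    # Assumed definition: improvement expressed as a fraction of the base accuracy.
    if base_accuracy <= 0:
        raise ValueError("base_accuracy must be positive")
    return (method_accuracy - base_accuracy) / base_accuracy


# Hypothetical accuracies for a single model, for illustration only.
base = 0.5
methods = {"fine-tuning": 0.55, "RAG": 0.75, "fine-tuning + RAG": 0.7}

for name, accuracy in methods.items():
    print(f"{name}: {relative_gain(base, accuracy):+.0%}")
```

Under this definition the same absolute improvement produces a larger relative gain for a weaker base model, which is one reason the gains can look so different from model to model.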
Some aspects of this study warrant further research. For example, the study focussed on unsupervised training as its primary fine-tuning method, rather than instruction-tuning or supervised fine-tuning.
Repeating the experiments with other LLMs could also yield interesting results.
Find the study here.