LLM Drift is definite changes in LLM responses over a relatively short period of time. This is not related to LLMs being in essence non-deterministic or related to slight prompt engineering wording changes; but rather fundamental changes to the LLM.
A recent study found that over a period of four months, the response accuracy of GPT-4 and GPT-3.5 fluctuates considerably in the positive but more alarming…negatively.
The study found that both GPT-3.5 and GPT-4 varied significantly and that there was performance degradation on some tasks.
Our findings highlight the need to continuously monitor LLMs behaviour over time. — Source
The schematic below shows the fluctuation in model accuracy over a period of four months. it some cases the deprecation is quite stark, being more than 60% loss in accuracy.
The output of LLMs are non-deterministic, this means that the exact input, to the same LLM, at different times, will most probably yield different responses over time.
In essence this is not a problem, and wording can differ while the ground truth remains the same.
However, there are instances where there are aberrations in the response of the LLM. For instance, LLMs are deprecated and migration is often necessitated, as we saw recently with OpenAI deprecating a number of models. Hence the prompt remains the same, but the underlying model referenced change.
The data which is injected into the prompt at inference might also be different at times. Suffice to say all of these factors contribute to a phenomenon known as prompt drift.
Prompt Drift is the phenomenon where a prompt yields different responses over time due to model changes, model migration or changes in prompt-injection data at inference.
There has been the emergence of Prompt Management and testing interfaces, like for instance GALE introduced by Kore.ai.
There is a definite market-need to ensure generative applications (Gen-Apps) can be tested prior to large language model migration/deprecation.
And if a model could be largely agnostic to the underlying LLM, so much the better. One avenue to move closer to achieving this, is leveraging in-context learning (ICL) capabilities of Large Language Models.
Cascading is when an aberration or deviation is introduced by one of the nodes in a chain, and this unexpected exception is carried over to the next node, where the exception will most probably be exacerbated.
With each node output deviating further-and-further from the intended outcome.
This phenomenon is commonly referred to as cascading.
Prompt chaining should not viewed in isolation, but rather consider Prompt Engineering as a discipline which consists of several legs.
The wording or technique followed when prompting the LLM is also important and has a demonstrable effect on the quality of the output.
Prompt Engineering is the foundation of Chaining and the discipline of Prompt Engineering is very simple and accessible.
However, as the LLM landscape develops, prompts are becoming programable (templates and context injection via RAG) and incorporated into increasing complex structures.
Hence chaining are being supported by elements like Agents, Pipelines, Chain-of-Thought Reasoning, etc.