LLM Drift, Prompt Drift & Cascading

Written by Cobus Greyling | Sep 19, 2024 5:26:18 AM

Prompt Chaining can be performed manually or automatically; manual entails crafting chains by hand, via a GUI chain building tool. Autonomous Agents create chains on the fly as they execute while making use of the tools at their disposal. Both these approaches are susceptible to cascading, LLM & prompt Drift.

LLM Drift

LLM Drift is definite changes in LLM responses over a relatively short period of time. This is not related to LLMs being in essence non-deterministic or related to slight prompt engineering wording changes; but rather fundamental changes to the LLM.

A recent study found that over a period of four months, the response accuracy of GPT-4 and GPT-3.5 fluctuates considerably in the positive but more alarming…negatively.

The study found that both GPT-3.5 and GPT-4 varied significantly and that there was performance degradation on some tasks.

Our findings highlight the need to continuously monitor LLMs behaviour over time. — Source

The schematic below shows the fluctuation in model accuracy over a period of four months. it some cases the deprecation is quite stark, being more than 60% loss in accuracy.

Source

Prompt Drift

The output of LLMs are non-deterministic, this means that the exact input, to the same LLM, at different times, will most probably yield different responses over time.

In essence this is not a problem, and wording can differ while the ground truth remains the same.

However, there are instances where there are aberrations in the response of the LLM. For instance, LLMs are deprecated and migration is often necessitated, as we saw recently with OpenAI deprecating a number of models. Hence the prompt remains the same, but the underlying model referenced change.

The data which is injected into the prompt at inference might also be different at times. Suffice to say all of these factors contribute to a phenomenon known as prompt drift.

Prompt Drift is the phenomenon where a prompt yields different responses over time due to model changes, model migration or changes in prompt-injection data at inference.

In short, prompt drift can be caused by:

Model-inspired tangents,
Incorrect problem extraction,
LLM randomness and creative surprises.

There has been the emergence of Prompt Management and testing interfaces, like for instance GALE introduced by Kore.ai.

There is a definite market-need to ensure generative applications (Gen-Apps) can be tested prior to large language model migration/deprecation.

And if a model could be largely agnostic to the underlying LLM, so much the better. One avenue to move closer to achieving this, is leveraging in-context learning (ICL) capabilities of Large Language Models.

Cascading

Cascading is when an aberration or deviation is introduced by one of the nodes in a chain, and this unexpected exception is carried over to the next node, where the exception will most probably be exacerbated.

With each node output deviating further-and-further from the intended outcome.

This phenomenon is commonly referred to as cascading.

Considering the image below:

The user input can be unexpected or unplanned in the chained application, hence producing an unforeseen output from the node.
The previous node’s output can be inaccurate or produce a degree of deviation which is exacerbated in the current node.
The LLM Response can also be unexpected, due to the fact that LLMs are non-deterministic. Point three is where prompt drift or LLM drift can be introduced.
And the output from Node 2 is then carried over and cascading of the deviation is caused.

In Closing

Prompt chaining should not viewed in isolation, but rather consider Prompt Engineering as a discipline which consists of several legs.

The wording or technique followed when prompting the LLM is also important and has a demonstrable effect on the quality of the output.

Prompt Engineering is the foundation of Chaining and the discipline of Prompt Engineering is very simple and accessible.

However, as the LLM landscape develops, prompts are becoming programable (templates and context injection via RAG) and incorporated into increasing complex structures.

Hence chaining are being supported by elements like Agents, Pipelines, Chain-of-Thought Reasoning, etc.

How is ChatGPT's behavior changing over time?

arxiv.org

PromptChainer

arxiv.org

Language Model Cascades

arxiv.org

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

arxiv.org

View full post