LLMs do not have a continuous learning capability and are frozen in time.
The study shows that LLMs perform well on tasks they have seen during training and worse on unseen tasks.
Currently LLMs do not reliably and continuously adapt to a drifting input distribution. Might this point to the relevance of OpenAI's newly introduced system fingerprint functionality, and to models being updated on a perpetual basis?
Hence an LLM's performance is to some degree unpredictable: users can be surprised by highly accurate and succinct results that look like highly astute LLM reasoning, when in actual fact the model is drawing on previously seen information.
And on the flip side, there can be a degradation in LLM performance where the data has not been seen during base model training.
In recent times there has been immense focus on In-Context Learning (ICL), where zero- or few-shot prompts are injected with highly contextual information at inference.
The study considers contamination of zero- and few-shot methods, termed task contamination: the inclusion of task training examples in the pre-training data, which makes the evaluation no longer strictly zero- or few-shot.
Zero-shot and few-shot evaluations involve models making predictions on tasks that they have never seen or seen only a few times during training.
The key premise is that the models have no prior exposure to the particular task at hand, ensuring a fair evaluation of their learning capacity.
Contamination is defined as instances where a model gives a false impression of its zero- or few-shot competency, because it has already been trained on task examples during pre-training.
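To make the distinction concrete, here is a minimal sketch of how zero-shot and few-shot prompts differ in construction. The task (primality classification), labels and examples are hypothetical, chosen only to illustrate the prompt shapes the studies evaluate:

```python
# Illustrative sketch: zero-shot vs. few-shot prompt construction.
# The task, answer format and examples below are hypothetical.

def zero_shot_prompt(question: str) -> str:
    """Zero-shot: the model sees only the task instruction and the question."""
    return f"Is the following number prime? Answer Yes or No.\n{question}"

def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Few-shot: a handful of labelled examples precede the question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        "Is the following number prime? Answer Yes or No.\n"
        f"{shots}\nQ: {question}\nA:"
    )

examples = [("7", "Yes"), ("12", "No")]
print(few_shot_prompt(examples, "29"))
```

Task contamination undermines the premise of both forms: if examples like these already appear in the pre-training data, a strong result no longer reflects genuine zero- or few-shot ability.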
A study released on 31 Oct 2023 evaluated the most widely used Large Language Models (LLMs), including GPT-3.5 and GPT-4, comparing their March 2023 and June 2023 versions.
The study covered diverse tasks such as math problems, sensitive questions, opinion surveys, multi-hop knowledge-intensive questions, code generation, visual reasoning and more. The analysis revealed significant variations in the performance and behaviour of both GPT-3.5 and GPT-4 over time.
GPT-4 accuracy in identifying prime vs. composite numbers dropped from 84% in March 2023 to 51% in June 2023.
GPT-4’s reduced ability to follow Chain-Of-Thought (CoT) prompting and a decline in answering sensitive questions contribute to these changes.
The study emphasises again the importance of continuous monitoring of LLMs due to the observed substantial changes in behaviour over a relatively short period.
…dated 26 Dec 2023 investigated how the zero-shot and few-shot performance of LLMs has changed over time.
We find evidence that some LLMs have seen task examples during pre-training for a range of tasks, and are therefore no longer zero or few-shot for these tasks. — Source
The study considered 12 models shown below…
LLMs from the GPT-3 series of models and several other open-source LLMs were used for the tests.
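One family of checks for this kind of contamination looks for verbatim overlap between evaluation examples and the pre-training corpus. The following is a loose, self-contained sketch of such an n-gram overlap test; the toy corpus, the n-gram length and the idea of using overlap as the signal are assumptions for illustration, not the study's exact method:

```python
# Hedged sketch: flag a possible contamination signal by checking whether
# long n-grams from an evaluation example appear verbatim in a (toy)
# pre-training corpus. Corpus and n-gram length are hypothetical.

def ngrams(text: str, n: int) -> set[str]:
    """All whitespace-tokenised n-grams of the text, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(example: str, corpus: str, n: int = 5) -> float:
    """Fraction of the example's n-grams found verbatim in the corpus."""
    ex = ngrams(example, n)
    if not ex:
        return 0.0
    return len(ex & ngrams(corpus, n)) / len(ex)

corpus = "the quick brown fox jumps over the lazy dog near the river bank"
seen = "the quick brown fox jumps over the lazy dog"
unseen = "entirely different words that never appear in training data at all"
print(overlap_ratio(seen, corpus))    # 1.0 -> possible contamination
print(overlap_ratio(unseen, corpus))  # 0.0
```

A high overlap ratio for a task's examples would suggest the model may have seen them during pre-training, which is exactly the situation that invalidates a zero- or few-shot claim.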
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.