With the launch of one of Google's LLMs, Gemini Ultra was shown to have exceeded human-expert level on the MMLU benchmark.
Considering the image below, I remember how in 2017 it was illustrated that the word accuracy rate of Google's ML voice recognition had exceeded human level: with the human accuracy threshold at 95%, Google ASR reached 95%+.
Fast forward six years, and Large Language Models (LLMs) are reaching human expert levels according to the MMLU measurement.
But not only that: LLMs incorporate reasoning and in-context learning, act as knowledge-intensive general NLP systems that can also be trained on specific knowledge, and have natural language generation (NLG) capabilities.
MMLU is underpinned by a massive multitask test set consisting of multiple-choice questions from various branches of knowledge.
The test spans the humanities, social sciences, hard sciences, and other important areas; totalling 57 tasks.
The 57 tasks are spread over 15,908 questions in total, which are split into a few-shot development set, a validation set, and a test set.
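To make the setup concrete, here is a minimal sketch of how an MMLU-style few-shot evaluation is typically assembled: solved development-set questions are prepended to the test question, and the model's predicted answer letter is compared with the gold label. The data structures and function names here are illustrative assumptions, not part of the official harness.

```python
# Illustrative MMLU-style evaluation helpers. The question dicts and the
# prompt wording are assumptions for demonstration, not the official code.
CHOICES = ["A", "B", "C", "D"]

def format_question(q, include_answer=True):
    """Render one multiple-choice question in the common MMLU layout."""
    text = q["question"] + "\n"
    for letter, option in zip(CHOICES, q["options"]):
        text += f"{letter}. {option}\n"
    text += "Answer:"
    if include_answer:
        text += f" {q['answer']}\n\n"
    return text

def build_prompt(dev_examples, test_question, subject):
    """Few-shot prompt: k solved dev questions, then the unanswered test question."""
    prompt = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    for ex in dev_examples:
        prompt += format_question(ex)
    prompt += format_question(test_question, include_answer=False)
    return prompt

def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

The model under test would complete the final "Answer:" line with a letter; scoring is then a simple exact match against the gold labels.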
Considering the image below, the study found that GPT-3's confidence is a poor estimator of its accuracy; the model can be off by up to 24%.
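The kind of calibration check behind this finding can be sketched as follows: group predictions by the model's stated confidence and compare the mean confidence in each bucket with the accuracy actually observed there. The function and its inputs are a simplified illustration, not the study's code.

```python
# Sketch of a confidence-calibration check: a well-calibrated model that says
# "95% confident" should be right about 95% of the time. The records format
# (confidence, is_correct) is an assumption for illustration.
from collections import defaultdict

def calibration_gaps(records, num_buckets=10):
    """records: list of (confidence, is_correct) pairs.
    Returns {bucket_index: (mean_confidence, accuracy, gap)}."""
    buckets = defaultdict(list)
    for conf, correct in records:
        idx = min(int(conf * num_buckets), num_buckets - 1)
        buckets[idx].append((conf, correct))
    result = {}
    for idx, items in sorted(buckets.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        acc = sum(ok for _, ok in items) / len(items)
        result[idx] = (mean_conf, acc, mean_conf - acc)
    return result
```

A large positive gap in the high-confidence buckets is exactly the overconfidence the study reports.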
We speculate that is in part because GPT-3 acquires declarative knowledge more readily than procedural knowledge. — Source
The study found that GPT-3 is aware of certain facts (having been trained on them) but does not apply those facts in reasoning. These findings help explain why In-Context Learning (ICL) can be applied so effectively via a chain-of-thought process.
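The chain-of-thought idea mentioned above can be shown with a minimal prompt template: instead of asking for the answer directly, the few-shot example demonstrates intermediate reasoning steps, nudging the model to apply facts rather than merely recall them. The worked example below is made up for illustration.

```python
# Minimal chain-of-thought prompt template. The worked example is invented
# purely to demonstrate the prompt shape; any LLM call would consume the
# string this function returns.
COT_EXAMPLE = (
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: 12 pens is 4 groups of 3 pens. Each group costs $2, "
    "so the total is 4 x $2 = $8. The answer is $8.\n\n"
)

def cot_prompt(question):
    """Prepend a worked, step-by-step example so the model imitates the reasoning."""
    return COT_EXAMPLE + f"Q: {question}\nA:"
```

Because the in-context example spells out the procedure, the model is steered toward producing its own intermediate steps before the final answer.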
The study’s comments on multi-modal models are also interesting…
While text can effectively convey a vast array of ideas about the world, numerous crucial concepts rely heavily on alternative modalities like images, audio, and physical interaction.
With the advancement of models in processing multimodal inputs, benchmarks need to adapt to this evolution.
A potential benchmark for this purpose is a “Turk Test,” comprising Amazon Mechanical Turk Human Intelligence Tasks. These tasks are well defined and demand that models engage with flexible formats, showcasing their understanding of multimodal inputs.
Here is a summary of model limitations found by the study…
Read more here.