How Should Large Language Models Be Evaluated?

A while ago, Tianjin University released a study that defined a taxonomy for LLM evaluation and implementation.

The image below shows the taxonomy of major categories and sub-categories used by the study for LLM evaluation.

From a technology perspective, much of the focus within this taxonomy has been on Evaluation Organisation, Knowledge & Capability and Specialised LLMs.

From an enterprise perspective, the concerns and questions raised centre around Alignment and Safety.

The survey aims to provide a comprehensive perspective on how LLMs should be evaluated.

The study categorises the evaluation of LLMs into three major groups:

  1. Knowledge and Capability Evaluation,
  2. Alignment Evaluation and
  3. Safety Evaluation.
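To make the three groups a little more concrete, here is a minimal sketch of how they could be used as a coverage checklist when planning an evaluation. This is my own illustration and not something from the study: the `EvaluationPlan` class, its methods and the benchmark names are all hypothetical, and only the three group names come from the taxonomy.

```python
from dataclasses import dataclass, field

# The three major groups named in the study.
MAJOR_GROUPS = (
    "Knowledge and Capability Evaluation",
    "Alignment Evaluation",
    "Safety Evaluation",
)


@dataclass
class EvaluationPlan:
    """Hypothetical helper: records which benchmarks are planned per group."""

    benchmarks: dict = field(default_factory=dict)

    def add(self, group: str, benchmark: str) -> None:
        # Reject anything that is not one of the three major groups.
        if group not in MAJOR_GROUPS:
            raise ValueError(f"unknown evaluation group: {group}")
        self.benchmarks.setdefault(group, []).append(benchmark)

    def uncovered_groups(self) -> list:
        # Groups that have no benchmark planned yet, in taxonomy order.
        return [g for g in MAJOR_GROUPS if not self.benchmarks.get(g)]


if __name__ == "__main__":
    plan = EvaluationPlan()
    # "internal-reasoning-suite" is a made-up benchmark name.
    plan.add("Knowledge and Capability Evaluation", "internal-reasoning-suite")
    print(plan.uncovered_groups())
    # prints ['Alignment Evaluation', 'Safety Evaluation']
```

A plan that only measures capability would show Alignment and Safety as uncovered, which are exactly the two areas enterprises tend to ask about.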

What I also found interesting in the study was its focus on more complex capabilities like reasoning and tool learning, as seen below:

In the early days of Natural Language Processing (NLP), researchers frequently used a series of simple benchmarks to assess the performance of their language models.
 
These initial assessments predominantly focused on elements such as syntax and vocabulary, including tasks like parsing syntactic structures, disambiguating word senses, and more.

LLMs have introduced significant complexity. In current evaluations, they reveal behaviours indicative of risk and demonstrate the ability to perform higher-order tasks.

Consequently, having a taxonomy like this to reference can help ensure that due diligence is followed when assessing LLMs.

Find the full study here
