When any organisation is building an end-user facing GUI for a LLM-based personal assistant, this research is invaluable.
Users have different expectations in terms of detailed, factual and professional responses, across different intents.
Understanding the user intent is important to the LLM in knowing how to respond.
LLMs fail most in solving problems and creativity; both are difficult intents to quantify and gauge to understand what the user expectation is.
It seems like a level of disambiguation will work well for more creative and problem solving tasks. Where the dialog turn is introduced from the LLM to seek clarification from the user.
Classifying user intent is still important; perhaps not from a run-time/inference-time perspective for the time being. But classifying user intent over time will add to the success of the assistant.
With Large Language Models (LLMs) the focus has been two-fold; the first has been on various methods and approaches to benchmarking LLMs.
The second focus was on how to enhance the intelligence of LLMs and the most effective way of delivering data to the model.
The gap identified by a recent study is the user experience (UX) portion; how are users interacting, using and experiencing LLMs.
This study focuses on four key aspects:
The study develops a taxonomy of seven user intents in LLM interactions based on real-world interaction logs and human verification.
The four research questions of this study are:
Considering the image above, what I really found valuable are the user intents identified by the study. Together with 11 findings on usage frequency, user experience, and concerns with LLMs.
And finally 6 research directions for future human-AI collaboration studies are highlighted.
From the participants surveyed, hals interacting with LLMs on a daily basis and a large percentage of users interacts with LLMs on a weekly basis.
The graph below shows the users intent distribution. The top three intents are expected, using the LLM as a text assistant, information retrieval and for solving problems.
What I find interesting is the large extent to which LLMs are used for creativity and giving advice.
Text Assistant elicited the highest satisfaction rates , with over 80% reporting being satisfied or very satisfied. This suggests that this conversational service is well-suited to language-based assistance tasks.
Seeking Creativity use cases had the highest negative feedback. Around 18% of English users reported not being satisfied. This indicates that current LLMs have room for improvement when generating novel or imaginative outputs.
Below, User Usage Percentage and Rating across Intents. Usage percentage represents the ratio of users who used LLMs under each intent.
User ratings are calculated by assigning 1–5 to “very dissatisfied” — “very satisfied” and then averaging the scores of the users who participated under these intentions.
User Expectations vary greatly across scenarios, which might not always align with the current evaluation standards. Notice from the image below how the expectation of users differ based on the intent.
User concerns and desired improvements for large language models (LLMs) can be broadly classified into two categories: Capability and Trustworthiness.
There are three ways in which LLMs are made available to users.
The LLM can be made available as a raw model, or exposed as an API, or the LLM can be exposed via UI as a personal assistant.
This study can serve as an invaluable resource to build better personal assistants making use of LLMs.
Find the study here.
Previously published on Medium.