Old Problems, New Tech? Making Sense of AI Chatbot Performance Metrics


Since it became possible in 2023 to integrate Large Language Models and Generative AI into development and humanitarian interventions, the sector has dedicated substantial and justified effort to defining how AI models themselves can be evaluated, with a focus on metrics including reliability, relevance, robustness and safety. Less attention has been paid to the ‘messy middle’ – how model performance and other contextual, human, and technical factors affect the usage of AI-powered products, and how this can have a knock-on effect on impact outcomes.

On May 12th, our AI Community of Practice convened to discuss the topic of AI product metrics, with a specific focus on chatbots designed to support Social and Behaviour Change. In conversation with Caryl Feldacker (Gates Foundation), Chelsea McKevitt (GSMA) and Nicola Harford (iMedia Associates), we asked if and how standardised metrics for AI-powered chatbots might usefully be introduced, in a way that leaves room for the specificity of different interventions and deployment contexts. As Caryl Feldacker noted: “We want to use [this data] to suggest decisions for policy and best practice, so that we can be one voice, and ultimately decide what chatbot might be working better, when, for whom, where and why.”

Different tech, same issues, new avenues

The challenge of meaningful product metrics is not unique to AI, but is something the sector has been grappling with ever since the mobile boom allowed us to reach actors in LMICs without physically engaging with them. Terms like “reach”, “engagement” and even “active user” can mean different things to different implementers and can also vary within organisations depending on the tool being deployed and the tech stack used to build it and collect data. This is not an “AI” problem, but a “digital development” problem, which has been exacerbated by the rush towards AI-centred deployments, and rendered more challenging by the many ways in which AI can be deployed, sometimes several of them under the umbrella of a single intervention.

Nicola Harford also reminded us that those deploying analog media, including radio and TV shows developed to communicate complex social and behaviour change messaging, also struggle to make sense of the opaque space between “x people listened / watched / started a chat” and “x people demonstrated outcome-level change.” The irony of digital interventions is that whilst the medium has provided us with a trail of enticing ‘breadcrumbs’, or digital traces, to help us better tell that story of (potential) change, we find ourselves overwhelmed by the data available: what is accurate, what is useful, what is meaningful, what is misleading, and what it all means.

This partly explains why our sector has become polarised in its approach to monitoring digital interventions: emphasising metrics indicative of reach (which correlate to analog metrics such as “listeners”), and then jumping straight to outcome metrics without having spent time making sense of product engagement data and what it might tell us about our users, our assumptions, and the likelihood of even having any impact. 

To an extent, introducing AI functionality does little to shift this existing challenge; it just makes things more confusing, not least because every conversation with an AI-powered chatbot may be subtly different, and therefore lead to subtly different engagement (and outcome) metrics. The conversational nature of AI-powered interactions means that standard ‘funnel’ metrics don’t map as cleanly onto potentially open-ended dialogue: a long conversation might signal interest, or it might signal that the chatbot is failing to answer a question efficiently.

On the other hand, the richness offered by primarily conversational data does offer something genuinely new. Where other digital tools such as websites, apps, or ‘dumb’ chatbots give us ‘exposure’ metrics (whether a user saw a message, pressed a button, completed a module), AI-driven conversations open new avenues for qualitative understanding, including tantalising glimpses into how users interpret and react to our messaging in the moment. However, extracting meaningful insights from this new set of digital traces requires substantial analytical capacity in natural language processing, sentiment analysis or behavioural inference, which many implementing teams struggle to support.

The importance of contextually responsive metrics

During the discussions that followed the speakers’ points, attendees with experience managing portfolios with multiple AI tools (or digital ones in general) noted that it is common for patterns to start to emerge across tools serving similar use cases or sharing the same goals. However, they stressed that any attempt to develop standardised metrics or benchmarks for SBC chatbots needs to be adapted to the intended use case, to ensure that we are not applying irrelevant or unfair standards when it comes to evaluating product successes and failures.

For example, an AI-powered chatbot used by farmers may see patterns of usage that are dictated by seasonal events, or influenced by rapidly changing and unpredictable local or global events. Expected metrics and benchmarks for such a tool, including common ones such as session duration, repeat visit rates, or questions per session, may not be relevant for a different use case such as maternal health, even where AI is being used in a similar way (for example, to answer questions).
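To make the three common metrics named above concrete, here is a minimal Python sketch of how they might be derived from a chatbot message log. Everything in it is an assumption for illustration: the log format (user ID, timestamp, whether the message was a question), the 30-minute inactivity gap used to split sessions, and the function names. It is not a description of how any speaker's organisation calculates these figures.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a gap of more than 30 minutes starts a new session.
# This threshold is itself a choice that should vary by use case.
SESSION_GAP = timedelta(minutes=30)


def sessionise(messages):
    """Group (user_id, timestamp, is_question) tuples into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, ts, is_question in sorted(messages, key=lambda m: (m[0], m[1])):
        sessions = by_user[user_id]
        if sessions and ts - sessions[-1]["end"] <= SESSION_GAP:
            # Still within the same session: extend it.
            sessions[-1]["end"] = ts
            sessions[-1]["questions"] += int(is_question)
        else:
            sessions.append({"start": ts, "end": ts, "questions": int(is_question)})
    return by_user


def engagement_summary(messages):
    """Return mean session length, repeat visit rate and questions per session."""
    by_user = sessionise(messages)
    all_sessions = [s for sessions in by_user.values() for s in sessions]
    if not all_sessions:
        return None
    total_minutes = sum(
        (s["end"] - s["start"]).total_seconds() for s in all_sessions
    ) / 60
    return {
        "mean_session_minutes": total_minutes / len(all_sessions),
        # Share of users who came back for at least a second session.
        "repeat_visit_rate": sum(len(s) > 1 for s in by_user.values()) / len(by_user),
        "questions_per_session": sum(s["questions"] for s in all_sessions)
        / len(all_sessions),
    }


# Hypothetical log for two users of an agriculture chatbot.
t0 = datetime(2025, 5, 12, 9, 0)
messages = [
    ("farmer_a", t0, True),
    ("farmer_a", t0 + timedelta(minutes=10), True),
    ("farmer_a", t0 + timedelta(days=3), True),
    ("farmer_b", t0, True),
    ("farmer_b", t0 + timedelta(minutes=4), False),
]
summary = engagement_summary(messages)
```

The same function run on a maternal health chatbot would return numbers in the same units, but as the discussion stressed, that does not make them comparable: a low repeat visit rate may be a failure in one use case and entirely expected in another.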

Even were we to adapt and adopt ‘gold standard’ metrics built around specific use cases, we also need to consider how context of use can introduce variability. A farmer in Kenya and one in India will not necessarily have the same usage behaviour, due to factors including connectivity, social norms, digital literacy, the realities of the agricultural ecosystem and, of course, gender norms. Who is actually using the tool is another variable that needs to be catered for. Particularly for interventions where gender is a strong component, understanding digital usage constraints and patterns (for example, whether usage is supervised or even delegated) can have a bearing on the relevance of metrics.

Product performance metrics should be rooted in formative research

This also ties back to the importance of formative research (conducted before a commitment to a particular digital intervention has been made) and design research (carried out throughout the digital design process and, ideally, repeated cyclically). By reaching a solid understanding of the envisaged users, and the systems they’re operating within, we have a much better chance of right-sizing the metrics we use to evaluate both product performance, and impact.

In addition, Chelsea stressed the importance of developing a Theory of Change which ties all these insights together into a social and behavioural model that attempts to ‘systematise’ how exactly a digital intervention (and any other parallel initiatives) will ultimately lead to measurable impact. Importantly, though, this framework needs to be reality-tested once the product is live with the intended audience, and if need be, adjusted in line with real-world usage. In her words: “the key is to be extremely flexible, start very early, and then keep bringing up the Theory of Change and using it as a tool to say, hey, now that we’re pivoting, is this actually going to affect your user journey? Is there a bit of impact here that’s not happening or happening sooner?”

Any attempt to develop consensus around AI chatbot metrics must recognise that both granularity and flexibility need to be built in by funders and implementers alike.

Sustainability and Minimum Viable Data

As well as being responsive to the context of the intended users, participants also discussed the importance of product metrics being responsive to the context of the implementing organisation. As mentioned above, one of the biggest challenges can be making sense of the potentially vast volumes of quantitative and qualitative data generated by the conversational nature of genAI-powered tools. In parallel, funders may also put pressure on implementers to deliver results as soon as possible, in an understandable attempt to learn from and justify investments – sometimes leaning on rigid (and outdated) evaluation frameworks to do so.

One alternative for grantmakers and implementers, raised by the GSMA team, is to be led by expected user journeys, and assumptions around them. From this, teams can define the data trail they would expect to see. Additionally, homing in on a small number of the most useful indicators can help teams enormously: “We don’t have our grantees report on more than 10 KPIs total for the whole project because we feel that focusing on what we think are the most important ones puts the entire project team on the right track.”
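As a rough illustration of being led by an expected user journey, the Python sketch below writes the journey down as an ordered list of steps and counts how many users' data trails actually reach each one. The step names, the event log and the function name are all hypothetical; a real journey would come out of a team's own formative research and Theory of Change.

```python
# Hypothetical expected journey, written down before launch.
EXPECTED_JOURNEY = ["opt_in", "first_question", "answer_rated", "return_visit"]


def journey_reach(events_by_user, journey=EXPECTED_JOURNEY):
    """Count users reaching each step, requiring earlier steps to occur first."""
    reach = {step: 0 for step in journey}
    for events in events_by_user.values():
        position = 0  # index of the next expected step for this user
        for event in events:
            if position < len(journey) and event == journey[position]:
                reach[journey[position]] += 1
                position += 1
    return reach


# Hypothetical observed data trail, one ordered event list per user.
events_by_user = {
    "u1": ["opt_in", "first_question", "answer_rated", "return_visit"],
    "u2": ["opt_in", "first_question"],
    "u3": ["opt_in"],
    "u4": ["first_question"],  # never opted in: does not match the expected journey
}

reach = journey_reach(events_by_user)
for step in EXPECTED_JOURNEY:
    print(f"{step}: {reach[step]} of {len(events_by_user)} users")
```

Users such as `u4`, whose trail does not follow the expected order, are where the underlying assumptions are worth revisiting. Keeping the journey to a handful of steps also fits naturally with a short, capped list of KPIs.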

Nicola also made a plea to revisit the importance of qualitative data, suggesting that many attitudinal and behavioural predictors sit hidden within conversational data. Rather than focusing on traditionally commercial indicators such as time spent online or number of repeat visits, we could develop new, potentially AI-assisted, ways to measure the quality of a conversation in real time, which, in combination with a limited set of meaningful quantitative indicators, could help us build an early picture of proximal impact outcomes.

The need for metrics and insights at this ‘messy middle’ stage is clear – but whether they can be standardised as part of an effort to consolidate our collective learning around this new technology is still an open question. What is clear is that we need to explore this space more deeply to better understand what might be standard and what needs to vary based on the audience, context, and purpose of a GenAI application.

If you run an AI-powered chatbot and would be interested in sharing your approach to AI product performance metrics in a closed, safe space with other practitioners, let us know! We will be convening a follow-up workshop with a select group of implementers across health, agriculture and livelihoods, to share common metrics and understand the extent to which we can establish benchmarks which can help us make sense of our own tools’ performance.
