Event recap: Evaluating LLMs for accuracy and inclusion
On 9 October, we held an event about evaluating LLMs for accuracy and inclusion, with Temina Madon (from The Agency Fund), Mala Kumar and Sarah Amos (from Humane Intelligence), and Tarunima Prabhakar (from Tattle). Our main goal was to have an open and honest discussion on what LLM evaluation looks like in the social sector today, and we were joined by over 160 practitioners from across civil society, academia, social impact, and the humanitarian sector.
Our key questions were:
- How do we test LLMs for safety and accuracy?
- Is the use of LLMs making us better or more efficient in our data analysis or decision-making? How can we know?
- How can we ensure we are looking at LLM evaluation through the lens of marginalization (e.g., gender)?
- How can we evaluate and assess whether the use of LLMs is moving the needle on a particular development outcome?
- And finally, at what point do we move from evaluating LLMs to evaluating their contributions to development impact?
Different languages of evaluation
One of our first points of consensus was that the language of tech-sector AI evaluation (“AI evals”) differs from that of many in the NGO sector, who approach evaluation from a socio-technical perspective. While AI engineers often think about benchmarking model performance, MERL practitioners in the NGO space are more likely to think about impact on people. Questions of safety and accuracy are therefore more complex than model performance alone.
Temina presented Agency Fund’s four-stage evaluation framework for AI-enabled social programs:
| Stage | Focus | Key Questions |
| --- | --- | --- |
| 1. Model Performance | Does the model work technically? | Accuracy, reliability, hallucinations, bias, safety. |
| 2. Product Analytics | Do people actually use the tool? | Engagement, drop-off, retention, usability. |
| 3. User Experience & Agency | Does the tool enhance or diminish a user’s agency? | Are users more confident, informed, and able to make decisions? Psychological metrics can be used here. |
| 4. Impact Evaluation | Does the intervention change real-world outcomes? | Learning, income, health, wellbeing, etc. |

Temina led us through the nuances stage by stage. In Stage 1, for example, an LLM can be used to measure the performance of another LLM, but in the social sector this should ideally also involve a human assessing the product, to reduce bias at the model evaluation stage (although it was agreed that human evaluation can introduce its own biases). Stage 2 relies largely on the standard metrics that startups and tech companies use to understand how users interact with a product (e.g., retention, drop-off). Stage 3 is about understanding whether the product affects the user’s beliefs, feelings, and behaviors. According to Temina, some indicators might be:
“Are they asserting themselves more with the language they’re using? Are they using “I” more than they did earlier? There are little signals we can pick up to understand whether users are experiencing agency and are developing the beliefs, feelings, or behaviors that they need to reach their goals. This is very different than just removing frictions in our workflows, which is what a lot of commercial products are targeting.”
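As an illustration only (not something presented at the event), here is a minimal sketch of how one such “little signal” could be computed from chat logs. The pronoun list, the threshold-free comparison, and the example messages are all assumptions; this is nowhere near a validated psychological metric.

```python
# Minimal sketch: first-person pronoun use as a rough proxy signal for
# user agency. The pronoun list and example messages are illustrative
# assumptions, not a validated measure.
import re

FIRST_PERSON = {"i", "i'm", "i've", "i'll", "my", "me"}

def first_person_rate(messages: list[str]) -> float:
    """Share of words across messages that are first-person pronouns."""
    words = [w for m in messages for w in re.findall(r"[\w']+", m.lower())]
    if not words:
        return 0.0
    return sum(w in FIRST_PERSON for w in words) / len(words)

# Hypothetical example: compare a user's early vs. recent messages.
early = ["What should someone do about school fees?"]
recent = ["I decided I'll talk to the teacher about my options."]
print(first_person_rate(early), first_person_rate(recent))
```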
Finally, Stage 4 is the more conventional kind of social sector impact evaluation: does the product actually improve a development outcome that governments or philanthropists care about, e.g., health indicators, learning outcomes, or changes in a household’s income or consumption? Temina also noted that cost and time differ between Stages 1 and 4: the first two stages tend to be more routine and easier to address with techniques such as A/B testing, whereas the latter two take more time to address.
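As an aside, the kind of routine A/B testing mentioned for the earlier stages can be fairly lightweight. Below is a minimal sketch (not from the event) of a two-sided two-proportion z-test on retention between a control and a new chatbot variant; the counts are made up for illustration.

```python
# Minimal sketch of a routine Stage 2 A/B test: did a new variant change
# 7-day retention? Counts below are illustrative only.
from statistics import NormalDist
from math import sqrt

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Hypothetical retention counts for control (A) and new variant (B).
print(two_proportion_ztest(success_a=420, n_a=1000, success_b=470, n_b=1000))
```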
Temina introduced the idea of a “Product Nutrition Label,” which functions like a food nutrition label, showing model performance, known risks, user engagement metrics, and user empowerment or agency measures. Below is an example of a product nutrition label for a Portuguese eduplay WhatsApp chatbot.

Note that you can provide feedback on the CGD/Agency Fund framework here: https://forms.gle/6VnJTEpE4M5iuaiz5
Benchmarking vs red teaming
Mala Kumar followed, making the distinction between benchmarking and red teaming. Benchmarking involves evaluating and comparing the performance of AI systems using tests, datasets, and metrics. Benchmarks can compare against previous versions, other models, human performance, or a predefined standard (a “golden dataset” is a term I’ve been hearing more and more). However, Mala noted that while benchmarking is common in the tech sector, it may not work in the social sector, where contexts can differ greatly due to culture, language, and other socio-cultural dynamics. Benchmarks are also difficult and expensive to develop. As a side note, Ethan Mollick recently suggested developing your own personal benchmark by constantly testing across LLMs and giving the LLM a “job interview,” but this might not be feasible or sufficiently rigorous at an organizational level.
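For readers unfamiliar with the mechanics, a stripped-down sketch of benchmarking against a “golden dataset” might look like the following. This is purely illustrative: the `query_model` callable is a placeholder for whatever model API is in use, the dataset entry is invented, and exact-match scoring is a simplification that real benchmarks usually avoid in favour of rubrics or human review.

```python
# Minimal sketch of benchmarking against a "golden dataset": run each
# prompt through a model and score responses against expert-approved
# answers. `query_model` is a placeholder for whatever API you use.
from typing import Callable

golden_dataset = [  # hypothetical expert-curated example
    {"prompt": "Can I take iron tablets while pregnant?",
     "expected": "Yes, if prescribed by a health worker."},
]

def benchmark(query_model: Callable[[str], str]) -> float:
    """Return the share of golden-dataset prompts answered 'correctly'."""
    correct = 0
    for item in golden_dataset:
        response = query_model(item["prompt"])
        # Naive substring scoring; real evaluations need richer criteria.
        correct += item["expected"].lower() in response.lower()
    return correct / len(golden_dataset)
```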
Instead of focusing on benchmarks, Mala and Humane Intelligence prefer to think in terms of red teaming. Red teaming has its foundations in cybersecurity, where the aim is to find system weaknesses. In the social sector, AI red teaming is used to identify weaknesses stemming both from unintended harms and from malicious attacks (more examples from Tarunima and Sarah below). Mala shared the following links from Humane Intelligence: their mission and vision, more information about their programs and services, and a piece on Why AI evals need to reflect the real world.
Context and culture
Next, we had Tarunima Prabhakar, Tattle’s Co-Founder. Tattle is a civic tech organization based in India, and Tarunima noted that AI models fail disproportionately for 1) languages with less digital content, 2) communities with lower tech access, and 3) groups experiencing marginalization (e.g., women, caste-oppressed demographics). Context is critical. In another piece on AI safety guardrails, she mentioned a quote from Anna Karenina, which struck me: “All happy families are alike, but all unhappy families are unhappy in their own way.” In that piece, Tarunima added: “There is general agreement on the happy/good responses of AI, but each bad response is bad in its own way, needing deeper diagnosis to figure out how we manage and fix it. A response can be bad because it adds noise when people come to it for information. It can be bad because it is different for men vs. women. It can be bad because it encourages self-harm. It can be bad because it sounds overconfident.”
In her presentation, Tarunima gave the example that prompts about acid procurement can signal gender-based attacks, but only if evaluators recognize the cultural and linguistic cues. This is why participatory dataset design is critical. Tarunima shared Tattle’s own experience in training users to red-team, for example by building Safety Benchmark datasets and testing them:

Sarah Amos from Humane Intelligence expanded on this with her experience in red teaming exercises, notably their most recent work with UNESCO, which provided a playbook on how red teaming could be conducted to test for bias. The example below is one of the red teaming exercises Humane Intelligence conducted:

Although such insults and threats would also have occurred in a pre-GenAI world, GenAI accelerates the automation and scaling of such behaviour. In addition, by framing the request as ‘storytelling,’ a bad actor can elicit these responses and subvert protections (i.e., guardrails) that model owners may already have developed. Anyone conducting red teaming needs to be trained not only in recognizing bias but also in communicating it and advocating for change. For example, is the issue due to outdated information, or to tone and sensitivity? If it is a chatbot giving advice on sensitive topics such as technologically mediated gender-based violence (T-GBV), is it sensitive across culture, language, and identity, and does it avoid victim-blaming?
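To make the mechanics concrete, a very simple harness along these lines might look like the sketch below: it sends paired framings of the same request (direct vs. reframed as ‘storytelling’) to a model and logs every response for trained human reviewers to annotate, rather than auto-scoring. The `query_model` callable, the placeholder prompts, and the CSV log format are all assumptions, not how Humane Intelligence or Tattle actually run their exercises.

```python
# Minimal sketch of a red-teaming harness: send variants of the same
# underlying request to a model and save all responses for trained
# human reviewers to annotate for bias and harm.
import csv
from typing import Callable

# Hypothetical paired framings of the same underlying request.
prompt_variants = [
    ("direct", "<the request being tested, stated directly>"),
    ("storytelling", "<the same request reframed as a story request>"),
]

def run_red_team(query_model: Callable[[str], str], out_path="redteam_log.csv"):
    """Query the model with each framing and log responses for human review."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["framing", "prompt", "response", "reviewer_notes"])
        for framing, prompt in prompt_variants:
            response = query_model(prompt)
            writer.writerow([framing, prompt, response, ""])  # notes added by reviewers
```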
Discussion: trust, transparency & power
While the discussion agreed on the need for a socio-technical approach to LLM evaluation that goes beyond product evaluation alone, there were a few points of divergence:
- It was noted that product cards and transparency tools are promising, but model cards can be sparse and may not expose the behind-the-scenes data labelling practices and metadata that could reveal further biases.
- Some noted that investing in UX (user research and interface design) could improve transparency and trust. For example, there was some discussion around the sycophantic nature of LLMs and how this may lead a chatbot user to give a more positive product evaluation even if chatbot responses are incorrect. However, others noted that curbing these potentially “dark UX” patterns is only likely through regulation.
- There was some discussion around who should be involved in user testing. Staging the testing (e.g., first internally, then with experts, and only finally with end users) helps ensure we do not increase trauma. Tarunima recommended working with women’s rights groups, psychologists, and activists to identify coded language and local metaphors reflecting harm and bias. In Sarah’s words: “If testing a T-GBV chatbot, you would want to make sure you are red-teaming it within your team, or it could be outsourced, but not directly to the population first, because you don’t want to further traumatize as a part of testing.”
- During the discussion, Evangelia Berdou shared slides on equitable AI evaluation in public services, prepared for the UK Evaluation Society. A key shift here is from single-loop through double-loop to triple-loop learning, where we question whether AI is needed at all and reflect on the values and power dynamics behind employing it.

- When Linda Raftree from The MERL Tech Initiative asked the “so what” question – i.e., how can we change model design to ensure inclusion as well as accuracy – the double diamond of UX came to mind for me. In user research, the first diamond is “discovering and defining” and the second is “developing and delivering”. The challenge is that the same people may not be present in both diamonds. In this data2x podcast, Emily Springer notes that GenAI arose primarily as a tech innovation rather than from clearly defined use cases. We need more open communication between users (including users beyond Silicon Valley) and designers.

- How do we operationalize this double diamond approach in AI for the social sector? We closed the discussion with broader questions on who within the non-profit/NGO sector has the capacity and responsibility to do LLM evaluations, and what happens with the findings once we have them. If we find that LLMs are not working well for our needs, contain major biases, or generate harmful outputs, what do we do? What influence do we have to get companies to change when it comes to upstream harms at the LLM level? Who has the relationships with AI companies needed to get changes made that address bias and harms? Some organizations, like Humane Intelligence, have relationships with some of the big AI companies and felt that companies do want to do better and are open to improving their models. Others, such as Tattle, felt there are fewer channels for being heard by companies. This raises the question of how the non-profit sector can organize and work together to advocate for improvements to LLMs, or for alternatives better aligned with the needs of LMICs and the values of the social sector. One thing we did agree on was that accuracy and inclusion should be complementary, not contradictory, metrics in AI evaluation.
This was the first of a series of discussions we are planning on how to evaluate the impact of LLMs in the social sector. Please keep an eye out for future events and don’t hesitate to volunteer to speak if you would also like to share your findings!