Join us on Oct 9 for a conversation about evaluating LLMs for accuracy and inclusion
The NLP Community of Practice is organizing an event on LLM evaluation, and you are welcome to join us!
On October 9, from 10:00 to 11:30 a.m. ET, the Gender, AI and MERL Working Group and the Sandbox Working Group are gathering to think through how to evaluate and benchmark different aspects of LLMs, whether they are applied in development and humanitarian programming or used to crunch data and summarize text for evaluations and other MERL tasks.
This event – which focuses on the MERL of AI-enabled programming – will start with working group leads Savita Bailur and Zach Tilton giving a short introduction to the overall evaluation arc, from the LLMs themselves to impact evaluation. Then we will go deeper with our expert speakers on the various points at which evaluation is (or should be!) happening. We plan on exploring questions such as:
- How do we know if and when the use of AI is safe and accurate?
- Is the use of AI tools actually making us better or more efficient in our data analysis or decision-making? How can we know?
- How can we ensure we are looking at AI evaluation through a gender lens?
- How can we evaluate and assess if AI is actually moving the needle of a particular development outcome? At what point do we move from evaluating the AI tools to evaluating their contributions to development impact?
We will be joined by the following experts in LLM evaluation:
- Mala Kumar (from Humane Intelligence) will explain the evaluation of LLMs and how the term ‘evaluation’ is used in the tech world.
- Sarah Amos (from Humane Intelligence) will explore questions related to mis- and disinformation and LLM evaluation in terms of accuracy.
- Annie Brown (from Reliabl) will discuss assessing LLM safety and how to address gender and other aspects of inclusion by creating (and testing) inclusive taxonomies.
- Tarunima Prabhakar (from Tattle) will discuss participatory red-teaming to assess LLMs on gender-based violence and safety in general.
- Temina Madon (from The Agency Fund) will talk about the evaluation process at the application, user, and impact levels and what needs to be considered at each level based on their framework for evaluation of AI in the development sector.
Please register here and join us!