GenAI for Development Needs Its Own Evaluation Standards Before It’s Too Late


There is a pattern in the history of global development that we keep repeating, and we are repeating it right now with GenAI.

The pattern goes like this. A new paradigm arrives: a compelling idea about how to address poverty, disease, inequality. Money flows. Governments, foundations, and NGOs rush to fund, implement, and report. Programmes scale, and implementation accelerates. But the systems for understanding whether any of it works (how, for whom, and at what cost) arrive late, if at all.

We have lived through this cycle with structural adjustment, with microcredit, with mobile money, with ICT4D. We are living through it again with GenAI for development. And the stakes, this time, are higher than they have been before.

How evaluation evolved in development

The story of how global development came to rely on its current evaluation standards is not one of foresight, but of accumulated failure.

Figure 1. The evolution of evaluation in global development

In the decades following World War II, reconstruction aid flowed primarily through bilateral relationships with minimal harmonisation requirements. Development was largely treated as a technical problem, the transfer of capital, expertise, and infrastructure from richer to poorer countries, and the question of whether it was working was answered by outputs: roads built, vaccines distributed, schools constructed. Whether those outputs translated into improved lives was rarely asked systematically, and almost never answered independently.

The participatory development movements of the 1970s challenged this logic. Theorists and practitioners such as Robert Chambers and Paulo Freire, along with many others from the Global Majority, argued that development done to communities rather than with them was not only ineffective but potentially harmful. Community agency, local knowledge, and contextual validity were not soft concerns; they were determinants of whether interventions worked or failed. But the evaluation infrastructure of the time had no real way to capture these dimensions. Logframes could count outputs. They could not assess whether the people being served were better off in their own terms.

The 1980s brought a harder lesson. Structural adjustment policies, imposed largely through conditionality attached to World Bank and IMF loans, shifted responsibility for social services to local governments without providing the resources or capacity to deliver them. The outcomes were, in many cases, catastrophic: collapsed health systems, rising inequality, damaged public institutions. And yet the evaluation systems in place largely registered these programmes as successful against their own narrow indicators. The accountability gap was not incidental. It was structural. The frameworks being used to measure success had been designed by the institutions implementing the interventions.

By the 1990s and 2000s, the field could no longer ignore the gap between investment and evidence. A series of international commitments (the Rio Earth Summit’s Local Agenda 21, the Paris Declaration on Aid Effectiveness, the Accra Agenda for Action, and the Grand Bargain) tried to address it, establishing principles of ownership, alignment, harmonisation, and mutual accountability. In parallel, the OECD Development Assistance Committee developed what became the field’s first widely adopted evaluative criteria: relevance, effectiveness, efficiency, impact, and sustainability, later joined by coherence.

These criteria were not perfect. They were designed for bilateral aid programmes with defined beneficiary populations and stable theories of change. They embed, as critics have rightly noted, a donor-centric logic in which relevance is partly defined by funder priorities, and sustainability is framed around what happens when donor funding ends, implicitly accepting dependency as the baseline condition. The African Evaluation Association, UNEG, and a growing body of scholars from the Global Majority have all pushed back on the limitations of OECD-DAC criteria for contexts that were never consulted in their design.

But imperfect as they are, these criteria did something essential: they gave the field a shared language for evaluative judgement. They established a floor, a minimum expectation that interventions would be assessed not just on what they produced but on whether they were the right thing to do, whether they worked, and whether the results lasted. That shared language made it possible to compare across programmes, to hold funders and implementers accountable, and to accumulate evidence over time about what works and what doesn’t.

The AI for development field does not yet have this. And it is spending money as if it does. The lesson from development history is not that evaluation frameworks prevent failure. They do not. The lesson is that with no shared standards for asking hard questions, failure becomes much harder to see, much harder to learn from, and much harder to correct.

This is the moment GenAI for development finds itself in. The technology is advancing rapidly. Investment is accelerating. But the field has not yet agreed on the standards needed to determine whether these interventions are creating meaningful and sustained value.

The scale of what is happening

To understand the stakes, consider the landscape. In the past three years alone, major foundations including Gates, Wellcome, Rockefeller, and Omidyar have committed hundreds of millions of dollars to GenAI for development initiatives. Google.org has launched AI collaborative funds across health, agriculture, and education. USAID, before its decimation in 2025, had made AI central to its digital development strategy. The GSMA has multiple challenge funds specifically focused on GenAI for women’s economic empowerment in sub-Saharan Africa. Development finance institutions are beginning to treat GenAI not just as a tool for their grantees but as an investment asset class.

The tools being funded cut across clinical decision support, agricultural advisory chatbots, legal aid platforms, financial inclusion services, maternal health companions, and educational tutors. They are being deployed in dozens of languages across dozens of countries, serving populations with widely varying levels of digital literacy, connectivity, and trust in automated systems.

And the evaluation frameworks being used to assess them? In most cases, they are either borrowed from commercial product development (engagement metrics, retention rates, A/B testing) or carried over from traditional development evaluation, which was never designed for adaptive, technology-mediated interventions. Neither is sufficient. Both can be misleading.

The AI Evaluation Playbook and its limits

This gap is beginning to receive attention. The emergence of the Generative AI Evaluation Playbook represents an important step towards building a shared approach for assessing GenAI interventions in development. It is an attempt to answer a question the field urgently needs to confront: how do we move beyond demonstrating that an AI tool functions and begin understanding whether it is actually creating value?

The Generative AI Evaluation Playbook, developed by the Center for Global Development and the Agency Fund, is a genuine contribution. It is the most serious attempt yet to build shared evaluation standards for GenAI interventions in the development sector. Its four-level framework (model evaluation, product evaluation, user evaluation, and impact evaluation) does something important: it puts conversations that usually happen in separate silos into a single coherent structure. Normally, tech teams who obsess over model accuracy talk past programme managers who worry about adoption, who in turn talk past evaluators focused on impact. The Playbook forces them to look at each level as a dependency of the others.

The Minimum Viable Evaluation section is particularly valuable. In a sector where most organisations do not have the resources or expertise for a full four-level evaluation, it provides a practical baseline of “here is the minimum you should be doing,” addressing a real and pressing need.

The Playbook represents an important foundation. But if the goal is to establish evaluation standards for GenAI in development, three additional dimensions need to be addressed.

1. Clarifying what kind of evidence each level produces.

The first issue is conceptual clarity. Before building evaluation standards, the field needs greater precision about what different forms of evidence can tell us, and what claims they can legitimately support.

The Playbook is often referred to as the “AI Evaluation Playbook,” implying that it covers all kinds of AI, yet it is mainly applicable to GenAI, specifically to chatbots. Other playbooks are likely needed to cover other kinds of (Gen)AI, for example, frontline worker tools, clinical decision-making support bots, computer vision and medical diagnostics, deep tech, and the development of additional language models or wider systems approaches that involve AI.

In addition, the Playbook describes its framework as four levels of evaluation. But while it’s common to hear Level 1 referred to as ‘eval’ in the world of AI, not all four levels would be considered evaluation in a conventional development context.

  • Level 1 is model testing and assessment, an engineering activity concerned with whether the AI system produces accurate, safe, and consistent outputs.
  • Level 2 is product monitoring and optimisation, a product management activity focused on whether users are engaging and whether the product performs as designed.
  • Level 3 is user outcome monitoring, which sits closer to programme monitoring than evaluation, tracking whether the product is changing users’ knowledge or behaviour.
  • Level 4 is the only level that constitutes summative evaluation in any meaningful professional sense: an independent judgement, against defined criteria, about whether an intervention is working, for whom, and whether it should continue.

Calling all four of these “evaluation” obscures what each level can and cannot claim. It risks giving implementing organisations false confidence that running product analytics meets accountability obligations. And it risks giving funders the impression they are receiving evaluation evidence when they are receiving performance data. The Playbook’s own architecture would be stronger, and more honest, if each level were labelled by its primary function: what kind of evidence it produces, and what claims it can legitimately support.
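
For teams that track their evidence in code or configuration, the relabelling suggested above can be sketched as a small lookup table. This is only an illustration of the argument in this piece, not part of the Playbook: the names `Function`, `Level`, `LEVELS`, and `supports_accountability_claim` are hypothetical, and the wording of each question is paraphrased from the bullet list above.

```python
from dataclasses import dataclass
from enum import Enum


class Function(Enum):
    """Primary function of each level, as relabelled in this article (hypothetical names)."""
    MODEL_TESTING = "model testing and assessment"
    PRODUCT_MONITORING = "product monitoring and optimisation"
    USER_OUTCOME_MONITORING = "user outcome monitoring"
    SUMMATIVE_EVALUATION = "summative evaluation"


@dataclass(frozen=True)
class Level:
    number: int
    playbook_name: str   # name used in the Playbook's four-level framework
    function: Function   # relabelling argued for in this article
    question: str        # what evidence at this level can speak to


# Illustrative table; the questions paraphrase the bullet list above.
LEVELS = [
    Level(1, "model evaluation", Function.MODEL_TESTING,
          "Does the AI system produce accurate, safe, and consistent outputs?"),
    Level(2, "product evaluation", Function.PRODUCT_MONITORING,
          "Are users engaging, and does the product perform as designed?"),
    Level(3, "user evaluation", Function.USER_OUTCOME_MONITORING,
          "Is the product changing users' knowledge or behaviour?"),
    Level(4, "impact evaluation", Function.SUMMATIVE_EVALUATION,
          "Is the intervention working, for whom, and should it continue?"),
]


def supports_accountability_claim(level: Level) -> bool:
    """Assumption from this article: only summative evaluation backs accountability claims."""
    return level.function is Function.SUMMATIVE_EVALUATION


if __name__ == "__main__":
    for level in LEVELS:
        label = "evaluation evidence" if supports_accountability_claim(level) else "performance data"
        print(f"Level {level.number} ({level.function.value}): {label}")
```

Under these assumptions, a report drawing only on Levels 1 to 3 would be flagged as performance data rather than evaluation evidence, which is the distinction funders and implementers most need to keep in view.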

2. Adding an evaluative framework for judgement.

The second issue is that measurement and evaluation are not the same thing. The Playbook provides important guidance on what should be measured and assessed at different stages of an AI intervention, but evaluation also requires a framework for making judgements: whether an intervention is relevant, whether it is producing meaningful results, for whom, under what conditions, and whether those results justify continued investment.

A development evaluator reading the Playbook will find limited guidance on the evaluative criteria or normative framework needed to interpret evidence. Established development evaluation frameworks, including the OECD-DAC criteria, the UNEG Norms and Standards, and the African Evaluation Association Guidelines, do not currently form part of the framework. The result is that the Playbook provides important measurement architecture but does not yet provide the evaluative language needed to make judgements about what the evidence means.

A related issue concerns sequencing. The framework largely follows product development logic rather than evaluation logic: it begins with assessing model performance before establishing whether the model and the product are the right solution for the right problem. In development evaluation, effectiveness cannot be considered separately from relevance. A chatbot can perform well against technical accuracy benchmarks while still being poorly aligned with what communities need, whether they trust it, or whether it addresses the underlying problem. The field learned this lesson during the ICT4D era: technology can function as intended while adoption remains limited and intended outcomes fail to materialise. GenAI evaluation risks repeating that pattern unless relevance and contextual fit are considered before technical performance.

3. Ensuring sustainability is treated as an evaluation question.

The third issue is sustainability. Development experience has repeatedly shown that short-term effectiveness does not necessarily translate into lasting impact. AI interventions are no exception.

Sustainability is a core development evaluation criterion for exactly this reason: interventions that produce results during the funding period often do not last. For AI tools this is not an abstract concern. The inference costs of running a language model do not disappear when a grant ends. The institutional capacity to maintain, update, and govern an AI system does not emerge spontaneously. The community trust required for sustained adoption takes years to build and can be destroyed by a single harmful output. An evaluation framework that does not assess whether the conditions for sustained impact are being built will systematically miss the most important question about whether AI for development is delivering value.

What happens without shared standards

The history of evaluation in development gives a clear answer to what happens when the evidence infrastructure lags behind the investment cycle.

Without shared standards, the field optimises for what can be measured rather than what matters. Engagement metrics become proxies for impact. Adoption figures become substitutes for evidence of benefit. Funders make portfolio decisions on the basis of outputs rather than outcomes, and the interventions that scale are not necessarily the ones that work; they are the ones that can demonstrate numbers quickly.

In the absence of evaluative independence, self-reported success becomes the norm. Implementing organisations have powerful incentives to frame their results positively, and without independent evaluation, there is no structural check on this. The history of impact measurement in development is littered with programmes that looked successful in self-reported data and failed in independent assessment.

Where shared criteria are missing, the field cannot learn across programmes. Evidence from a maternal health chatbot in Kenya cannot meaningfully be compared to evidence from an agricultural advisory platform in India if they are using different metrics, different methods, and different standards for what counts as success. The accumulated investment of hundreds of millions of dollars produces not cumulative knowledge but a collection of individual case studies that cannot be synthesised into actionable learning.

This is not a hypothetical future. It is happening now.

What getting it right would look like

The GenAI for development field needs its own evaluation standards for roughly the same reasons that the broader development sector needed the OECD-DAC criteria in the 1990s: money is flowing faster than accountability can follow, and with no shared standards, the field will not be able to distinguish what works from what merely appears to work.

Getting it right does not require starting from scratch. The Playbook is a foundation worth building on. What it needs is a development evaluation layer: one that adds what it currently lacks but does not duplicate what it does well.

That layer would start with a new Stage 1 relevance check, grounded in localisation thinking, asking whether the problem has been defined by communities rather than funders, whether GenAI is the right solution, and whether the evaluation framework has been designed with rather than for the people it will affect. It would map each evaluation level to appropriate evaluative criteria and indicate which core global and regional evaluation frameworks should apply: OECD-DAC (adapted), UNEG Norms, AfrEA Guidelines, or a hybrid designed specifically for adaptive technology interventions. It would provide guidance on evaluative independence for accountability claims. It would also include contribution analysis as an alternative causal approach for contexts where RCTs are infeasible, inappropriate, or simply the wrong question. Finally, it would incorporate a sustainability dimension that assesses whether the institutional, financial, and community conditions for sustained impact are being built, rather than focusing only on whether the tool is working today.

(Note: Our team at The MERL Tech Initiative is working on guidance for a new Level 0, which would focus on Formative Research and Digital Design as critical precursors to developing any GenAI ‘solution’ or effort).

This is a collective task. No single organisation has the disciplinary range to build this alone. It requires technology developers, development evaluators, communities, funders, and researchers working together. It is precisely the kind of collaboration that has historically been difficult to sustain but that this moment requires.

The history of evaluation in development is, in many ways, a history of learning late. Frameworks and standards emerged after failures had already exposed their absence. The opportunity with GenAI is to do something different: to build the evidence infrastructure alongside the technology, rather than waiting until the consequences of poorly understood interventions become impossible to ignore.

GenAI may transform development practice. But whether that transformation improves people’s lives will depend not only on what the technology can do, but on whether the field has the discipline to ask where it should be used, for whom, under what conditions, and with what evidence.

The question is whether we will build those standards before the next wave of scale makes it impossible, or whether, once again, we will learn the lessons only after the costs have already been paid.

  Disclosure: This piece used Claude to assist in drafting a timeline of the evolution of evaluation in global development. The analytical framing, structure, and editorial choices are my own, informed by my professional judgement and experience.  
