“Hey ChatGPT, what is missing from our reports?”: A Generative AI Use Case for Public Sector MERL


We are grateful to David Heath, Karly Macleod and Donna Anderson from the Shared Services Canada Evaluation Division for their vision in commissioning this exercise and for their commitment and collaboration in strengthening their MEL processes.

Generative Artificial Intelligence (GenAI) is undoubtedly accelerating some monitoring, evaluation and learning (MEL) tasks. Evaluators are cautiously experimenting with these tools to find the most relevant GenAI use cases while remaining alert to bias. Earlier this year, the MERL Tech Initiative (MTI) undertook a sprint for Shared Services Canada (SSC), a department in the Government of Canada (GoC), guided by the question: Can Generative AI help government evaluators identify what might be missing from reports to refine the next cycle of evaluations?

SSC is an operational department that supports and manages IT infrastructure services across Canadian government departments and agencies. It conducts frequent audits and evaluations across a broad range of policies, programs, and responses (e.g., an infrastructure response to COVID-19 or upgrading mobile and fixed line infrastructure) and, like Canadian government programs for social or international development assistance, has consistently applied a gender lens. The sprint was therefore an invitation to a) further refine this approach and b) conduct a collaborative experiment on the value of GenAI for public sector MEL practitioners.

MTI is of the opinion that we need more documentation of such experiments to help the sector learn. SSC generously offered the output of the sprint as a public good and we collectively agreed to publish the process and learnings. This blog unpacks the scope, methods, and findings from the five-day pilot. 


The Scope: Gender Analysis Meets Generative AI

M&E professionals spend most of their time measuring and assessing the work of others, and while we are a reflective group, we may not always have the time or space to “do M&E on our M&E” in order to identify gaps and areas where we can improve our own practice. With this in mind, we were excited when we received a specific request from SSC: 

“Please experiment with a Generative AI tool to help identify potential gaps related to gender inclusivity and intersectionality in our evaluation reports for operational programs.”

Figure 1 illustrates the subsequent decision-making process.

Figure 1: MTI and SSC’s process of experimentation

We recognized both the limitations and biases of ChatGPT (identified below) and the fact that Gender-Based Analysis Plus (GBA+) is a Canada-specific framework; other countries use comparable frameworks (also noted below).

Figure 2: Some identity factors considered in GBA+


Key Questions

The project centered on the following questions:

  1. Can ChatGPT be prompted to apply an intersectional lens to government evaluation reports?
  2. How well does ChatGPT identify gender and intersectional gaps in evaluations?
  3. What kinds of prompts produce the most accurate and insightful outputs?
  4. What are the risks, limitations, and ethical considerations of using GenAI in this way?

Methods: Prompting, Testing, Refining

We used an iterative and incremental method, grounded in human-in-the-loop testing. Five priority evaluation reports were selected in consultation with SSC – two internal to SSC and three broader Government of Canada reports. The evaluations focused on operational programs rather than social programs and did not contain personal, sensitive or confidential data.

Each report was run through a standardized workflow (a minimal scripted version is sketched after the list):

  1. Initial prompts asking ChatGPT to assess the report against the intersectional framework
  2. Manual review of both the report and ChatGPT’s analysis
  3. Prompt refinement based on observed weaknesses (e.g., vague outputs, wrong focus)
  4. Re-test with the revised prompt set
  5. Test the outputs against a published “control report” that explicitly used GBA+, to confirm we were supplying effective prompts
  6. Ask SSC staff to test the prompts against new reports
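In the pilot itself, these steps were run interactively in ChatGPT. For teams that would rather script the first pass, a minimal sketch using the OpenAI Python client might look like the one below. The model name, file paths, and prompt wording are illustrative assumptions, not the pilot’s exact setup (the final prompts are in Appendix A).

```python
# A minimal sketch of workflow step 1: run each report through a single
# "assess against the framework" prompt. Model name, file paths and prompt
# wording are illustrative assumptions, not the pilot's exact setup.
from pathlib import Path
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical local copies of a GBA+ framework summary and the reports.
framework = Path("gba_plus_framework.txt").read_text()
report_paths = sorted(Path("reports").glob("*.txt"))

for report_path in report_paths:
    report = report_path.read_text()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any capable chat model would do
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting an evaluation team. Use the following "
                    "GBA+ framework as your only reference standard:\n\n" + framework
                ),
            },
            {
                "role": "user",
                "content": (
                    "Assess this evaluation report against the framework and "
                    "identify gaps in gender and intersectional analysis:\n\n" + report
                ),
            },
        ],
    )
    # Output goes to the screen for manual review (workflow step 2);
    # nothing is taken at face value.
    print(f"=== {report_path.name} ===")
    print(response.choices[0].message.content)
```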

Mixed Results, Clear Lessons

1. ChatGPT undoubtedly speeds up MEL review, but a reference framework is essential

Supplying ChatGPT with a recognized framework or “look-up” resource (GBA+ in this case) undoubtedly accelerated the review process and reduced hallucination. The framework gave ChatGPT a standardized reference against which to interrogate each report.

However, we needed to keep two major factors in mind:

  1. any biases in the GBA+ framework itself
  2. any biases we may have introduced as evaluators in the way we applied the framework

To check replicability, at the end of the exercise we asked SSC staff to apply the prompts we supplied (available at the end of this blog) to new reports. Their results were consistent with ours: the prompts reliably flagged where gender and intersectionality were absent from reports.

2. Chain-of-thought prompting improves depth

The quality of ChatGPT’s analysis improved dramatically with well-scoped prompts. For instance:

  • Directing ChatGPT to “focus on methodology sections” helped it identify whom the evaluations targeted
  • Asking whether identity-based barriers were addressed elicited more insight
  • Referring directly to GBA+ links reinforced alignment with the official standard

This was a reminder that supplying ChatGPT with a relevant framework and specific details will result in sharper insights than simply asking a general question. 

One major finding was that multi-step prompting (asking ChatGPT to extract framework topics, then evaluate each in turn) yielded deeper, more structured analysis than single-shot questions. This mimicked how a human evaluator might logically approach a report (see Appendix A below for the prompts).

Based on this learning, we created a standardized four-step prompt sequence (a scripted sketch of the sequence follows the list):

  1. Extract topics from the GBA+ framework
  2. Load report and set context
  3. Evaluate the report topic-by-topic
  4. Summarize results in a standardized table
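A scripted version of this sequence might chain the four prompts through a single conversation, carrying the chat history forward so that each step builds on the previous one. The wording below is paraphrased for illustration; the prompts actually used in the pilot are in Appendix A.

```python
# A sketch of the four-step prompt sequence as one multi-turn conversation.
# Prompt wording is paraphrased for illustration; see Appendix A for the
# prompts actually used in the pilot.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
framework = Path("gba_plus_framework.txt").read_text()  # hypothetical local copy
report = Path("reports/example_evaluation.txt").read_text()  # hypothetical report

steps = [
    # 1. Extract topics from the GBA+ framework
    "List the identity factors and analysis topics covered by this GBA+ framework:\n\n" + framework,
    # 2. Load the report and set context
    "Here is an evaluation of an operational (not citizen-facing) government program. "
    "The affected population is mainly staff in other departments:\n\n" + report,
    # 3. Evaluate the report topic-by-topic
    "For each topic you listed earlier, assess whether the report addresses it, "
    "quote the relevant passage if so, and flag it as a gap if not.",
    # 4. Summarize results in a standardized table
    "Summarize your assessment as a table with columns: topic, addressed (yes/partial/no), "
    "evidence, suggested improvement.",
]

messages = []
for step in steps:
    messages.append({"role": "user", "content": step})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # carry history forward

print(answer)  # the final summary table, for human review
```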

One clear implication is that evaluators, and public sector staff more broadly, would benefit from hands-on prompt training: clear, structured prompts significantly improve GenAI outputs. Our second recommendation is that a standard library of tested prompts, like the one developed in this pilot, can enable replicability and scale. As evaluators, we should share more prompts as a public good to accelerate our collective use of GenAI.

3. Even with a reference framework, human oversight is critical

Across the five reports, ChatGPT consistently flagged a greater need for gender inclusivity. However, it was important for us to bear in mind a number of contextual factors:

  • The reports supplied were evaluations of operational programs within the Government of Canada that provided services to other government departments and agencies. ChatGPT often defaulted to assessing the impact on citizens or end-users rather than internal staff, indicating the need for scoping prompts that clarify the intended user group. ChatGPT also failed to pick up on intersectionality: in one instance, it missed a section of the Mobile and Fixed Line Services report analyzing accessibility for persons with disabilities.
  • Reports did not have a consistently named “methods” section, which is where data on gender and intersectionality were usually found, so ChatGPT sometimes missed this information. We refined the prompt to focus explicitly on sections named Methodology or Approach, to remind ChatGPT that the demographic was not always an end-user, and to consider intersectionality as well (an illustrative version of this scoping prompt follows the list).
  • In conversations with SSC, we found that in some instances the source material itself could have been skewed. For example, GBA+ had been actively considered in the preliminary findings and early drafts of one SSC report, but this material was removed prior to the final report for brevity and because it did not lead to a recommendation. This may be a useful checkpoint for future use of agentic or process-oriented GenAI: understanding at what stage an inclusivity angle may be cut from a text, and why.
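To make that scoping concrete, an illustrative version of the refined prompt might read as follows. This is a paraphrase under our assumptions about the reports, not the exact wording used in the pilot (see Appendix A).

```python
# An illustrative scoping prompt reflecting the refinements described above.
# The wording is paraphrased; the pilot's actual prompts are in Appendix A.
SCOPING_PROMPT = """
Focus your review on the sections titled 'Methodology' or 'Approach', where
data collection and demographic details usually appear.

Keep in mind that this evaluation covers an operational program: the people
affected are primarily staff in other government departments and agencies,
not citizens or external end-users.

Assess intersectionality explicitly, considering how GBA+ identity factors
(e.g. gender, disability, age, language, geography) overlap, and quote the
report sections you relied on.
"""
```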

4. Consider the generalizability of any such exercise

Often, GenAI exercises are very specific to a use case and risk being dismissed if not directly relevant to an evaluator. However, we can improve their generalizability by considering the other contexts in which they could be used. For example, if we are researching inclusivity in evaluations, we could:

  • Explore global insights on gender: Canada isn’t alone. Countries like the UK, Australia, and South Africa use similar gender inclusivity frameworks (Equality Impact Assessments, Gender-Responsive Budgeting). These learnings could be shared with relevant teams if they are helpful in refining evaluations.
  • Share cross-domain experiences: Use the “relevant framework x GenAI” approach in other domains, such as education or health. These applications do not have to test gender inclusivity; they could examine other factors, such as geography, age, or education level.

What Next?

One of the authors further tested this exercise in her role as an adjunct professor at Columbia University’s School of International and Public Affairs, in a graduate lab class on practical tools for international development. The aim was to understand whether we were missing anything in the approach, or whether it was too simplistic. These future international development practitioners found it a very useful practical exercise. While they raised valid questions around bias, hallucination, and how such pilots might lead to systemic change, they noted that the key insight was the value of using a reference set (in this case, the GBA+ framework).

This short pilot showed that, even with limitations, GenAI tools like ChatGPT can support gender-responsive evaluation and brainstorming if thoughtfully applied. As Stefan Verhulst writes in his paper “Inquiry as Infrastructure”, policy-makers are increasingly engaging with AI, and well-framed questions are key to better policy. We are currently talking to Shared Services Canada about further sprints, and we look forward to hearing from you if this is something you are also interested in exploring.


For more details

  • Appendix A shows the final prompts we used in this exercise.
  • Appendix B explains how we checked the prompts against a “control” report that explicitly employed the GBA+ framework.
