Bias in, bias out? How we’re understanding more about gender bias in LLMs
By Savita Bailur, Annie Brown, Kamila Wasilkowska
Towards the end of 2025, we were joined by Annie Brown, CEO of Reliabl, and Gender Insights founder Kamila Wasilkowska, to discuss their insights on LLM bias. We started our discussion with a short clip of the beautifully shot Indian feature film Humans in the Loop (2024), which shows how data labelling for AI companies can be at odds with indigenous knowledge, through the story of Nehma, a data labeller from the Oraon tribe in Jharkhand. Annie, who is also part of Humane Intelligence, reiterated this point as she opened our discussion, sharing insights from Reliabl's research. Kamila's talk, which followed, was particularly timely, given the ongoing debate in the qualitative research community for and against the use of LLMs in research.
How data labelling can contribute to bias
Annie noted how annotation is often carried out by underpaid workers in the Global South who are asked to interpret Western-framed categories without context. Taxonomies such as “beautiful woman,” “racy,” or “pregnant” rely on subjective judgments that are shaped by culture, caste, race, class, colorism, and colonial norms.
- Reliabl's analysis showed that across multiple image-generation models, 90 percent of outputs for "beautiful woman" depicted light-skinned women, even when ethnicity was implied.
- In India, Google’s annotators consistently labeled the same skin tones several shades darker than annotators in the United States, potentially influenced by local histories of caste and colorism.
- Images of women working out and pregnant women (especially Black women) were disproportionately labeled “adult” or “racy,” both by human labelers and by AI models (while male images were less likely to be labeled as such).
- Annie made the point that all this reinforced the need for more social scientists and researchers in the AI pipeline, able to understand data and data labeling in context. This echoed our October Gender, AI and MERL discussion, particularly Tattle's work in India on contextualizing AI queries.
Annie's conclusion was that more data does not necessarily result in more equitable portrayal; what matters is how the data is labeled. Data annotation is not neutral: it encodes intersecting biases that are later reproduced and magnified in AI outputs.
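As a rough illustration of the kind of disparity check these findings point to, the minimal sketch below compares how often images of different subject groups are flagged as "racy" or "adult". The dataset, file name, column names, and label values are all hypothetical, not Reliabl's actual data or methodology.

```python
# Hypothetical sketch of a label-disparity audit; not Reliabl's code or data.
# Assumed columns: image_id, subject_group (e.g. "pregnant_woman", "man_workout"),
# label ("racy", "adult", "safe"), annotator_region.
import pandas as pd

annotations = pd.read_csv("annotations.csv")  # illustrative file name

flag_rates = (
    annotations
    .assign(flagged=annotations["label"].isin(["racy", "adult"]))
    .groupby("subject_group")["flagged"]
    .mean()
    .sort_values(ascending=False)
)

# Large gaps between comparable groups (e.g. pregnant women vs. men working out)
# are a signal to revisit the taxonomy, annotation guidelines, and QA process.
print(flag_rates)
```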
Experiments with LLMs for Qualitative MERL
While Annie examined the input side of the AI pipeline, Kamila Wasilkowska of Gender Insights explored what happens at the output stage, especially when MERL professionals use LLMs for qualitative analysis. Kamila acknowledged that while qualitative software with integrated AI functionality is available, many professionals may turn directly to general-purpose LLMs because of ease of access and familiarity with their functionality and interface. In a sense, this could almost be an opportunity for the democratization of research analysis.
Gender Insights conducted systematic tests of four AI models, including ChatGPT, Claude, and Gemini, using a simulated qualitative dataset representing a fictional women’s economic empowerment project in Mexico City (20 KIIs and FGDs). Kamila’s team evaluated the models’ performance on six dimensions, including thematic consistency, hallucinations, intersectional analysis, consistency over time, and agreement with human coders. Some key insights:
- Claude performed strongest on intersectionality, identifying more dimensions of identity and flagging missing data, for example, when transcripts lacked information on age, disability or LGBTQI identity. Claude also explained its reasoning, including calling out why excluded patterns matter for justice-oriented analysis.
- ChatGPT showed more inconsistency and contextual distortion, for example, misinterpreting legitimate concerns about excessive donor KPIs as “bias against donors” or misreading localization preferences as discriminatory rather than contextually grounded. It also failed to connect related thematic threads, such as the relationship between women’s increased income, household tensions, and community status.
- Prompting helped mitigate some issues, but Kamila noted that prompting alone cannot fix deeper structural biases embedded in the training data. Outputs also varied day to day, underscoring the importance of human oversight, bias audits, and model comparison through tools like OpenRouter (see the sketch after this list).
- Kamila also flagged data safety considerations: how can respondents consent to AI-assisted qualitative analysis when even we as researchers are still learning how these models store, interpret, and reuse data?
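For readers curious what model comparison through a tool like OpenRouter can look like in practice, here is a minimal sketch that sends the same qualitative-coding prompt to several models via OpenRouter's OpenAI-compatible API. The model identifiers, prompt, and helper function are illustrative assumptions rather than Gender Insights' actual setup; check OpenRouter's catalogue for current model names.

```python
# Minimal sketch: run one coding prompt across several models via OpenRouter.
# Assumes the openai Python package and an OPENROUTER_API_KEY environment variable.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Illustrative model identifiers; check openrouter.ai for current ones.
MODELS = ["anthropic/claude-3.5-sonnet", "openai/gpt-4o", "google/gemini-pro-1.5"]

PROMPT = (
    "Code the following interview excerpt into themes, noting any gendered or "
    "intersectional dimensions and any missing identity data:\n\n{excerpt}"
)

def compare_models(excerpt: str) -> dict[str, str]:
    """Collect each model's coding of the same excerpt so a human can compare them."""
    outputs = {}
    for model in MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(excerpt=excerpt)}],
        )
        outputs[model] = response.choices[0].message.content
    return outputs
```

Running the same excerpt through several models makes divergent interpretations visible, which is exactly where human judgement needs to come back in.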
Kamila noted that in her experience the LLM bias was subtle rather than overt, requiring extensive validation by the researcher and a duty of care to go back and check transcripts to confirm what was actually said. Of course, we also agreed in our discussion that human bias exists whether AI is being used or not.
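One way to put a number on the "agreement with human coders" dimension, and on the validation Kamila describes, is an inter-coder reliability statistic such as Cohen's kappa. The sketch below uses made-up theme codes purely for illustration.

```python
# Illustrative only: comparing a model's theme codes against a human coder's
# for the same excerpts, assuming one code per excerpt.
from sklearn.metrics import cohen_kappa_score

human_codes = ["income_gains", "household_tension", "childcare",
               "income_gains", "community_status"]
model_codes = ["income_gains", "household_tension", "household_tension",
               "income_gains", "community_status"]

kappa = cohen_kappa_score(human_codes, model_codes)
# Values well below ~0.6 are usually read as weak agreement and a prompt
# to go back to the transcripts rather than trust the model's coding.
print(f"Cohen's kappa: {kappa:.2f}")
```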
Q&A and Community Priorities
Some insights in the Q&A:
- A poll we ran mid-way showed that the majority of participants in our CoP experiment with LLMs at the brainstorming and literature review stages of qualitative research, while use for thematic analysis is still at an early stage (at least within this group). Kamila's insights were helpful for those who were curious but had early concerns.
- There was a general request from the participants to have private sector representation in these calls, so we could hear more about how such data is labeled or why biases emerge in qualitative research.
- Annie’s perspective was that when we reframe inclusion as a matter of accuracy, private sector companies are more open: “at the end of the day, that’s the selling point of the model”.
- Annie also noted that bias can be redressed by employing a more diverse demographic of data labellers, but this diversity also needs to be reflected at the higher QA level, so that reviewers do not "correct" the original data labeling simply because it reflects a slightly different perspective (another point raised in the Humans in the Loop film).
- Kamila felt LLMs actually increase the need for gender expertise, because researchers must be able to catch the subtle ways models may interpret things differently.
It was clear that 60 minutes was not enough time. We are trying to unpack centuries of gender and intersectional bias, especially now that these biases are being built into large language models. Our work continues!
For more on Reliabl's work, please contact annie@reliabl.ai
For more on Gender Insights' work, please contact kamila@genderinsights.co.uk
You might also like
- Event: What are the resources we need to navigate AI, gender and MERL?
- Event recap: The Humanitarian AI Countdown and humanitarian knowledge production with Kristin Sandvik
- Research Digest 2: State of AI Adoption and Competencies for Evaluators for Made in Africa AI in MERL
- Event: Turning principles into actions – Made in Africa AI in MERL
