Event Recap: African languages, linguistic complexity, and ethical and inclusive AI
On August 14, 2025, the NLP Community of Practice’s AI in Africa Working Group came together to reflect on Africa’s place in AI—particularly how African languages can be integrated into emerging technologies. The discussion also explored how African language models can advance Monitoring, Evaluation, Research, and Learning (MERL) outcomes. Our keynote speaker, Prof. Mpho Primus, guided us through a much-needed conversation on this challenge. In this blog, I will be sharing insights and reflections from the working group meeting.
Linguistic Exclusion from AI
Prof. Primus opened the session with an overview of AI and natural language processing (NLP) in Africa. She noted that while Africa is home to more than 2,000 languages – nearly a third of the world’s linguistic diversity – fewer than 2% are meaningfully supported in AI, and fewer than 5% have the kinds of digital resources (e.g., datasets, corpora, or toolkits) that make them usable for NLP. This exclusion is not simply technical. It threatens cultural preservation, limits access to technology, and determines who gets to participate in shaping Africa’s AI-driven future.
At the core of this exclusion is data scarcity. Unlike dominant global languages, many African languages lack digitized corpora, and some are primarily oral, making them harder to capture and annotate. Even where data exists, it is often scattered, unrepresentative, or locked in institutions. This cannot be solved by scraping what is online. Instead, there must be a commitment to understanding languages themselves—their structure (morphology), sound systems (phonology), use of tone (tonology), and context (pragmatics)—before encoding them into AI. Without this grounding, AI risks erasing nuance and flattening African languages into incomplete or distorted forms.
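To make this concrete, the sketch below shows one way an annotation record could capture some of these layers (morphology, tone, pragmatic context) before any modelling happens. It is purely illustrative: the field names and example values are hypothetical, not drawn from the session or any real corpus.

```python
# Illustrative sketch only: a possible structure for a tone-aware annotation
# record. Field names and example values are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class AnnotatedToken:
    surface: str              # the word as written or transcribed
    morphemes: list[str]      # morphological segmentation
    tone_pattern: str         # e.g. "H-L" for high-low, if the language is tonal
    gloss: str                # rough English gloss
    pragmatic_notes: str = "" # register, honorifics, context of use

record = AnnotatedToken(
    surface="<placeholder>",
    morphemes=["<prefix>", "<root>"],
    tone_pattern="H-L",
    gloss="<meaning in this context>",
    pragmatic_notes="used with elders; formal register",
)
```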
Finding a Way Forward for Africa
Prof. Primus compared two main approaches to natural language processing:
- Rule-based systems that use explicit linguistic rules. These are especially well-suited to tonal languages in the Bantu family, where even slight tone changes alter meaning. These systems are transparent and work with minimal data, but are costly to scale and fragile when faced with variation.
- Statistical systems (machine learning and deep learning). These adapt quickly and scale easily, but require massive datasets and often operate as opaque black boxes.
Both approaches come with limitations in African contexts. Rule-based systems demand linguistic and computational expertise that is in short supply across much of the continent, while statistical models depend on large datasets that, for most African languages, do not exist.
A hybrid approach offers a way forward. For tonal languages, for example, annotated phonological data could feed into statistical models, capturing tone alongside meaning. This not only preserves linguistic nuance but also ensures that variations within communities—such as generational shifts in usage—are not erased. The challenge is to build models that reflect the social and cultural realities of African speech communities, not simply repurpose Global North frameworks and tools with minor adjustments.
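As a rough illustration of what such a hybrid could look like in code, the sketch below uses a simple rule-based step (counting high- and low-tone diacritics, as used in orthographies such as Yoruba’s) to produce tone features that are fed into a statistical classifier alongside ordinary surface features. The example texts, labels, and vectorizer settings are placeholders; this is a minimal sketch of the plumbing, not a production pipeline.

```python
# Minimal sketch of a hybrid pipeline: rule-based tone features + a statistical
# classifier. The texts and labels below are hypothetical placeholders.
import unicodedata
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

HIGH, LOW = "\u0301", "\u0300"  # combining acute (high tone) and grave (low tone) marks

def tone_features(text):
    """Rule-based step: count high- and low-tone diacritics in NFD-decomposed text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [decomposed.count(HIGH), decomposed.count(LOW)]

def featurize(texts, vectorizer, fit=False):
    """Combine character n-gram features with the rule-based tone features."""
    surface = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    tones = csr_matrix([tone_features(t) for t in texts])
    return hstack([surface, tones])

# Hypothetical examples: short phrases distinguished mainly by tone marking.
texts = ["bá mi", "ba mi", "rí i", "ri i"]
labels = ["meaning_a", "meaning_b", "meaning_a", "meaning_b"]

vec = CountVectorizer(analyzer="char_wb", ngram_range=(1, 2))
clf = LogisticRegression().fit(featurize(texts, vec, fit=True), labels)
print(clf.predict(featurize(["dé e"], vec)))
```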

Grounding AI in Local Realities
There are already initiatives that show what is possible. Masakhane, for example, has demonstrated that Africa can lead African NLP development while keeping the work rooted in community-driven, inclusive practices. Decentralisation is key here—it ensures that knowledge production is not concentrated in Global North academic or corporate hubs, but spread across Africa, where communities decide what data to collect and how to govern it.
Participants attending the session from outside Africa shared parallels with Indigenous language movements in Canada and Latin America, where data, like land or heritage, is seen as something to be protected and owned by communities. These connections highlighted awareness, activism, data rights, and data sovereignty as important strategies for resisting extractive models of technology and ensuring AI reflects lived African realities. (See, for example, Indigenous Perspectives on AI in Canada: How protocols and protections can be put in place for data sovereignty, and Content Moderation in the Global South: A Comparative Study of Four Low-Resource Languages.)
Small Language Models: A Strategic Angle for Africa
Prof. Primus suggested that Small Language Models (SLMs) may be Africa’s best path forward. Unlike massive large language models (LLMs), SLMs do not require vast datasets or expensive infrastructure. By focusing on core NLP tasks—translation, speech recognition, text classification—smaller, more agile models can be built to fit African contexts.
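As a hedged illustration of what building on an SLM might involve in practice, the sketch below fine-tunes a compact pretrained encoder for text classification using the Hugging Face Trainer API. The model name (castorini/afriberta_small) is offered only as one example of a smaller African-language encoder and could be swapped for any compact model; the two-sentence toy dataset is a placeholder for a curated, community-governed corpus.

```python
# Minimal sketch: fine-tuning a compact pretrained encoder for text
# classification on modest hardware. Model choice and data are placeholders.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

MODEL = "castorini/afriberta_small"  # assumed example; substitute any compact encoder

# Placeholder labelled examples; a real project would use a curated corpus.
data = Dataset.from_dict({
    "text": ["example sentence one", "example sentence two"] * 8,
    "label": [0, 1] * 8,
})

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-demo", num_train_epochs=1,
                           per_device_train_batch_size=4, report_to="none"),
    train_dataset=data,
)
trainer.train()
```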
But for this to succeed, investment is crucial. African governments must step in to fund affordable computing infrastructure, create open repositories, and ensure that communities, not corporations, benefit from these efforts.
What is at stake for Africa politically and ethically?
Prof. Primus reminded us that much of the current interest in African languages comes from commercial actors, whose focus is often return on investment rather than community empowerment. If left unchecked, this could lead to digital colonization, where foreign corporations own and control African language data and resources.
To avoid this, she called for African governments to take leadership in funding and research. The gap in AI research and development spending between the EU and Africa, for example, is stark. “It is important that African governments also put money where their mouth is, because this also then safeguards our sovereignty in terms of data and AI.” At the core of all of this are three questions: Who owns the data? Who benefits? And how do communities retain control?
Interdisciplinary Collaboration
Inclusive AI cannot be built by technologists alone. It must involve linguists, cultural custodians, and community representatives. Only then can models capture things like code-switching (shifting between languages), semantic shifts (changes in meaning depending on context), honorifics (forms of respect in speech), and pragmatic contexts (how language is shaped by social realities).
This is not just about fine-tuning Western-built models, but about imagining entirely new frameworks designed from the ground up for African languages. This question was Prof. Primus’s parting reflection and call to action:
“Should we come up with different algorithms and models altogether for our languages, instead of just importing and tweaking and changing certain frameworks and certain parameters?”
My Reflections
The conversations from this session reminded me that the work of building African languages into AI is not just technical—it is deeply cultural, political, and ethical. It is about how Africa claims its space in shaping the digital future, ensuring that our voices, knowledge, and identities are not erased but preserved and strengthened.
What stood out most for me is that there is no NLP without the languages themselves. Too often, the AI conversation races ahead to models and tools, while forgetting that the foundation must be documenting, archiving, and understanding our languages through collaborative, interdisciplinary work.
I was also struck by the importance of hybrid approaches. African languages, with their tonal richness and context-driven meanings, cannot simply be fitted into imported frameworks. They require solutions that bring together the transparency of rule-based systems and the adaptability of machine learning, so that nuance is not lost in the pursuit of scalability.
Finally, the question of data sovereignty remains at the heart of it all. If African governments and institutions do not take the lead in building and governing our own digital public infrastructure, then the development of AI for our languages will once again be shaped by external non-African agendas. Making African languages visible in AI means ensuring that ownership, governance, and benefits remain firmly rooted in African communities.