Masakhane

Masakhane is a volunteer-driven Pan-African NLP research collective founded in 2019 to advance African-language AI research, with open research output spanning more than 60 African languages and active cooperation with Cohere for AI and Hugging Face.
Masakhane (a Zulu word meaning "we build together") is a volunteer-driven Pan-African natural-language-processing research collective founded in 2019 to advance African-language AI research. The collective operates as a distributed research community of approximately 1,500 researchers across more than 30 African countries, plus African-diaspora researchers worldwide, with a research mandate explicitly oriented around the under-resourcing of African languages in academic and industry NLP. Masakhane's principal research outputs include the MAFAND machine-translation benchmark and dataset (covering 21 African languages, 2022), the AfroLM and AfriBERTa multilingual African-language models, the AfricaNLP workshop series at major NLP conferences (NAACL, EACL, EMNLP), and a substantial publication record at major NLP venues. The collective has sustained research cooperation with Cohere for AI (on the Aya multilingual foundation-model program), Hugging Face, Lacuna Fund, Google Research, and peer multilingual-AI organizations. As of April 2026, Masakhane is the principal Pan-African NLP research community and a consequential contributor to multilingual AI research addressing the long-standing under-representation of African languages.

At a glance

  • Founded: 2019 by African and African-diaspora NLP researchers including Jade Abbott, Bonaventure Dossou, Kelechi Ogueji, Salomon Kabongo, and other founding contributors.
  • Status: Volunteer-driven research collective. Operates without formal corporate structure; community coordination runs through the masakhane.io website, Discord/Slack channels, and the AfricaNLP workshop series.
  • Funding: Volunteer-driven with periodic grant funding. Substantial Lacuna Fund grants for African-language dataset development. Additional grant funding from the IDRC (International Development Research Centre, Canada), Mozilla Foundation, Google Research, and other research-and-development funders.
  • CEO / Lead: No single principal leader. The collective operates with rotating coordination across senior contributors. Jade Abbott (Lelapa AI co-founder; one of the principal early Masakhane organizers) and Bonaventure Dossou (Mila researcher) have been among the more publicly visible coordinators.
  • Other notable leadership: Kelechi Ogueji, Salomon Kabongo, David Adelani, Pelonomi Moiloa (later co-founder of Lelapa AI), and other senior contributors. Many Masakhane affiliates have subsequently founded or joined commercial African-AI organizations, including Lelapa AI and other African-AI startups.
  • Open weights: Yes. Masakhane research outputs are released open-source through GitHub and Hugging Face under permissive licenses.
  • Flagship outputs: MAFAND machine-translation benchmark and dataset (21 African languages, 2022); AfroLM and AfriBERTa multilingual African-language models; the AfricaNLP workshop series at major NLP conferences (NAACL, EACL, EMNLP); participation in the Aya multilingual program with Cohere for AI; publications at NAACL, EACL, EMNLP, and the AfricaNLP workshops.

Origins

Masakhane was founded in 2019 by a group of African and African-diaspora NLP researchers concerned about the under-representation of African languages in academic and industry NLP. The founding period was anchored by Jade Abbott, Bonaventure Dossou, Kelechi Ogueji, Salomon Kabongo, and other early contributors, with the founding thesis that distributed volunteer research could produce African-language NLP output that no single institution could match given Africa's linguistic diversity: more than 2,000 languages are spoken across the continent, many lacking even basic NLP infrastructure such as tokenizers, parallel-text corpora, or pre-trained language models.

The 2019 to 2021 period built the community's research infrastructure. The Masakhane GitHub organization aggregated open-source repositories for African-language NLP research; the masakhane.io website served as the principal coordination front-end; Discord and Slack channels hosted ongoing research conversation. The November 2020 NeurIPS Black in AI workshop and the subsequent 2021 AfricaNLP workshop at EACL established the collective's conference presence.

The 2022 to 2024 period produced substantial published research. The MAFAND benchmark (a machine-translation benchmark and dataset covering 21 African languages, 2022) became the principal evaluation benchmark for African-language machine translation. AfroLM (a multilingual African-language model trained with a self-active-learning approach, 2022) and AfriBERTa (a multilingual African-language pre-trained model, 2021) provided open-weights models that subsequent academic and industry research built on. Senior Masakhane contributors including Pelonomi Moiloa and Jade Abbott co-founded Lelapa AI in 2022 as a commercial African-AI venture; Lelapa subsequently developed the InkubaLM line of small African-language models, and the Lelapa-Masakhane relationship became the principal bridge between commercial and community research in the African-AI ecosystem.

The 2023 to 2026 period has continued community research output and industry cooperation. The principal industry engagement has been the cooperation with Cohere for AI on the Aya multilingual foundation-model program, which covered 101 languages with substantial African-language coverage. Grants from the Lacuna Fund (a $50-million-plus initiative funding underrepresented-language datasets), the Mozilla Foundation, and other research-and-development funders have provided sustained dataset-development support.

Mission and strategy

Masakhane's stated mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. The collective explicitly frames the work as community-driven rather than top-down, with research direction emerging from researcher proposals rather than central coordination. The strategy combines three threads. First, distributed volunteer research across African-language NLP areas including machine translation, language modeling, named-entity recognition, sentiment analysis, and other core NLP tasks. Second, the AfricaNLP workshop series at major NLP conferences as the principal academic-community engagement platform. Third, industry cooperation with Cohere for AI, Hugging Face, Google Research, Lacuna Fund, and other organizations, connecting the collective to the broader commercial multilingual-AI ecosystem.

The competitive premise rests on three claims: that African-language AI capability requires sustained community-driven research investment that no single institution can produce alone, given Africa's linguistic diversity; that distributed volunteer coordination can produce research outputs competitive with commercial multilingual research at far lower organizational overhead; and that the volunteer-driven structure produces durable research-community engagement that formal organizations cannot match.

Models and products

  • MAFAND. Machine translation benchmark and dataset covering 21 African languages. Released 2022.
  • AfroLM. Multilingual African-language model trained with self-active-learning approach. Released 2022.
  • AfriBERTa. Multilingual African-language pre-trained model. Released 2021.
  • AfricaNLP workshop series. Annual workshop co-located with major NLP conferences (NAACL, EACL, EMNLP).
  • Cooperation with Cohere for AI on Aya. Substantial African-language coverage in the Aya multilingual-foundation-model program (101 languages).
  • Active publication record. At NAACL, EACL, EMNLP, ACL, COLING, the AfricaNLP workshops, and other NLP venues.
  • Distributed African-language NLP infrastructure. Tokenizers, parallel-text corpora, evaluation benchmarks, and related NLP infrastructure across multiple African languages.
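The infrastructure work above starts from very simple building blocks: aligned sentence pairs and a tokenizer-plus-vocabulary over them. A minimal illustrative sketch, assuming whitespace tokenization and a frequency-filtered word vocabulary (the sentence pairs below are hypothetical placeholders, not real Masakhane data):

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Build a word-level vocabulary from whitespace-tokenized sentences,
    keeping only words that appear at least min_count times."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {w for w, c in counts.items() if c >= min_count}

# A toy parallel corpus: (source, target) sentence pairs.
# Illustrative placeholders only, not an aligned Masakhane corpus.
pairs = [
    ("good morning", "e kaaro"),
    ("thank you", "e se"),
    ("good night", "o daaro"),
]
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)
```

For genuinely low-resource languages, even this step is non-trivial in practice: whitespace tokenization fails for many morphologies, which is why subword tokenizers and curated corpora are core outputs of the collective's infrastructure work.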

Distribution channels combine open-source releases through GitHub and Hugging Face, academic publication, the AfricaNLP workshop series, and informal community-coordination channels (masakhane.io, Discord, Slack).

Benchmarks and standing

Masakhane's evaluation framework focuses on the volume and quality of African-language NLP research output (publication count, citation impact, dataset adoption) and on its influence on the broader multilingual-AI research community. The MAFAND benchmark has been adopted by subsequent commercial and academic multilingual-AI evaluation programs as the principal African-language machine-translation benchmark.
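Machine-translation benchmarks like MAFAND are typically scored with metrics such as BLEU or chrF (in practice via sacreBLEU). As a sketch of the underlying idea, the function below computes clipped unigram precision with a brevity penalty; real BLEU averages clipped n-gram precisions for n = 1..4, so this is a teaching simplification, not the benchmark's actual scorer:

```python
import math
from collections import Counter

def unigram_bleu(candidate: str, reference: str) -> float:
    """Clipped unigram precision times a brevity penalty.

    Simplified BLEU for illustration: each candidate token's count is
    clipped by its count in the reference, and candidates shorter than
    the reference are penalized exponentially.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    # Counter intersection clips candidate counts by reference counts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

The clipping step is what stops a degenerate candidate that repeats one common word from scoring well, and the brevity penalty stops trivially short outputs from exploiting precision alone.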

Industry coverage has consistently characterized Masakhane as the principal Pan-African NLP research community globally, with the multi-year operating history, the substantial published research output, and the cooperation with Cohere for AI's Aya program as principal validating data points.

Leadership

As of April 2026, Masakhane operates with rotating coordination across senior contributors rather than a single principal leader:

  • Jade Abbott, founding-period coordinator. Subsequently co-founded Lelapa AI.
  • Bonaventure Dossou, founding-period coordinator. Concurrent Mila researcher.
  • Kelechi Ogueji, Salomon Kabongo, David Adelani, Pelonomi Moiloa, and other senior contributors.

The volunteer-driven structure has produced substantial researcher-contributor breadth across African and African-diaspora communities, with approximately 1,500 contributors reported as of recent community surveys.

Funding and backers

Masakhane is volunteer-driven with periodic grant funding. The most substantial support has been Lacuna Fund grants for African-language dataset development (Lacuna Fund is a $50-million-plus initiative funding underrepresented-language datasets, supported by IDRC Canada, Schmidt Futures, and other partners). Additional grant funding has come from the IDRC (International Development Research Centre, Canada), the Mozilla Foundation, Google Research, and other research-and-development funders.

Industry position

Masakhane occupies a distinctive position as the principal Pan-African NLP research community and a consequential contributor to multilingual AI research. Industry coverage has consistently grouped it among the most consequential community-driven AI research collectives of the post-2018 era, alongside EleutherAI, BigScience, LAION, and other volunteer-driven research collectives.

There are two structural risks. First, the volunteer-driven structure depends on sustained researcher engagement; the departure of key contributors to commercial AI organizations (the founding of Lelapa AI by senior Masakhane contributors is the principal example) reduces community research capacity, even as it strengthens commercial-research ties within the African-AI ecosystem. Second, the funding base is less stable than that of formal organizations, with grant-funded dataset-development cycles dependent on continued external funder commitment.

Competitive landscape

  • Cohere for AI / Aya program. Cooperation partner more than competitor. Substantial African-language coverage in Aya was anchored on Masakhane research input.
  • Lelapa AI. Senior Masakhane-contributor-founded commercial African-AI venture. Cooperation partner; commercial-and-research connection.
  • AI4Bharat, Sarvam AI, VinAI Research, AI Singapore. Multilingual-AI peer organizations focused on under-served languages.
  • Hugging Face, BigScience, LAION, EleutherAI. Volunteer-driven and community-driven research peer organizations.
  • Google DeepMind, Meta AI / FAIR, Microsoft AI. Frontier AI labs with multilingual research programs (Google's IndicNLP, Meta's NLLB / No Language Left Behind, Microsoft's multilingual evaluation efforts).
  • Lacuna Fund, Mozilla Foundation, IDRC Canada. Funding peer organizations supporting African-language and other under-resourced-language AI research.

Outlook

  • Continued African-language NLP research output through 2026 to 2027.
  • Continued cooperation with Cohere for AI, Hugging Face, Google Research, and adjacent multilingual-AI peer organizations.
  • Continued AfricaNLP workshop cadence at major NLP conferences.
  • The senior-contributor pipeline, and whether commercial AI organizations continue recruiting Masakhane-affiliated researchers.
  • Continued grant-funder relationships with Lacuna Fund, Mozilla, IDRC, and adjacent funders.

About the author
Nextomoro — AI Research Lab Intelligence. nextomoro tracks progress for AI research labs, models, and what's next.
