MASAKHANE is a research effort for NLP for African languages that is OPEN SOURCE, CONTINENT-WIDE, DISTRIBUTED and ONLINE. This GitHub repository houses the data, code, results and research for building open baseline NLP results for African languages.
Website: masakhane.io
Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. Despite the fact that 2000 of the world’s languages are African, African languages are barely represented in technology. The tragic past of colonialism has been devastating for African languages in terms of their support, preservation and integration. This has resulted in a technological space that does not understand our names, our cultures, our places, our history.
Masakhane roughly translates to “We build together” in isiZulu. Our goal is for Africans to shape and own these technological advances towards human dignity, well-being and equity, through inclusive community building, open participatory research and multidisciplinarity.
- Umuntu Ngumuntu Ngabantu - loosely translated from isiZulu, this means “a person is a person through another person” or “I am because you are”. This philosophy calls for collaboration, participation and community. It proposes relationality over individualism, for stronger social cohesion and sustainable communities. It holds that we share our successes and that one’s personhood is evaluated based on one’s contributions to the community.
- African-centricity - We centralize the narratives of Africans as a remedy to the effects of Eurocentrism on our beliefs. This way we reassert a new way of looking at information from an African perspective and shun any attempts to devalue our knowledge and stories.
- Ownership - We believe that Africans should be in charge of the NLP research process, owning, driving and participating in it, rather than being observers or data providers.
- Openness - We believe in sharing our ideas and progress openly, especially on the African continent, for Africans. We’re against research that takes African contributions or data and puts them behind a paywall that is infeasible for Africans to access.
- Multidisciplinarity - We truly believe that participation from all fields and experiences, and multidisciplinarity itself, lead to a more robust and more inclusive society.
- Everyone has valuable knowledge - We believe that each person’s individual experiences have value, and that each person is worth listening to and has something to contribute.
- Kindness - We believe that being considerate, friendly and generous within our community is the best way to support it and encourage more inclusivity.
- Responsibility - We believe that each person in the technology process has an ethical responsibility for what they produce in the world. For this reason, we actively reckon with the ethical impacts of our work.
- Data sovereignty - We believe Africans should be able to decide what data represents our communities globally, retain ultimate ownership of that data, and know how it is used.
- Reproducibility - We believe in reproducible research. As a result, we publish our code and data from our research so that others can reproduce and build upon it.
- Sustainability - We believe that sustainability is necessary for societal change: small daily efforts over a long time are what truly change the world. To that end, we aim for sustainability of our work by being fully integrated with technological stakeholders, to ensure the community continues to thrive into the future.
- For Africa: To build and facilitate a community of NLP researchers, connect and grow it, spur and share further research, and build helpful tools for applications in government, medicine, science and education, to enable language preservation and increase the global visibility and relevance of African languages.
- For NLP Research: To build data sets and tools to facilitate NLP research on African languages, and to pose new research problems to enrich the NLP research landscape.
- For the global research community: To discover best practices for distributed research, to be applied by other emerging research communities.
- Look at our submitted machine translation benchmarks here! Can't see your language? Please submit a benchmark!
- Check out our paper to be published at AfricaNLP Workshop @ ICLR 2020
- Check out papers written by our participants here
- Find out more about our current initiatives
- Look at our list of community documents
- Read our weekly meeting notes
- Follow our publication on Medium
There are many ways to contribute to MASAKHANE.
- TRAIN A MODEL - Contribute a trained model and related code for your language
- ANALYSIS - Contribute analysis of data/models for any African languages. You do not need any technical experience for this! If you're a linguist, we can pair you up with an NLP practitioner and you can help contribute analysis
- DATA - Help build or find datasets for your language
- DOCUMENTATION - Help document our discussions and progress. This is VERY much needed. Or contribute to documentation of the base "notebook" to improve the experience of others
- MENTORSHIP - Provide advice, help others tune models for their languages and datasets, or help people get started
- ADMIN - Working with so many researchers can be quite a challenge! Help out with administrative tasks
- COMPUTE - Help with infrastructure and compute! Do you have spare compute to donate? Let us know! We're always looking for more!
- BRAINSTORM - Join our weekly meetings, provide advice or ideas
- STORY-TELLING - Tell our stories to the world by doing talks about the community, contributing to our Medium publication, or engaging with media outlets
- MLOps & ML Engineering - Do you enjoy delving into the MLOps side of machine learning? Are you a software developer looking to hone your ML engineering abilities? Join us to help build tools to support our reproducibility, data gathering, and model sharing!
Want more details? Check out our current initiatives
- Join our Slack
- Request to join our Google Group - this will add you to our weekly meetings
- So we can feature you on our webpage masakhane.io, please fill in our membership form HERE
Please be patient when waiting for a response from our email address; we're very behind on our administration in the time of COVID-19.
- If you're on Slack, you'll see a number of channels which reflect our initiatives (described below). Join them and start engaging
- Every week, we have an open meeting for our members. These meetings are described on our meeting agenda, where you can learn about the format and add and vote on topics. Make sure you've joined our Google Group
- If you're not sure what value you can add, check out our growing message board to see if there are any tasks you can pick up!
Every week, more ideas and more impromptu projects emerge. Keen on any initiatives? Join our Slack and find the respective group.
Working on a Masakhane initiative that is not listed here? Please add it with a PR ❤️
Keen to help on any of these initiatives? Please see our message board
Initiative | Description | Slack Channel | Repository |
---|---|---|---|
Machine Translation Benchmarks | Continued expansion and iterations on our language benchmarks as documented on the main GitHub README | #benchmarks | HERE |
NER Datasets and Benchmarks | We're busy releasing datasets and research around NER | #ner | HERE |
Dataset Creation | We never have enough data. More is always needed. We have a number of members finding creative ways to build datasets. | #datasetcreation | |
Reproducibility | The goal is to ensure reproducibility and comparability of models and results. | #reproducibility | |
Takalani NLP | Development of Language Models for South African languages | #takalani-nlp | |
Wazobia | NMT for Yoruba, Igbo, Hausa and other Nigerian languages | #wazobia | |
Multilingual Chatbot | Developing multilingual chatbots | #multilingual-dialogue | |
Transfer Learning | Transfer Learning & Multilingual Expansion of Benchmarks | #transfer-learning | |
Evaluation of Masakhane Models | How good are the Masakhane models? How can we measure it, besides looking at BLEU scores? | #evaluation | |
Text-to-speech | Corpora and models for text to speech synthesis (TTS) from audio bibles in Ewe, Hausa, Lingala, Asante Twi, Akuapem Twi and Yoruba | #bible-speech | HERE |
See Code of Conduct