About Loba
Dholuo language data exists, but only in fragments. Academic wordlists behind paywalls. Bible translations repurposed for NLP experiments. Scattered repositories with inconsistent licensing. Proverb collections in out-of-print books. None of it is structured consistently enough to build reliable digital tools on, and almost none of it is freely available to the developers and researchers who need it most.
Loba is the harmonising layer. A single, openly licensed, community-verified corpus where existing Dholuo data finds a proper home — and where every new contribution is structured from day one to power the next generation of Dholuo NLP tools. Spell checkers. Translation engines. AI assistants that actually speak the language.
Loba is designed from the ground up to grow across languages. Every architectural decision — from the database schema to the API — supports multiple languages without restructuring anything. Adding a new language is one row in a database table. The community, the review pipeline, and the open licensing follow automatically.
What makes Loba different
The goal
A corpus large enough and clean enough to fine-tune a small language model on Dholuo. A spell checker any Luo speaker can use in their browser. A translation API that health workers, teachers, and government services can build on. Infrastructure as solid as any major world language — built by the community it serves.
This is not a research project that ends with a paper. It is a living dataset that grows with every contribution, every review, and every new language that joins the platform.
How it works
Languages in the corpus
Languages planned for future phases: Kikuyu, Kamba, Kalenjin, and others across East Africa. Each new language follows the same open, community-verified process — one INSERT into the languages table, zero architectural changes. If you represent a language community and want to start a corpus, open a discussion.
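To make the "one INSERT" idea concrete, here is a minimal sketch in Python using an in-memory SQLite database. The table layout, column names, and helper functions (open_corpus, add_language, add_entry, count_entries) are illustrative assumptions, not Loba's actual schema or API. The point it illustrates is that when every corpus entry references a row in a languages table, adding a language is a data change rather than a schema change.

```python
import sqlite3

# Hypothetical schema for illustration only; Loba's real tables may differ.
SCHEMA = """
CREATE TABLE languages (
    id   INTEGER PRIMARY KEY,
    code TEXT NOT NULL UNIQUE,   -- assumed: ISO 639-3 code such as 'luo'
    name TEXT NOT NULL
);
CREATE TABLE entries (
    id          INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES languages(id),
    text        TEXT NOT NULL,
    verified    INTEGER NOT NULL DEFAULT 0  -- 0 = awaiting community review
);
"""


def open_corpus():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_language(conn, code, name):
    """Register a language: a single INSERT, no schema change."""
    cur = conn.execute(
        "INSERT INTO languages (code, name) VALUES (?, ?)", (code, name)
    )
    return cur.lastrowid


def add_entry(conn, language_code, text):
    """Add an unverified entry for a language that is already registered."""
    row = conn.execute(
        "SELECT id FROM languages WHERE code = ?", (language_code,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown language code: {language_code}")
    cur = conn.execute(
        "INSERT INTO entries (language_id, text) VALUES (?, ?)", (row[0], text)
    )
    return cur.lastrowid


def count_entries(conn, language_code):
    """Count entries stored for one language."""
    row = conn.execute(
        "SELECT COUNT(*) FROM entries e "
        "JOIN languages l ON l.id = e.language_id WHERE l.code = ?",
        (language_code,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = open_corpus()
    add_language(conn, "luo", "Dholuo")
    add_entry(conn, "luo", "placeholder entry text")
    # A second language reuses the same tables unchanged.
    add_language(conn, "kik", "Kikuyu")
    add_entry(conn, "kik", "placeholder entry text")
    print(count_entries(conn, "luo"), count_entries(conn, "kik"))
```

In this sketch the entries table never names a specific language, so review status, licensing metadata, and any API built on top can filter by language code without per-language tables.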
Corpus stats
Ecosystem
Loba is built to exist alongside and connect with the broader African NLP ecosystem — including Masakhane, the pan-African NLP research initiative, and academic partners at universities across East Africa. Researchers are welcome to use, cite, and contribute to the corpus. Collaboration requests are open via GitHub Discussions.
Licences
Code: MIT (github.com/Omollos/loba)
Data: CC BY 4.0, free to use with attribution. Cite as: Loba Corpus, github.com/Omollos/loba
Built in Kisumu
Loba was started in Kisumu, Kenya in 2026 by Steve (Omollos) as an open-source project rooted in the East African tech and language community. Questions, ideas, dataset contributions, and collaboration requests are welcome via GitHub Discussions.