Loba — About

About Loba

Languages found, never lost.

Dholuo language data exists, but only in fragments. Academic wordlists behind paywalls. Bible translations repurposed for NLP experiments. Scattered repositories with inconsistent licensing. Proverb collections in out-of-print books. None of it is structured consistently enough to build reliable digital tools on, and almost none of it is freely available to the developers and researchers who need it most.

Loba is the harmonising layer. A single, openly licensed, community-verified corpus where existing Dholuo data finds a proper home — and where every new contribution is structured from day one to power the next generation of Dholuo NLP tools. Spell checkers. Translation engines. AI assistants that actually speak the language.

Loba is designed from the ground up to grow across languages. Every architectural decision — from the database schema to the API — supports multiple languages without restructuring anything. Adding a new language is one row in a database table. The community, the review pipeline, and the open licensing follow automatically.

What makes Loba different

🔗

Harmonised, not hoarded

Existing Dholuo datasets, wordlists, and collections are welcomed into Loba with full attribution. We absorb and structure what already exists, not just what we create ourselves.

✅

Community verified

Every entry is reviewed before joining the public corpus. A wrong translation poisons downstream AI models — quality control is not optional, it is the point.

⚙️

Structured for machines

Every entry includes source text, translation, example sentence, cultural explanation, dialect region, and part of speech. Not a word list — a training-ready dataset.

🔓

Open by design, not by accident

Code is MIT licensed. Data is CC BY 4.0. Every export format — JSONL for model training, CSV for research — is publicly accessible without an API key or a login.
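As a rough illustration of what "structured for machines" and the JSONL export could mean in practice, here is a minimal Python sketch. The field names (source_text, dialect_region, and so on) are assumptions derived from the list of fields above, not the project's actual export keys, and the values are placeholders rather than real corpus data.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Entry:
    # Hypothetical field names; the page lists the fields but not their keys.
    source_text: str
    translation: str
    example_sentence: str
    cultural_explanation: str
    dialect_region: str
    part_of_speech: str


def to_jsonl(entries):
    """Serialise entries as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)


def from_jsonl(text):
    """Parse JSON Lines back into Entry objects, skipping blank lines."""
    return [Entry(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Placeholder values only; this is not real corpus data.
sample = Entry(
    source_text="<Dholuo word or phrase>",
    translation="<English translation>",
    example_sentence="<sentence using the entry>",
    cultural_explanation="<note on usage or context>",
    dialect_region="<region>",
    part_of_speech="<noun, verb, ...>",
)

jsonl_line = to_jsonl([sample])
round_tripped = from_jsonl(jsonl_line)
```

Passing ensure_ascii=False keeps any non-ASCII characters readable in the exported file instead of escaping them, which tends to matter for language data.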

The goal

A corpus large enough and clean enough to fine-tune a small language model on Dholuo. A spell checker any Luo speaker can use in their browser. A translation API that health workers, teachers, and government services can build on. Infrastructure as solid as that of any major world language, built by the community it serves.

This is not a research project that ends with a paper. It is a living dataset that grows with every contribution, every review, and every new language that joins the platform.

How it works

Contributors submit entries

Words, phrases, proverbs, and sentences, with translations, example sentences, and cultural context. No linguistics degree required. If you speak the language, you qualify.

Community review

Every entry is verified by reviewers before joining the public corpus. Entries are typically reviewed within 24 hours.

Open forever

Approved entries join the public corpus under CC BY 4.0 — downloadable as JSONL or CSV, queryable via API, free for anyone to build on.
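The three steps above (submit, review, publish) can be sketched as a small state machine. This is only an illustration under stated assumptions: the Status values, the Submission class, and the review and export_csv helpers are hypothetical names, not Loba's actual code.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Submission:
    # Hypothetical, trimmed-down entry; a real entry carries more fields.
    source_text: str
    translation: str
    status: Status = Status.PENDING


def review(submission, approve):
    """Record a reviewer's decision; only pending submissions can be reviewed."""
    if submission.status is not Status.PENDING:
        raise ValueError("submission has already been reviewed")
    submission.status = Status.APPROVED if approve else Status.REJECTED
    return submission


def export_csv(submissions):
    """Return CSV text containing only the approved submissions."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_text", "translation"])
    for s in submissions:
        if s.status is Status.APPROVED:
            writer.writerow([s.source_text, s.translation])
    return buffer.getvalue()


# Placeholder submissions; not real corpus data.
first = Submission("<text a>", "<translation a>")
second = Submission("<text b>", "<translation b>")
third = Submission("<text c>", "<translation c>")  # stays pending

review(first, approve=True)
review(second, approve=False)
public_csv = export_csv([first, second, third])
```

Filtering at export time means rejected and still-pending submissions never reach the public files, which is the property the review step is meant to guarantee.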

Languages in the corpus

The corpus currently focuses on Dholuo. Languages planned for future phases include Kikuyu, Kamba, Kalenjin, and others across East Africa. Each new language follows the same open, community-verified process: one INSERT into the languages table, zero architectural changes. If you represent a language community and want to start a corpus, open a discussion on GitHub Discussions.
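To make the "one INSERT" claim concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only the languages table name comes from the text above; the column names, the entries table, the choice of SQLite, and the use of ISO 639-3 codes are all assumptions made for illustration.

```python
import sqlite3

# In-memory database for illustration; the real schema is not shown on this page.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE languages (
        id   INTEGER PRIMARY KEY,
        code TEXT NOT NULL UNIQUE,  -- assumed: ISO 639-3 code
        name TEXT NOT NULL
    );
    CREATE TABLE entries (
        id          INTEGER PRIMARY KEY,
        language_id INTEGER NOT NULL REFERENCES languages(id),
        source_text TEXT NOT NULL,
        translation TEXT NOT NULL
    );
    """
)


def add_language(conn, code, name):
    """Register a language with a single INSERT and return its row id."""
    cur = conn.execute(
        "INSERT INTO languages (code, name) VALUES (?, ?)", (code, name)
    )
    return cur.lastrowid


dholuo_id = add_language(conn, "luo", "Dholuo")
kikuyu_id = add_language(conn, "kik", "Kikuyu")  # a new language: one row, no schema change

# Entries for the new language reuse the existing table unchanged.
conn.execute(
    "INSERT INTO entries (language_id, source_text, translation) VALUES (?, ?, ?)",
    (kikuyu_id, "<source text>", "<translation>"),
)

language_names = [row[0] for row in conn.execute("SELECT name FROM languages ORDER BY id")]
```

Because every entry points at a row in languages rather than living in a per-language table, supporting another language needs no migration in this sketch, only data.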

Ecosystem

Loba is built to exist alongside and connect with the broader African NLP ecosystem — including Masakhane, the pan-African NLP research initiative, and academic partners at universities across East Africa. Researchers are welcome to use, cite, and contribute to the corpus. Collaboration requests are open via GitHub Discussions.

Licences

Code: MIT — github.com/Omollos/loba
Data: CC BY 4.0 — free to use with attribution. Cite as: Loba Corpus, github.com/Omollos/loba

Built in Kisumu

Loba was started in Kisumu, Kenya, in 2026 by Steve (Omollos) as an open-source project rooted in the East African tech and language community. Questions, ideas, dataset contributions, and collaboration requests are welcome via GitHub Discussions.