We publish a reproducible benchmark of Wolof voice AI.

We benchmark, evaluate, and supply data to frontier labs working on Wolof and French code-switched voice.

104 Senegalese voice samples
6 system configurations
Every system above 0.7 mean WER on Wolof speech
23 of 67 numeral tests reveal the dërëm gap

Read the benchmark → Engage Kuma →

Whisper-1 mean WER 1.049 on Wolof
Chirp 2 (wo-SN probe) silently returns unintelligible output
gpt-4o-transcribe echoes the prompt on near-silent audio
7× more Wolof numerals parsed correctly than Whisper-1
Intent recall: 43% corpus-faithful · 73% schema-restricted

Talking is faster. At high transaction volumes, typing becomes a throughput ceiling.

Wolof is spoken by 12 million people across Senegal, the Gambia, and Mauritania, almost always code-switched with French. No commercial ASR system handles it at production quality: every benchmarked system fails on numerals, code-switching, and the silent prompt-echo failure mode we documented.

Frontier labs and voice-product teams shipping Wolof and French code-switched voice come to Kuma for three things: a reproducible benchmark to measure their model against, a curated Wolof corpus tuned for production conditions, and the engineering primitives — wolof-numbers, wolof-ner, dërëm parsing — that fix the failure modes raw ASR cannot.

We constrain outputs, validate numerics, and enforce schema before results reach production systems. WER measures whether you got the words; we measure whether you got the transaction. With structured output and schema validation, the right metric is intent recall — and the gap there is 43% raw vs 73% with our ops layer.
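As a rough illustration of what enforcing a schema before results reach production can look like, here is a minimal sketch. The intent labels, field names, and the `validate` function are hypothetical and chosen for this example; they are not Kuma's actual schema or API.

```python
# Hypothetical restricted intent set; the real schema is not published on this page.
ALLOWED_INTENTS = {"payment", "balance_check", "transfer"}


def validate(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    if record.get("intent") not in ALLOWED_INTENTS:
        errors.append("intent outside the restricted schema")
    amount = record.get("amount_xof")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        errors.append("amount_xof must be a positive integer")
    if not isinstance(record.get("needs_confirmation"), bool):
        errors.append("needs_confirmation must be a boolean")
    return errors


good = validate({"intent": "payment", "amount_xof": 10000, "needs_confirmation": True})
bad = validate({"intent": "greeting", "amount_xof": "ñaar junni"})
print(good)      # no violations
print(len(bad))  # one violation per failed field
```

A record that fails any check would be held back for review instead of being written to the downstream system.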

Field-tested. We pitched two Senegalese MFIs in March 2026; both said no. Six weeks of conversation taught us the operator-led wedge wasn't the moat — the lab-grade evaluation work is.

The same number word can mean two different amounts.

Senegalese market merchants quote prices in dërëm without saying the word. "ñaar junni" can mean 2,000 CFA (direct reading) or 2,000 dërëm = 10,000 CFA (implicit-dërëm reading, Guérin 2021 §2.6). There is no universally correct default. Picking one silently is wrong in real systems.

Our parser returns both interpretations and flags the field for human confirmation. This is a core design decision, not a bug.

Ambiguous · bare commerce numeral

"ñaar junni"

amount: 10,000 XOF (implicit dërëm — confidence 0.6)
alt: 2,000 XOF (direct CFA — confidence 0.4)
needs_confirmation: true

Explicit dërëm · unambiguous

"dërëm fukk"

amount: 50 XOF (10 dërëm × 5 — confidence 1.0)

This is what the Wolof number parser does. It is open-sourced on PyPI as wolof-numbers and covers compound forms, genitive constructions, the loanword boundary, and the dërëm convention, from 1 to 1 billion. It exists because no commercial ASR resolves any of this on its own. Full treatment in the report (Failure 7).
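The two-reading behaviour can be sketched in a few lines. This is an illustration only and does not reproduce the wolof-numbers API: the `Reading`, `AmountField`, and `interpret` names are hypothetical, the numeral is assumed to be parsed to an integer already, and the 0.6 / 0.4 confidences are simply copied from the example above.

```python
from dataclasses import dataclass
from typing import Optional

DEREM_TO_XOF = 5  # 1 dërëm = 5 CFA francs, as stated on this page


@dataclass
class Reading:
    amount_xof: int
    label: str
    confidence: float


@dataclass
class AmountField:
    amount: Reading
    alt: Optional[Reading]
    needs_confirmation: bool


def interpret(value: int, explicit_derem: bool) -> AmountField:
    """Turn an already-parsed numeral into one or two CFA readings."""
    if explicit_derem:
        # The speaker said "dërëm": only one reading is possible.
        only = Reading(value * DEREM_TO_XOF, "explicit dërëm", 1.0)
        return AmountField(only, None, False)
    # Bare numeral: keep both readings and ask a human to confirm.
    implicit = Reading(value * DEREM_TO_XOF, "implicit dërëm", 0.6)
    direct = Reading(value, "direct CFA", 0.4)
    return AmountField(implicit, direct, True)


bare = interpret(2000, explicit_derem=False)   # "ñaar junni"
explicit = interpret(10, explicit_derem=True)  # "dërëm fukk"
print(bare.amount.amount_xof, bare.alt.amount_xof, bare.needs_confirmation)
print(explicit.amount.amount_xof, explicit.needs_confirmation)
```

Returning both readings in one structure keeps the ambiguity visible to the caller, which is the design choice described above.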

Sample: a single utterance, the failure, the fix.

Pulled from the 104-sample corpus. Whisper hears the number; the dërëm × 5 conversion never fires. Kuma's parser surfaces both interpretations and asks for confirmation.

Sample · payment-UTT-005 · Wolof voice

“Awa jënd na ñaar junni”

Original utterance · payment context

↓

Whisper gpt-4o-transcribe · raw output

“Awa jënd na ñaar junni.”

WER 0.0 · numeral parsed as 2,000 (×5 short) · expected 10,000 CFA

× Numeral underflow — the dërëm gap

↓

Kuma stack · processed

amount: 10,000 XOF (implicit dërëm — confidence 0.6)
alt: 2,000 XOF (direct CFA — confidence 0.4)
needs_confirmation: true

✓ Both interpretations surfaced — flagged for confirmation
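For readers who want to check the WER figure on this sample, the following is a minimal word-level WER sketch. It assumes a simple normalization (lowercasing and stripping punctuation), which may differ from the normalization used in the benchmark harness; the second hypothesis string is invented purely to show how WER can exceed 1.0.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation and curly quotes, split on whitespace."""
    drop = string.punctuation + "“”‘’"
    return text.lower().translate(str.maketrans("", "", drop)).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# The sample above: identical words, only the trailing period differs.
print(wer("Awa jënd na ñaar junni", "Awa jënd na ñaar junni."))
# Invented hypothesis with extra words: insertions can push WER above 1.0.
print(wer("ñaar junni", "naar junni merci beaucoup monsieur"))
```

A transcript can score a perfect WER and still produce the wrong amount, because WER says nothing about how the numeral is interpreted afterwards.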

See all 22 numbered failures →

What the benchmark shows

Transcription accuracy, numeral ASR rate, intent top-1 across six system configurations.

Bar chart comparing transcription accuracy, numeral ASR rate, and intent top-1 across six Wolof voice AI system configurations: Whisper-1, Gemini 2.0 Flash, Google STT Chirp 2, and three Kuma pipeline variants.

Higher is better on every bar. Kuma end-to-end leads on numeral ASR and intent; raw ASR providers cluster on transcription. Read the methodology in the report →

Specific failures, named.

Every finding in the benchmark cites the model and the failure mode by name.

“Whisper-1 echoes the prompt on near-silent Wolof audio”

When input is ambient-noise or silence, Whisper-1 reproduces its own system prompt verbatim rather than returning an empty transcript.

“Chirp 2 silently accepts wo-SN and returns unintelligible output”

Google STT Chirp 2 accepts the Wolof locale code without error, then returns incoherent output on native Wolof speech — no failure signal surfaced.

“Intent recall: 43% corpus-faithful · 73% schema-restricted”

Without a restricted schema, intent extraction drops to 43% on real Wolof utterances. Schema constraints lift that to 73% — the difference is the ops layer.

“The dërëm convention: same numeral, two amounts, same language”

A bare Wolof numeral can refer to CFA francs or to dërëm units (1 dërëm = 5 CFA). No commercial ASR resolves this disambiguation by default.

These failures translate directly into incorrect transaction amounts and broken voice workflows in production.
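One of these failures, the prompt echo, can be caught with a simple guard placed after the ASR call. The sketch below is an assumption about how such a guard might look, not the check used in the benchmark; the function name, the example prompt, and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def looks_like_prompt_echo(transcript: str, prompt: str, threshold: float = 0.9) -> bool:
    """Flag transcripts that are near-verbatim copies of the prompt sent with the audio."""
    a = " ".join(transcript.lower().split())
    b = " ".join(prompt.lower().split())
    if not a or not b:
        return False
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
    return SequenceMatcher(None, a, b).ratio() >= threshold


prompt = "Transcribe the Wolof audio."  # hypothetical prompt text
print(looks_like_prompt_echo("Transcribe the Wolof audio.", prompt))
print(looks_like_prompt_echo("Awa jënd na ñaar junni", prompt))
```

A flagged transcript would be treated as empty audio instead of being passed to intent extraction.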

How to engage.

Two ways to work with us. Prices are published.

Evaluations

We benchmark your ASR, TTS, or LLM against our Wolof + French test set. Comparative report, failure-mode analysis, production-readiness verdict.

From $15,000 · 4–6 weeks

Engage →

Datasets

Hand-curated Wolof and Bambara voice corpora. Consent-cleared, domain-tuned, delivered with the harness to evaluate them.

From $30,000 · 6–12 weeks

Engage →

Production integration of wolof-numbers, wolof-ner, and our domain dictionaries into operator voice stacks is engaged case-by-case, typically following an evaluation. Provider-agnostic; we work alongside Whisper, Gemini, Chirp, AssemblyAI, Deepgram, or your own ASR. See custom engagements →

Show, don't claim.

Our core components are open-source — use them, fork them, run them on your own data.

wolof-numbers

Apache-2.0 · PyPI

1 to 1 billion. CFA / dërëm / loanword handling. Compound forms, genitive constructions, the dërëm convention.

wolof-ner

source-available · in progress

Named-entity recognition tuned for Wolof voice transcripts: person names, place names, financial terms.

state-of-wolof-2026 harness

Apache-2.0 · GitHub

The eval harness reproducing every table in the report. 104-sample matrix, 6 system configurations, reproducible from source.
