Google and African researchers launch the WAXAL speech dataset

Voice tech may finally stop ghosting African languages now that a massive open dataset is tackling the data drought head-on.

Why voice tech kept failing locally
  • Many devices choke on African languages.
  • Over 2,000 tongues lack usable speech data.
  • Sub-Saharan users get locked out of convenience.
  • Data scarcity stayed the core blocker.
What WAXAL brings to the table
  • WAXAL rolled out as a large-scale speech dataset.
  • Named after the Wolof word for "speak".
  • Covers 21 African languages.
  • Built to unlock inclusive voice systems.
What the dataset actually contains
  • Nearly two million recordings fuel the corpus.
  • Total audio clears 11,000 hours.
  • Roughly 1,250 hours are fully transcribed.
  • Studio speech supports text-to-speech work.
Who built it together
  • Makerere University gathered language data.
  • The University of Ghana supported 13 languages.
  • Digital Umuganda led the work on five languages.
  • The African Institute for Mathematical Sciences added multilingual datasets.
Studio and quality control work
  • Media Trust helped produce clean voice recordings.
  • Loud n Clear handled professional audio capture.
  • Everyday speech is balanced with studio voices.
  • Ethical collection stayed a priority.
Data ownership and access rules
  • Contributors keep rights to their data.
  • Researchers worldwide get open access.
  • Sharing does not erase local control.
  • Collaboration stays balanced.
Why it matters long term
  • Voice tools can finally serve local users.
  • Language preservation gains a digital backup.
  • AI systems learn from real speech.
  • The dataset is live under an open license.
 
