A team at the University of Cape Town (UCT) has developed a new artificial intelligence (AI) language model trained specifically on South Africa's 11 official written languages – helping to bridge the gap that has left millions of people without access to mainstream AI tools.

The research, which will be presented this month at the Language Resources and Evaluation Conference (LREC) in Majorca, Spain, makes two interconnected contributions: MzansiText, a curated multilingual dataset covering all 11 official written languages, and MzansiLM, a language model trained from scratch on that dataset. The work was led by Henri Lombard and Dr Jan Buys from UCT's Department of Computer Science, together with Dr François Meyer and a wide team of collaborators.

The paper comes at a time when AI language tools have become part of the daily lives of millions of people around the world. But for speakers of most South African languages, this reality looks quite different. Ask any popular AI assistant a question in isiNdebele or Sepedi, and the response is likely to be poor, inconsistent, or downright wrong. Researchers say the reason is data.

“In language modeling, these languages are considered under-resourced, primarily because only a few small textual datasets are available in them for training language models,” said Dr Buys, a senior lecturer in the Department of Computer Science. “Our dataset, MzansiText, is still small compared to data available for high-resource languages such as English and major European and Asian languages, but larger than previous datasets for South African languages.”

Nine of South Africa's 11 official written languages fall into this low-resource category. Languages such as isiZulu and isiXhosa have received some attention from the global research community, but others, including isiNdebele and Sepedi, have been largely ignored. MzansiLM is believed to be the first publicly available decoder-only language model to explicitly target all 11 languages.

“There has been real progress in language modeling for African languages, including some South African ones such as isiXhosa and isiZulu,” said Dr Meyer, a lecturer in the Department of Computer Science. “But most existing models only cover a subset of languages. With MzansiLM, we wanted to create a single model focused specifically on South Africa that covers all 11 official written languages, including those that are often left out.”

From master's research to a baseline for the field

For Lombard, a master's student in computer science, the project began with a recurring question in his research.

“I came to this work through my master's research, which looked at how different language-model architectures perform for low-resource languages, as this is still a relatively under-explored area,” he explained. “One thing that stood out to me is that publicly available models cover only a subset of the South African languages we care about. The purpose of MzansiLM was to provide a small decoder-only baseline that future work can compare against and build upon.”

The model, at 125 million parameters, is modest by the standards of today's commercial AI systems. But the team's tests showed it performed competitively on specific tasks, outperforming much larger open-source models on benchmarks in several South African languages. On isiXhosa text generation, for example, it produced results that competed with encoder-decoder models more than 10 times its size.

Not a chatbot, but a foundation

It is important to be clear about what MzansiLM is and what it is not. Unlike tools such as ChatGPT or Claude, it is not designed for open-ended conversations. It is a base model – a foundation that developers and researchers can customise for specific purposes through a process called fine-tuning.

“In practice, this means developers can create tools for specific use cases; for example, summarising information in South African languages or annotating raw data,” Meyer said. “If you want users to be able to interact with the system in their home language, adapting MzansiLM for limited use cases may be both more effective and more cost-effective than relying on a proprietary large language model.”

More immediate benefits for everyday users will come from larger future versions of the model and from systems built on top of this foundation. But the research also sheds light on a broader question: why do powerful commercial AI systems still struggle with languages other than English?

“Our findings show that the model can work well when fine-tuned for specific tasks, but it is not yet able to handle general-purpose user interaction or instruction-following due to limited training data,” explained Buys. “This helps explain why even larger language models still don't work as well in languages other than English.”

An open research community is essential

The team is clear that MzansiLM is a step, not a destination. Closing the gap between the capabilities available in English and those in South African languages will require a sustained, collective effort.

“The progress we have been able to make depends on previous open research from the African natural language processing community, so it is essential to continue that openness,” Lombard said. “We still need better and more comprehensive data sources, robust benchmarks, and shared datasets, models, code and results that make it possible for others to reproduce and extend the work.”

Meyer echoed that viewpoint. “The research community plays an important role here by working openly – sharing datasets, models and findings so that others can build on them. This kind of openness often leads to progress, especially compared to proprietary systems where much of the data and methodology is inaccessible.”

The UCT team has made both MzansiText and MzansiLM publicly available. The paper, “MzansiText and MzansiLM: An open corpus and decoder-only language model for South African languages”, is available on arXiv.
