Omar Florez (right) is part of a team training the large language model Latam-GPT.Credit: CENIA
In early 2023, Álvaro Soto was looking for Las doce figuras del mundo, a short story co-written by one of his favourite authors, Jorge Luis Borges. To track down the book in which the story was originally published in 1942, Soto enlisted the help of artificial-intelligence chatbot ChatGPT. But the response did not sit right with Soto. “This is wrong,” he thought.

The book ChatGPT proposed was one that Soto knew, and he was sure that the story — which translates as The Twelve Figures of the World — was not in it. For him, this is just one example of how AI technology often fails to grasp nuance, especially that which depends on cultural context. “These models were not trained with quality data from our region,” says Soto, a computer scientist and director of Chile’s National Center for Artificial Intelligence (CENIA) in Santiago. “When they don’t find specific information, they make it up.”
Like many computer and data scientists, Soto thinks that AI will drive a technological revolution, just as the Internet did three decades ago. As AI-based products flood the global market, it’s important to him that no one in Latin America — including those who speak minority languages, such as Rapa Nui, spoken on Chile’s Easter Island — is left behind.
Roughly 7,000 languages exist worldwide, but fewer than 5% are meaningfully represented online. Languages spoken by communities in Asia, Africa and the Americas account for about 5,500 of the total. Many of these are spoken by small populations and are at risk of disappearing by the end of the century. English, Spanish and French dominate in these regions as a result of their colonial history. AI systems are built mostly on data using these widely spoken languages, especially English, for which there are plenty of tools and data needed for natural-language processing.
This imbalance can hinder AI’s global reach. “Many of us live in multicultural societies, and many of our grandparents don’t speak English or type on a computer the way we do,” says Leslie Teo, a data scientist and a director of AI products at AI Singapore, a programme launched to boost the country’s AI capabilities. “If people don’t feel the AI understands them, or if they can’t access it, they won’t benefit from it,” he says.
Developers often say that their chatbot services can converse in a wide variety of languages. But the responses often have telltale signs that their core training was in English. “To deploy AI-driven technologies in our communities, they need to speak a language and context people are comfortable with,” says Mpho Primus, a computational linguist and co-director of the Institute for Artificial Intelligent Systems at the University of Johannesburg in South Africa. For her, that means going further than simply line-by-line translation; AI should also reflect culturally specific knowledge and norms. “How I speak to my mother-in-law is a very different way and use of words than how I speak to my mother,” she says.
Across Africa, Latin America and southeast Asia, researchers are collating data sets tailored to local cultures and languages that can be used to train AI systems. And some are already using these to build models locally. But because the architectures underlying these models are often developed in the United States, it remains to be seen what degree of cultural bias might still be present. “The global push to develop AI is no longer just about computing power or algorithmic breakthroughs,” wrote Primus in Nature Africa in June1. “It’s about who gets to speak and who gets left out in the digital future.”
Fluent, but not local
Transformer architecture — the artificial neural network that underpins so much of the current AI boom, and that is represented by the ‘T’ in OpenAI’s ChatGPT — started out in 2017 as a way to improve machine translation. It feeds on massive data sets of online content and pulls out patterns in how people speak and write, forming the basis for large language models (LLMs) such as GPT, Gemini and Claude.

The Latam-GPT team spent two years collecting data from public Latin American sources.Credit: CENIA
In October 2018, Google released its first LLM based on transformer architecture: BERT, which was trained in US English using text from Wikipedia (about 2.5 billion words) and the Toronto BookCorpus (800 million words from books published online by independent authors). BERT has since been surpassed in performance and flexibility by other transformer models, such as Llama and Gemma. It was ChatGPT, however, released in November 2022, that made the world really take notice of LLMs. In July 2025, the chatbot received queries from around 700 million users weekly in around 80 languages.
One of the main sources of training data for the generative AI systems that power modern chatbots is Common Crawl: a non-profit organization that has compiled an open archive of around 300 billion web pages since 2008. But it is strongly skewed towards English content. This is partly because of safety filtering. “Common Crawl applies filters to exclude unsafe content,” explains Vukosi Marivate, a data scientist at the University of Pretoria in South Africa. “But these filters don’t always extend well to languages other than English, so a lot of non-English content gets excluded.” The result is a data imbalance that limits the representation of non-English languages in LLMs.
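The effect of English-tuned filtering can be illustrated with a toy sketch. This is not Common Crawl’s actual pipeline; the stopword lists and function names here are illustrative assumptions. A naive stopword-overlap language guesser paired with an English-only allow-list keeps English pages and silently drops Indonesian or Tagalog ones.

```python
# Toy sketch of why English-tuned filters drop non-English pages.
# The stopword lists are tiny illustrative samples, not real filter rules.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "id": {"yang", "dan", "di", "untuk", "dengan", "adalah"},   # Indonesian
    "tl": {"ang", "ng", "sa", "mga", "ay", "na"},               # Tagalog
}

def guess_language(text: str) -> str:
    """Guess the language by counting stopword overlap with each list."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def keep_page(text: str, allowed=("en",)) -> bool:
    """An English-only allow-list silently discards other languages."""
    return guess_language(text) in allowed

print(keep_page("the cat is in the garden"))                   # True
print(keep_page("berita ini ditulis dalam bahasa yang baik"))  # False
```

A filter like this, scaled up, is what produces the imbalance Marivate describes: the discarded pages never reach the training corpus.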
Even speakers of English dialects, such as African American, Indian, Irish, Jamaican, Kenyan or Singaporean English, report feeling unrepresented. In 2024, researchers at the Berkeley Artificial Intelligence Research (BAIR) Lab at the University of California, Berkeley, looked at how two versions of GPT responded to texts written in eight widely spoken English dialects, alongside Standard American and British English. “We wanted to understand if discrimination was slipping under the radar,” says computer scientist Eve Fleisig at BAIR. She and her colleagues found that the versions routinely defaulted to American spelling, even though British spelling is the default in most non-US countries. When speakers of non-standard versions of English evaluated the chatbot’s responses, they flagged stereotyping and condescending or demeaning responses2. “Language carries power,” says Genevieve Smith, a social scientist at BAIR. “Reinforcing a particular way of speaking reflects what linguists call ‘standard language ideology’ — the belief that one variety is correct. It’s ultimately a power balance issue.”
The bias of LLMs can be seen in other ways, too. “It’s not just about language, but how the AI interacts with us,” Teo says. A 2024 study by researchers at Cornell University in Ithaca, New York, explored whether generative AI models reflect specific cultural value systems3. The team asked five GPT models to answer ten questions drawn from a global survey that is commonly used when studying psychology across cultures. The responses consistently aligned with the values of English-speaking and Protestant European countries, according to the authors. However, the scientists found that ‘cultural prompting’ — asking the models to respond from the perspective of someone from a specific country — helped to reduce the bias.
At CENIA, researchers have built a ‘cultural benchmark’ test to compare how well LLMs represent the knowledge of Latin American people. “We designed questions based on verified information, structured into ‘knowledge graphs’ that serve as a ground truth,” explains Bianca Sofía del Solar, who studies engineering sciences at CENIA. The benchmark evaluates coverage (how much of the expected knowledge the model retrieves), accuracy (whether the answers match the graph) and hallucinations (whether the model invents information). The results revealed a stark gap. “All the models tested could identify the country where Buenos Aires is,” says computer scientist Andrés Carballo at CENIA, “but mostly fail in knowing what is ‘porotos con rienda’, or who Carlos Caszely is”.
In southeast Asia, researchers developed the evaluation suite Southeast Asian Holistic Evaluation of Language Models (SEA-HELM), a benchmark to assess LLMs for southeast Asian languages and culture. Researchers used SEA-HELM to compare regional models with some level of proficiency in these languages against larger ones, such as GPT-4 or the Chinese-built DeepSeek R1. They found that models fine-tuned using local data outperformed the largest global models for southeast Asian languages and that performance can improve through curated data collection3. AI Singapore built some of the models tested. “Two years ago, southeast Asian content made up about 0.5% of the data in models like Llama,” says Teo. In its latest model, which is built on Gemma, “it is closer to 40%, with the rest comprising English (40%), code and Chinese”, he says.
Technology giants such as Meta are also increasing the number of languages that their LLMs can translate and are incorporating more regional data into their training sets. But quantity does not guarantee quality. A 2023 analysis found that many of the data are scraped from websites that aren’t hosted in the countries of speakers of the target languages or include machine-translated content. This can undermine authenticity and limit the performance of models in informal or culturally specific contexts, state the paper’s authors, contradicting statements made by companies regarding high-quality multilingual translation. Researchers are also concerned that by creating the impression that machine translation has been solved, crucial resources could be taken away from the local research ecosystem.
Fixing the data gap
In December 2023, researchers at AI Singapore launched the first version of Southeast Asian Languages in One Network (SEA-LION). The model, which was trained from scratch, used data from Common Crawl that was filtered to preserve southeast Asian content and data sets contributed by local communities. Content from southeast Asian languages made up 13% of the training data. “We need data filters that can recognize languages like Thai, Indonesian or Tagalog and preserve them,” says Teo, who is the team lead for SEA-LION. “That means redoing much of the crawling and filtering pipeline.”

Vukosi Marivate co-founded the Masakhane initiative.
AI specialist Omar Florez at CENIA is leading the pretraining of a local LLM known as Latam-GPT. Rather than relying on web scraping, the team trains the LLM with high-quality sources, such as repositories of undergraduate and graduate theses, local books digitized by university libraries and even transcripts from legislative sessions of the Colombian Congress — a rich data set for understanding how arguments, negotiations and policies are constructed in the region. “Our hypothesis is that culture lives in these documents,” says Florez.
The Masakhane initiative — masakhane means “we build together” in isiZulu — is an Africa-wide community of around 1,000 participants from 30 countries, collaborating on projects focused on collecting speech, text and annotation data sets to support African-language models. “Grassroots organizations like Deep Learning Indaba and Masakhane were created to strengthen machine learning and natural-language processing in Africa, and to make sure African voices are heard,” says Marivate, who co-founded both initiatives.
Masakhane’s projects include MakerereNLP, which collects text and speech from East African university archives, newspapers, plays and radio transcripts, and MasakhaNER, a data set for named-entity recognition for ten African languages. This enables AI systems to identify key information such as names of people, places and dates in text — a task that can be especially complex in African languages, for which spelling and name patterns often differ from English norms. Other projects involving members of Masakhane are focused on collecting hundreds of hours of audio recordings of under-represented African languages to create high-quality speech data sets. These initiatives demonstrate how research involving speakers of local languages and local institutions can scale the creation of meaningful data sets.
In 2025, the African Next Voices project announced it had recorded and transcribed 9,000 hours of everyday conversations across Kenya, Nigeria and South Africa, funded by a US$2.2 million grant from the Gates Foundation in Seattle, Washington. The data set captures dialogue around farming, health and education in 18 African languages, including Kikuyu, Hausa, Yoruba, isiZulu and Tshivenda. It is also open access, enabling developers to create and train AI tools that transcribe, translate and respond in African languages.
Collecting data in under-represented languages improves model coverage and boosts representation, Marivate says. Still, “users must understand that current models won’t fully eliminate biases because dominant languages continue to shape the data”, he cautions.
Local models for global voices
High-quality regional training data sets are important, but they are only the first step. In the global south, researchers are developing models built on locally sourced data, using tools and techniques tailored to cultural and computational realities. “Above all, we need a belief in building for ourselves,” says Marivate.

Leslie Teo presents SEA-LION, which uses data filtered for southeast Asian content.
One popular approach is to build on top of existing models, such as Llama and Gemma, for which the trained parameters are publicly available, allowing developers to use and adapt them. The SEA-LION model, which was initially trained from scratch, is now built on Gemma 3. “With the first version, we proved that more southeast Asian data improved performance on regionally relevant tasks,” says Teo. “But it was computationally and data-wise expensive to do everything from scratch.” Supplementing existing models with region-specific capabilities — the original SEA-LION data set was supplemented with data from Wikipedia and high-quality sources in Indonesian, Tamil, Thai and Vietnamese — resulted in a 10–100 times reduction in cost, says Teo. This approach can, however, cause ‘forgetting’, in which new training data overwrite old information and the model loses skills such as coding or reasoning. The model aims to capture local nuances, such as recognizing that if someone who is Muslim asks for food recommendations, it should avoid suggesting pork dishes. SEA-LION now has 30,000–50,000 monthly users inputting requests such as “translate this into Tamil”, says Teo.
The Latam-GPT project is using Llama 3 as its basis. The team spent two years collecting data from public Latin American sources and through collaborations with institutions across the region. “We categorized and filtered data to create a mix that represents the traditions, facts and cultural expressions of Latin America,” explains Florez. “Collecting data from smaller countries or niche topics is expensive and time-consuming,” he says, “but this approach fills gaps and leverages the model itself to preserve and amplify local knowledge often missing from global datasets.”
The team is still training Latam-GPT, but is aiming for it to be open-source and publicly available by January 2026. “We’re not competing with the global models,” says Soto. “Our goal is to build a tool from and for Latin America.” The training involves the new Latin American data and the original LLM documents to prevent forgetting, and incorporates translations of Indigenous languages, such as Mapudungun, Náhuatl, Quechua and Aymara, to preserve and integrate knowledge. “In many ways,” says Florez, “large language models could become the most durable repositories of cultural memory, lasting longer than books or archives.”
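Mixing original training documents back into continued training, as the Latam-GPT team describes, is a standard way to mitigate forgetting, sometimes called replay. A minimal sketch of replay sampling follows; the 30% replay fraction, batch size and document names are illustrative assumptions, not the Latam-GPT recipe.

```python
import random

def replay_mix(new_data: list, original_data: list,
               replay_frac: float = 0.3, batch_size: int = 10,
               seed: int = 0) -> list:
    """Draw a training batch in which roughly `replay_frac` of the
    examples come from the original corpus, so earlier skills are
    rehearsed while the model absorbs the new regional data."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = original_data if rng.random() < replay_frac else new_data
        batch.append(rng.choice(pool))
    return batch

new = ["latam_doc_%d" % i for i in range(5)]
orig = ["pretrain_doc_%d" % i for i in range(5)]
batch = replay_mix(new, orig)
print(sum(1 for x in batch if x.startswith("pretrain")), "replayed of", len(batch))
```

Tuning the replay fraction trades off how strongly the model specializes on the new regional corpus against how much of its original capability it retains.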
What distinguishes these efforts from those of the dominant technology firms is the close collaboration with local communities. We’re “working to build a network of people who believe in the project and want to contribute”, says Florez. Technical progress and a sense of ownership are enabling regional groups to shape digital tools that will increasingly govern their information ecosystems. AI models are not simply being localized — they are being co-created, with language at the forefront.