Cohere for AI, the research arm of the language-AI company Cohere, has introduced two new large language models (LLMs), Aya Expanse 8B and 32B, as part of its ongoing project to close language divides in foundational AI datasets and models. The Aya Expanse models give researchers access to advanced AI capabilities in 23 languages, including Arabic, Chinese, French, and Hindi.
“Building on more than two years of open science research, Aya Expanse offers significant performance advances, setting a new state-of-the-art for multilingual LLMs,” the Cohere website states. “This includes a series of breakthroughs in data arbitrage, preference training for performance and safety, and model merging.”
According to a SiliconANGLE article, the two Aya Expanse models were launched with open weights on hosting sites Hugging Face and Kaggle, and they used “several new core research innovations” to achieve high performance, including “synthetic data and human feedback in late-term training.”
In a blog post, Cohere claims that Aya Expanse 32B outperforms models such as Google’s Gemma 2 27B and Meta’s Llama 3.1 70B. Among lower-parameter options, Aya Expanse 8B likewise outperformed similarly sized models such as Gemma 2 9B and Llama 3.1 8B. “The improvements in Aya Expanse are the result of a sustained focus on expanding how AI serves languages around the world by rethinking the core building blocks of machine learning breakthroughs,” the blog post states.
According to a VentureBeat article by Emilia David, the Aya initiative aims to solve a persistent problem in LLM research: models that perform well in English but poorly in other languages. “Many LLMs eventually become available in other languages, especially for widely spoken languages, but there is difficulty in finding data to train models with the different languages,” David writes. “It can also be difficult to accurately benchmark the performance of models in different languages because of the quality of translations.”
Aya, named for the Twi word for “fern,” has grown into one of the world’s largest open-source multilingual projects, featuring over 513 million data points across 101 languages, contributed by 250 language ambassadors worldwide. This collaborative approach allows Aya’s datasets to expand research opportunities in regions where non-English AI resources remain limited.