Google, Unbabel Expand Key AI Translation Benchmark to 55 Languages with WMT24++
Researchers from Google and Unbabel have unveiled WMT24++, a major expansion of the WMT24 machine translation (MT) benchmark, extending its language coverage from 9 to 55 languages and dialects.
The dataset now includes human-written reference translations and post-edits for 46 additional languages, as well as new post-edits of the references for 8 of the original 9 WMT24 languages. The benchmark covers four domains: literary, news, social, and speech.
To compile WMT24++, the researchers collected translations from professional linguists who were “fairly compensated for their work for the region in which they live.”
The researchers emphasized the importance of collecting benchmark datasets to evaluate multilingual large language models’ (LLMs) performance, particularly in MT.
“As large language models become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation,” they said.
Markus Freitag, Head of Google Translate Research, highlighted on X that WMT24++ is Google’s second major dataset release in two days, following SMOL, a professionally translated dataset for 115 very low-resource languages. “Two new datasets from Google Translate targeting high and low resource languages!” he wrote.
LLMs Outperform Traditional MT Systems
The researchers benchmarked leading MT providers and LLMs on WMT24++ using both reference-based and reference-free automatic metrics, including MetricX-24 and MetricX-24-QE, COMET-based models, and Gemini-based scoring.
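For readers unfamiliar with this style of evaluation, the sketch below shows what reference-based automatic scoring looks like in practice using a publicly released COMET checkpoint. It is only an illustration of the general workflow, not the paper's exact pipeline; the specific checkpoint, example sentences, and CPU-only settings are assumptions, and the MetricX-24 and Gemini-based scorers mentioned above use their own tooling.

```python
# Illustrative sketch: reference-based MT scoring with a public COMET checkpoint.
# This is not the WMT24++ authors' exact setup; it only demonstrates the workflow.
from comet import download_model, load_from_checkpoint

# Download and load a publicly released COMET model (assumed checkpoint name).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item pairs a source segment, a system translation, and a human reference.
# The sentences below are made-up examples, not WMT24++ data.
data = [
    {
        "src": "The dataset covers four domains: literary, news, social, and speech.",
        "mt": "Der Datensatz deckt vier Bereiche ab: Literatur, Nachrichten, Soziales und Sprache.",
        "ref": "Der Datensatz umfasst vier Domänen: Literatur, Nachrichten, Social Media und Rede.",
    }
]

# predict() returns per-segment scores and a corpus-level system score.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # segment-level quality scores
print(output.system_score)  # average score for the whole system output
```

A reference-free (quality estimation) metric such as MetricX-24-QE follows the same pattern but omits the "ref" field, which is what makes it usable for languages where trusted references are scarce.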
They found that LLMs outperformed traditional MT systems across all 55 languages. OpenAI’s o1, Google’s Gemini-1.5 Pro, and Anthropic’s Claude 3.5 ranked as the top-performing systems, surpassing conventional MT providers such as Google Translate, DeepL, and Microsoft Translator. “Frontier LLMs, like OpenAI o1, Gemini-1.5 Pro, and Claude 3.5, are highly capable MT systems in all 55 languages (according to automatic metrics), outperforming standard MT providers,” the researchers noted. They also found minimal performance differences between the top LLMs.
Need for Human Evaluation
Despite LLMs outperforming traditional MT in automatic evaluation, the researchers caution against overestimating their capabilities.
They stress that automatic metrics may undervalue human translations due to inherent biases, and their effectiveness remains largely untested in many of the 55 languages covered by WMT24++.
The researchers acknowledge that human evaluation remains crucial for assessing actual translation quality and understanding LLM limitations, and they plan to conduct a large-scale human evaluation in future work to validate these findings.
“We caution against using our results to immediately conclude that LLMs produce superhuman performance in all languages due to the limitations of automatic metrics, which may be biased against human translations and largely untested in most of the 55 languages,” they said.
In addition to the textual data, WMT24++ also preserves source images where available, providing full-page screenshots that researchers hope will support multimodal translation studies.
The dataset is publicly available on Hugging Face.
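The minimal sketch below shows how one might load the release with the Hugging Face `datasets` library. The repository ID, configuration name, and split are assumptions based on the public announcement; the dataset card on Hugging Face is the authoritative reference for the exact identifiers and schema.

```python
# Minimal sketch of loading WMT24++ from the Hugging Face Hub.
# The repo ID "google/wmt24pp" and the "en-de_DE" configuration are assumptions;
# consult the dataset card for the exact configuration names.
from datasets import load_dataset

ds = load_dataset("google/wmt24pp", "en-de_DE", split="train")

# Inspect one record; fields typically include the source segment, the
# post-edited reference, and document/domain metadata.
print(ds[0])
```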
Authors: Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, Shruti Rijhwani, Parker Riley, Elizabeth Salesky, Firas Trabelsi, Stephanie Winkler, Biao Zhang, and Markus Freitag
Source: https://slator.com/google-unbabel-expand-key-ai-translation-benchmark-55-languages-wmt24/