Machine translation (MT) models often struggle with linguistic diversity, favoring dominant dialects and leaving many language varieties underserved.
In a February 20, 2025 paper, researchers from the University of Porto, INESC TEC, Heidelberg University, University of Beira Interior, and Ci2 – Smart Cities Research Center introduced Tradutor, the first open-source AI translation model specifically tailored for European Portuguese.
Tradutor aims to fill the gap left by many translation models that focus mainly on Brazilian Portuguese, which is used by the majority of Portuguese speakers.
The researchers explained that most MT systems prioritize Brazilian Portuguese, leaving speakers from Portugal and other regions at a disadvantage. This can be particularly problematic in areas like healthcare and legal services, where accurate language use is crucial.
To address this, the researchers developed PTradutor, an extensive parallel corpus comprising over 1.7 million documents in both English and European Portuguese. This dataset spans diverse domains, including journalism, literature, web content, politics, legal documents, and social media, providing a rich linguistic foundation for training.
“We provide the community with the largest translation dataset for European Portuguese and English,” they said.
The corpus was meticulously curated through a process of collecting monolingual European Portuguese texts, translating them into English with Google Translate — due to its accessibility and relatively high quality — and implementing rigorous quality checks to maintain data integrity.
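The collect, translate, and check steps can be pictured with a minimal Python sketch. This is not the researchers' code: the article does not say which quality checks were used, so the checks here (empty output, untranslated copies, a length-ratio limit, de-duplication) are assumptions, and `SentencePair`, `keep_pair`, `build_corpus`, and the stand-in `fake_translate` lookup are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    pt: str  # European Portuguese source text
    en: str  # machine-translated English text


def keep_pair(pair: SentencePair, max_len_ratio: float = 3.0) -> bool:
    """Hypothetical quality check: reject empty, untranslated, or lopsided pairs."""
    pt, en = pair.pt.strip(), pair.en.strip()
    if not pt or not en:
        return False
    if pt.lower() == en.lower():  # text was probably copied through untranslated
        return False
    longer, shorter = max(len(pt), len(en)), min(len(pt), len(en))
    return longer / shorter <= max_len_ratio


def build_corpus(pt_texts, translate):
    """Translate each Portuguese text, then keep unique pairs that pass the check."""
    seen = set()
    corpus = []
    for pt in pt_texts:
        pair = SentencePair(pt=pt, en=translate(pt))
        key = pair.pt.strip().lower()
        if key in seen or not keep_pair(pair):
            continue
        seen.add(key)
        corpus.append(pair)
    return corpus


# Stand-in translator (a lookup table), used only so the sketch runs offline.
# A real pipeline would call a machine translation service here instead.
_FAKE_MT = {
    "O autocarro chegou atrasado.": "The bus arrived late.",
    "Bom dia.": "",  # simulates a failed translation
    "OK": "OK",      # simulates text copied through unchanged
}


def fake_translate(text: str) -> str:
    return _FAKE_MT.get(text, "")


if __name__ == "__main__":
    texts = [
        "O autocarro chegou atrasado.",
        "Bom dia.",
        "OK",
        "O autocarro chegou atrasado.",  # duplicate, dropped
    ]
    for pair in build_corpus(texts, fake_translate):
        print(pair.pt, "->", pair.en)
```

Passing the translator in as a function keeps the filtering logic independent of any particular translation service, which is why the sketch can run with a lookup table.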
Using this dataset, the researchers fine-tuned three open-source large language models (LLMs) — Google’s Gemma-2 2B, Microsoft’s Phi-3 mini, and Meta’s LLaMA-3 8B — to create an AI translation model adept at translating English into European Portuguese. The fine-tuning process involved both full model training and parameter-efficient techniques like Low-Rank Adaptation (LoRA).
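To make the difference between full fine-tuning and LoRA concrete, here is a minimal pure-Python sketch of the general LoRA idea: the original weight matrix W stays frozen, and only two small matrices A and B are trained, so the layer computes W x + (alpha / r) B (A x). The matrix sizes, rank, and alpha below are illustrative assumptions rather than the settings used for Tradutor, and the function names are hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Return (full, lora) trainable parameter counts for one d_out x d_in weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float):
    """Compute y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Illustrative sizes only: one 4096 x 4096 projection adapted with rank 8.
    full, lora = lora_param_counts(4096, 4096, 8)
    print(f"full fine-tuning: {full} trainable weights, LoRA: {lora}")

    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
    A = [[1.0, 1.0]]              # rank-1 down-projection
    B_zero = [[0.0], [0.0]]       # B starts at zero, so the output equals the base model's
    print(lora_forward(W, A, B_zero, [2.0, 3.0], alpha=2.0))
```

With B at zero the adapted layer reproduces the frozen layer exactly, which is how LoRA training typically starts. For the illustrative sizes above, the adapter trains 65,536 weights instead of 16,777,216 for that one matrix.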
Significant Accomplishment
Early tests show that Tradutor performs better than many existing open-source systems and comes close to some of the best closed-source industry models.
Specifically, the fine-tuned LLaMA-3 8B model outperformed existing open-source systems and approached industry-standard closed-source models, such as Google Translate and DeepL, in translation quality.
“Our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese,” the researchers highlighted.
They also emphasized that the goal was not necessarily to surpass commercial models but to “propose a computationally efficient, adaptable, and resource-efficient method for adapting small language models to translate specific language varieties.” Achieving results close to industry-leading models marks a “significant accomplishment,” according to the researchers.
While Tradutor was developed as a case study for European Portuguese, the researchers noted that the same methodology could be applied to other languages facing similar challenges.
By open-sourcing the PTradutor dataset, the code to replicate it, and the Tradutor model, they aim to encourage further research and development in language variety-specific MT, promoting greater linguistic inclusivity in AI-powered systems.
“We aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties,” they concluded.
Original article: https://slator.com/meet-tradutor-the-first-open-source-ai-translation-model-for-european-portuguese/
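Note: This article is taken from the Slator website and is shared for learning and discussion purposes only. If it infringes any rights, please contact the editor to have it removed.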