November 26, 2024, 07:01
The following article is from 国际翻译动态; author: 吕佩佩
Google Finds ‘Refusal to Translate’ Most Common Form of LLM Verbosity
(Image from the Slator website)
In an October 1, 2024 paper, researchers from Google identified a key challenge in evaluating large language models (LLMs) for machine translation (MT): verbosity.
The term verbosity refers to instances where LLMs offer reasoning insights behind their translation choices, provide multiple translations, or even refuse to translate certain content.
The researchers explained that, unlike traditional MT systems, which are explicitly trained and optimized to produce a single translation for a given source text, LLMs tend to take a “more conversational approach.” This behavior challenges traditional evaluation frameworks, which are designed for more structured input-output models.
After analyzing several LLMs, the researchers found that verbosity is widespread, but its degree varies across models. OpenAI’s GPT-4 and Cohere’s Aya23 were the least verbose, whereas Google’s Gemini-1.5-Pro emerged as the most verbose LLM, often providing commentary or alternative translations. Mistral AI’s Mistral-Large and Anthropic’s Claude-3.5 exhibited moderate verbosity, while Meta’s LLaMA-3-70B, Cohere’s CommandR+, and Microsoft’s Phi-3-Medium showed low levels of verbosity.
The most common form of verbosity observed was LLMs refusing to translate certain content. For instance, Claude-3.5 frequently refused to translate, while Gemini-1.5-Pro and Mistral-Large exhibited a more balanced mix of refusals and commentary, leaning slightly towards the latter.
Triggers for Verbosity
According to the researchers, LLMs typically refuse to translate when they encounter potentially harmful or copyrighted content, or when faced with non-natural language input like URLs or code snippets.
These triggers are prioritized differently across LLMs. Claude-3.5, for instance, is particularly sensitive to safety and copyright concerns, while Gemini-1.5-Pro and Mistral-Large primarily refuse to translate non-linguistic content. Additionally, some LLMs, such as Phi-3-Medium and Aya23, return empty outputs instead of verbose explanations when they refuse to translate.
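The paper does not describe a detector for these triggers; as a rough illustration of the “non-natural language input” category, the sketch below flags source segments containing URLs or code-like tokens. The regex and keyword list are assumptions of mine, not the researchers’ criteria.

```python
import re

# Hypothetical heuristic (not from the paper): flag source segments that
# resemble the non-natural-language inputs said to trigger refusals,
# such as URLs or code snippets.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
CODE_HINTS = ("{", "};", "def ", "#include", "=>", "</", "();")

def likely_refusal_trigger(segment: str) -> bool:
    """Return True if the segment looks like non-linguistic content."""
    if URL_RE.search(segment):
        return True
    # Crude check for code-like tokens; a real system would use a classifier.
    return sum(hint in segment for hint in CODE_HINTS) >= 2

print(likely_refusal_trigger("See https://example.com for details"))  # True
print(likely_refusal_trigger("The weather is lovely today."))         # False
```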
Beyond refusals, LLMs can produce verbose outputs that contextualize their translation choices, providing alternative options or additional commentary. This behavior is particularly prominent in Gemini-1.5-Pro and Mistral-Large, though it is notably absent in GPT-4 and Aya23. The researchers pointed out that “short input segments lacking sufficient context are the primary reason for this verbose behavior.”
Evaluation Challenges
One major concern raised by the researchers is that existing automatic and human evaluation frameworks do not account for verbose behaviors, often penalizing models that exhibit verbosity.
This can distort LLM performance rankings. In their analysis, models like Gemini-1.5-Pro and Claude-3.5 ranked lower when their verbose outputs were included but performed much better when verbosity was excluded from the evaluation.
“This discrepancy highlights that current […] metrics do not adequately account for the nuanced outputs, leading to potentially misleading rankings,” the researchers noted.
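To see why such metrics penalize verbosity, consider how a reference-based score reacts when commentary surrounds an otherwise correct translation. The sentences below are invented and the snippet assumes the sacrebleu package; it only illustrates the scoring distortion, it does not reproduce the paper’s evaluation.

```python
# Illustration only: invented sentences, assumes the `sacrebleu` package.
# A reference-based metric scores the whole model output, so commentary
# that surrounds an otherwise good translation drags the score down.
from sacrebleu import sentence_chrf

reference = ["Das Meeting wurde auf Montag verschoben."]

plain = "Das Meeting wurde auf Montag verschoben."
verbose = ('Here is the translation: "Das Meeting wurde auf Montag verschoben." '
           'Note: "meeting" could also be rendered as "Besprechung".')

print(f"plain output:   chrF = {sentence_chrf(plain, reference).score:.1f}")
print(f"verbose output: chrF = {sentence_chrf(verbose, reference).score:.1f}")
# The verbose hypothesis scores lower even though it contains the same
# correct translation, which is the ranking distortion the authors describe.
```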
Context-Aware Evaluation
There are two possible solutions to address this issue: either modify LLM outputs to fit standardized evaluation metrics or update evaluation frameworks to better accommodate the varied responses of LLMs.
For example, verbosity could be minimized via prompts, or the output structure could be adjusted to separate commentary from core translations.
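The paper does not publish a concrete prompt or schema; a minimal sketch of the second idea, assuming a JSON response format of my own choosing, could look like this: the model is asked to keep alternatives and caveats out of the translation field, and a small parser separates the two before scoring.

```python
import json

# Sketch of the "separate commentary from the core translation" idea.
# The prompt wording and JSON schema are illustrative, not from the paper.
PROMPT_TEMPLATE = (
    "Translate the following text from {src} to {tgt}. "
    "Respond with JSON only, in the form "
    '{{"translation": "...", "notes": "..."}}. '
    'Put alternative renderings or caveats in "notes", never in "translation".\n\n'
    "Text: {text}"
)

def build_prompt(text: str, src: str = "English", tgt: str = "German") -> str:
    return PROMPT_TEMPLATE.format(src=src, tgt=tgt, text=text)

def split_output(raw: str) -> tuple[str, str]:
    """Separate the core translation from commentary; fall back to raw text."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw.strip(), ""
    if not isinstance(data, dict):
        return raw.strip(), ""
    return data.get("translation", "").strip(), data.get("notes", "").strip()

# Example with a mocked model response:
prompt = build_prompt("The meeting was moved to Monday.")
raw = '{"translation": "Das Meeting wurde verschoben.", "notes": "Or: Die Besprechung."}'
translation, notes = split_output(raw)
print(translation)  # scored by the MT metric
print(notes)        # kept aside, not scored
```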
However, these methods do not entirely solve the problem. As the researchers pointed out, “they may not account for all verbosity-induced errors, especially refusal” and “they make no attempt to reward useful verbosity.”
The researchers argue that context-aware evaluations are necessary to accurately assess the quality of verbose outputs. Specifically, handling cases where LLMs refuse to translate poses the greatest challenge, and they recommend that future evaluation protocols and datasets account for these behaviors.
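The authors stop short of specifying such a protocol. One hypothetical shape for a context-aware evaluation loop is sketched below: each output is first classified, refusals and empty responses are tallied rather than scored as bad translations, and only genuine translations reach the quality metric. The categories and keyword heuristics are mine, not the paper’s.

```python
# Hypothetical sketch of a context-aware scoring loop: classify each output
# before scoring, so refusals and empty responses are reported separately
# instead of silently dragging down the quality score.
from collections import Counter

REFUSAL_MARKERS = ("i cannot translate", "i can't translate", "unable to translate")

def classify(output: str) -> str:
    text = output.strip().lower()
    if not text:
        return "empty"
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    return "translation"

def evaluate(outputs, references, score_fn):
    counts, scores = Counter(), []
    for out, ref in zip(outputs, references):
        label = classify(out)
        counts[label] += 1
        if label == "translation":
            scores.append(score_fn(out, ref))  # score only real translations
    avg = sum(scores) / len(scores) if scores else 0.0
    return {"quality": avg, "behavior_counts": dict(counts)}

# Example: evaluate(model_outputs, refs, lambda hyp, ref: sentence_chrf(hyp, [ref]).score)
```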
“We hope this paper raises awareness of the premises and pitfalls of evaluating LLM outputs and inspires future studies to address them directly,” the researchers concluded.