New Research from Welo Data Establishes a Multilingual Framework for Evaluating Causal Reasoning in Large Language Models
Welo Data has released a groundbreaking research paper, “A Novel Framework for Testing Causal Reasoning in LLMs: Design, Data Collection, and Evaluation,” which introduces a robust multilingual methodology for assessing the causal reasoning capabilities of large language models (LLMs). The study reveals significant gaps in existing AI models’ ability to consistently and accurately process causal relationships, particularly across languages with diverse typological and linguistic features.
Addressing Critical Gaps in AI Reasoning
Causal reasoning—the ability to understand cause-and-effect relationships—is a fundamental step toward achieving artificial general intelligence (AGI). While LLMs demonstrate proficiency in pattern recognition and statistical correlation, they are still in the early stages of mastering causal reasoning. The top-performing model answered seven out of ten questions correctly, while the average model answered only slightly more than half correctly.
“Our research highlights that LLMs frequently fail at causal reasoning tasks, even in English, and their performance declines significantly in languages such as Turkish and Arabic,” comments Dr. Abigail Thornton, co-author of the study and Research Lab Lead at Welo Data. “This underscores the need for more linguistically diverse training data to improve AI’s ability to understand causality beyond memorized correlations.”
A Multilingual and Structured Evaluation Approach
Existing causal reasoning benchmarks often fall short in both linguistic diversity and task complexity. To address this gap, the Welo Data research team developed a rigorous dataset featuring narrative-based causal reasoning prompts designed to mirror the analytical challenges faced by expert human analysts—those with advanced degrees and at least five years of professional experience. This dataset not only demands nuanced reasoning but also spans six languages: English, Spanish, Japanese, Korean, Turkish, and Standard Arabic. The study evaluated more than 20 LLMs from 10 different developers, assessing their accuracy and consistency in identifying complex causal relationships across diverse linguistic and contextual frameworks.
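The accuracy comparison described above—scoring each model's answers per language—can be sketched as a simple aggregation. This is a minimal illustration only; the data layout and field names are assumptions, not Welo Data's actual evaluation harness, and the sample numbers are made up.

```python
from collections import defaultdict

def accuracy_by_language(results):
    """Aggregate per-language accuracy from a list of graded items.

    Each item is a dict like {"language": "English", "correct": True};
    this layout is a hypothetical illustration, not the paper's format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in results:
        totals[item["language"]] += 1
        if item["correct"]:
            hits[item["language"]] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Example with made-up numbers (not the study's figures):
sample = [
    {"language": "English", "correct": True},
    {"language": "English", "correct": True},
    {"language": "Turkish", "correct": False},
    {"language": "Turkish", "correct": True},
]
print(accuracy_by_language(sample))  # {'English': 1.0, 'Turkish': 0.5}
```

The same aggregation extends naturally to a model dimension by keying on (model, language) pairs.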
“We crafted narrative documents from different perspectives of participants involved in fact-based scenarios to test whether models could identify causality reliably,” explains Dr. Fernando Migone, co-author and Vice President, Transformation, at Welo Data. “Our findings show that many models are inconsistent—even when presented with the same logical problem from different viewpoints.”
Key Findings and Implications
· Performance Disparities Across Languages: English and Spanish yielded the highest accuracy, while models struggled significantly with Turkish and Arabic, likely due to linguistic complexity and lower representation in training data.
· Inconsistencies in Model Responses: LLMs frequently provided different answers to identical causal questions, depending on how the prompts were structured.
· Challenges with Chain-of-Thought (CoT) Prompting: While some research suggests that prompting a model to ‘think through the reasoning process’ (e.g., CoT prompting) can enhance performance, Welo Data’s findings reveal mixed results, pointing to areas for future research.
These results emphasize the need for improved model training methods, particularly in multilingual and complex reasoning tasks.
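The inconsistency finding—different answers to the same causal question depending on how the prompt is framed—can be illustrated with a simple consistency check. This is a hypothetical sketch: `ask_model` stands in for any LLM call and the stub below deliberately flips its answer based on wording; neither is part of the paper's methodology.

```python
def is_consistent(ask_model, paraphrases):
    """Return True if the model gives one normalized answer across all paraphrases
    of the same underlying causal question."""
    answers = {ask_model(p).strip().lower() for p in paraphrases}
    return len(answers) == 1

# Stub model that changes its answer based on wording (illustration only)
def stub_model(prompt):
    return "the storm" if "viewpoint" in prompt else "the outage"

prompts = [
    "From the engineer's report, what caused the delay?",
    "From the manager's viewpoint, what caused the delay?",
]
print(is_consistent(stub_model, prompts))  # False
```

A consistent model would produce the same normalized answer for every framing, so a benchmark can report the fraction of question groups where `is_consistent` holds alongside raw accuracy.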
Advancing AI’s Causal Reasoning Capabilities
By establishing a new benchmark for evaluating causal reasoning, Welo Data aims to drive advancements in AI research and development and help developers elevate the performance of their multilingual AI models. The team advocates for further investment in multilingual causal reasoning datasets and refined training methodologies to bridge current gaps.
“The path to AGI requires AI systems that can reason effectively, not just predict patterns,” adds Dr. Thornton. “Our research lays the groundwork for the next stage of AI development—one that prioritizes robust, cross-linguistic reasoning capabilities.”
The full research paper is available here. For more information about Welo Data’s Model Assessment Suite and its research, visit welodata.ai.
About Welo Data
Welo Data, a division of Welocalize, stands at the forefront of the AI training data industry, delivering exceptional data quality and security. Supported by a global network of over 500,000 AI training professionals and domain experts, along with cutting-edge technological infrastructure, Welo Data fulfills the growing demand for dependable training data across diverse AI applications. Its service offerings span a variety of critical areas, including data annotation and labeling, large language model (LLM) enhancement, data collection and generation, and relevance and intent assessment. Welo Data’s technical expertise ensures that datasets are not only accurate but also culturally aligned, tackling significant AI development challenges like minimizing model bias and improving inclusivity. Its NIMO (Network Identity Management and Operations) framework guarantees the highest level of accuracy and quality in AI training data by leveraging advanced workforce assurance methods. welodata.ai
About Welocalize, Inc.
Welocalize, a leader in innovative translation and global content solutions, is ranked as one of the world’s largest language service providers. Specializing in optimizing customer engagement through localized content, the company has helped some of the world’s largest organizations achieve superior business outcomes with multilingual, global content. Central to its approach is OPAL, an AI-enabled platform integrating machine translation, large language models, and natural language processing to automate and enhance translations across over 250 languages. welocalize.com