On October 17, 2024, the OpenGPT-X Team released machine-translated versions of five well-known benchmarks in 20 European languages, enabling consistent and comparable evaluation of large language models (LLMs).

Using these benchmarks, the team evaluated 40 state-of-the-art models across the languages, providing valuable insights into their performance.

The OpenGPT-X Team highlighted the challenges of evaluating LLM performance consistently across languages. According to the researchers, "evaluating LLM performance in a consistent and meaningful way […] remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks."

They also noted the high costs and time required to create custom benchmarks for each language, which has led to a "fragmented understanding of model performance" across different languages. "Without comprehensive multilingual evaluations, comparisons between languages are often constrained," they explained, particularly for languages beyond the widely supported English, German, and French.

Machine-Translated Benchmarks as a Reliable Proxy

To tackle this, the team employed machine-translated versions of widely used datasets, aiming to assess whether such translations could provide scalable and uniform evaluation results.

Specifically, they used DeepL to translate five well-known datasets from English into 20 European languages: ARC for scientific reasoning, HellaSwag for commonsense reasoning, TruthfulQA for factual accuracy, GSM8K for mathematical reasoning and problem-solving, and MMLU for general knowledge and language understanding.

"Our goal is to determine the effectiveness of these translated benchmarks and assess whether they can substitute manually generated ones," the team stated.

Their findings suggest that machine-translated benchmarks can serve as a "reliable proxy" for human evaluation in various languages.
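The translation step described above can be reproduced in outline with DeepL's official Python client. The snippet below is a minimal sketch under stated assumptions, not the OpenGPT-X pipeline: the item layout mirrors the common ARC format, and the auth key and example item are placeholders.

```python
# Minimal sketch: machine-translating multiple-choice benchmark items with the
# official DeepL Python client (pip install deepl). Illustrative only, not the
# OpenGPT-X pipeline; the item layout follows the common ARC format.
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder auth key

def translate_item(item: dict, target_lang: str) -> dict:
    """Translate an ARC-style item (question plus answer choices) in one batched call."""
    texts = [item["question"]] + item["choices"]["text"]
    results = translator.translate_text(texts, source_lang="EN", target_lang=target_lang)
    translated = [r.text for r in results]
    return {
        "question": translated[0],
        "choices": {"text": translated[1:], "label": item["choices"]["label"]},
        "answerKey": item["answerKey"],  # the gold label itself is language-independent
    }

example = {
    "question": "Which property of a mineral can be determined just by looking at it?",
    "choices": {"text": ["luster", "mass", "weight", "hardness"], "label": ["A", "B", "C", "D"]},
    "answerKey": "A",
}
print(translate_item(example, target_lang="DE"))  # "DE" = German in DeepL's language codes
```

A production run would additionally need to preserve numeric expressions in GSM8K, handle rate limits, and batch the roughly 20 target languages, but the core call is the same.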
Top Performers and Language Trends

Using the translated datasets, along with the multilingual FLORES-200 benchmark for translation tasks, the team evaluated 40 models across 21 European languages.
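For the multiple-choice sets (ARC, HellaSwag, TruthfulQA, MMLU), such evaluations typically score each answer option by the log-likelihood the model assigns to it and take the highest-scoring option as the prediction. The sketch below illustrates that standard approach with Hugging Face Transformers; it is an illustration of the common technique, not necessarily the exact harness behind the leaderboard, and the model name and example item are placeholders.

```python
# Minimal sketch of log-likelihood scoring for a translated multiple-choice item,
# using Hugging Face Transformers. This mirrors the approach common evaluation
# harnesses take; it is not necessarily the exact setup behind the leaderboard.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens, given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(" " + choice, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # prediction for each next token
    targets = input_ids[0, 1:]
    n = cont_ids.shape[1]  # keep only the positions that generate the choice tokens
    return logprobs[-n:].gather(1, targets[-n:].unsqueeze(1)).sum().item()

# Illustrative German ARC-style item (placeholder text, not taken from the released data).
context = "Frage: Welche Eigenschaft eines Minerals lässt sich durch bloßes Hinsehen bestimmen? Antwort:"
choices = ["Glanz", "Masse", "Gewicht", "Härte"]
scores = [choice_logprob(context, c) for c in choices]
print(choices[max(range(len(scores)), key=scores.__getitem__)])  # highest-scoring choice = prediction
```

Per-language accuracy is then the fraction of items where the top-scoring choice matches the gold answer; GSM8K and FLORES-200 are typically scored instead on generated outputs, via exact match and translation-quality metrics respectively.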
They identified Meta's Llama-3.1-70B-Instruct and Google's Gemma-2-27b-Instruct as the top-performing models across multiple tasks. Llama-3.1-70B stood out in knowledge-based tasks, like answering general questions (MMLU) and solving math problems (GSM8K), as well as in commonsense reasoning (HellaSwag) and translation tasks. Meanwhile, Gemma-2-27b-Instruct excelled in scientific reasoning (ARC) and giving factually accurate answers (TruthfulQA).

Smaller models like Gemma-2-9b-Instruct, though consistent in common tasks, struggled in specialized domains. The researchers noted, "the capacity of small models might not allow for reliable performance on all languages and specialized knowledge."

Additionally, high-resource languages like English, German, and French consistently saw better results, while medium-resource languages, such as Polish and Romanian, displayed weaker performance across tasks.

The results are publicly available through the European LLM Leaderboard, a multilingual evaluation platform.

The team emphasized the broader impact of their work: "By ensuring that LLMs can perform well in languages beyond English or other high-resource languages, we contribute to a more equitable digital landscape."