{"id":35647,"date":"2024-10-31T11:45:39","date_gmt":"2024-10-31T03:45:39","guid":{"rendered":"https:\/\/linguaresources.com\/?p=35647"},"modified":"2024-10-31T11:45:39","modified_gmt":"2024-10-31T03:45:39","slug":"ai-translated-benchmarks-can-reliably-assess-llm-performance-study-finds","status":"publish","type":"post","link":"https:\/\/linguaresources.com\/?p=35647","title":{"rendered":"AI Translated Benchmarks Can Reliably Assess LLM Performance, Study Finds"},"content":{"rendered":"\n


On October 17, 2024, the OpenGPT-X<\/a> Team released<\/a> machine-translated versions of five well-known benchmarks in 20 European languages, enabling consistent and comparable evaluation of large language models<\/a> (LLMs).\u00a0<\/p>\n

Using these benchmarks, the team evaluated 40 state-of-the-art models<\/a> across the languages, providing valuable insights into their performance.<\/p>\n

The OpenGPT-X Team highlighted the challenges of evaluating LLM<\/a> performance consistently across languages. According to the researchers, \u201cevaluating LLM performance in a consistent and meaningful way [\u2026] remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks.\u201d<\/p>\n

They also noted the high costs and time required to create custom benchmarks for each language, which has led to a \u201cfragmented understanding of model performance\u201d across different languages. \u201cWithout comprehensive multilingual evaluations, comparisons between languages are often constrained,\u201d they explained, particularly for languages beyond the widely supported English, German, and French.<\/p>\n

To tackle this, the team employed machine-translated<\/a> versions of widely used datasets, aiming to assess whether such translations could provide scalable and uniform evaluation results.<\/p>\n

Machine-Translated Benchmarks as a Reliable Proxy<\/h3>\n
Specifically, they translated five well-known datasets \u2014 ARC<\/a> for scientific reasoning, HellaSwag<\/a> for commonsense reasoning, TruthfulQA<\/a> for factual accuracy, GSM8K<\/a> for mathematical reasoning and problem-solving abilities, and MMLU<\/a> for general knowledge and language understanding \u2014 from English into 20 European languages using DeepL<\/a>.<\/p>\n
\u201cOur goal is to determine the effectiveness of these translated benchmarks and assess whether they can substitute manually generated ones,\u201d the team stated.\u00a0<\/p>\n
Their findings suggest that machine-translated benchmarks can serve as a \u201creliable proxy\u201d for human evaluation in various languages.<\/p>\n

Top Performers and Language Trends\u00a0<\/h3>\n
Using the translated datasets, along with the multilingual FLORES-200 benchmark for translation tasks, the team evaluated 40 models across 21 European languages.\u00a0<\/p>\n
They identified Meta<\/a>\u2019s Llama-3.1-70B-Instruct<\/a> and Google<\/a>\u2019s Gemma-2-27b-Instruct as the top-performing models across multiple tasks. Llama-3.1-70B-Instruct stood out in knowledge-based tasks, such as answering general questions (MMLU) and solving math problems (GSM8K), as well as in commonsense reasoning (HellaSwag) and translation tasks. Meanwhile, Gemma-2-27b-Instruct excelled in scientific reasoning (ARC) and in giving factually accurate answers (TruthfulQA).<\/p>\n
Smaller models like Gemma-2-9b-Instruct, though consistent in common tasks, struggled in specialized domains. The researchers noted, \u201cthe capacity of small models might not allow for reliable performance on all languages and specialized knowledge.\u201d\u00a0<\/p>\n
Additionally, high-resource languages like English, German, and French consistently saw better results, while medium-resource languages, such as Polish and Romanian, displayed weaker performance across tasks.<\/p>\n
The results are publicly available through the European LLM Leaderboard<\/a>, a multilingual evaluation platform.\u00a0<\/p>\n
The team emphasized the broader impact of their work: \u201cBy ensuring that LLMs can perform well in languages beyond English or other high-resource languages, we contribute to a more equitable digital landscape.\u201d<\/p>\n
To encourage further research, the team has made the machine-translated datasets available<\/a> to the NLP community. \u201cWe aim to foster further research and development in multilingual LLM evaluation, driving improvements in cross-lingual NLP applications,\u201d they concluded.<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"
On October 17, 2024, the OpenGPT-X Team released machin […]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[391],"tags":[],"class_list":["post-35647","post","type-post","status-publish","format-standard","hentry","category-391"],"_links":{"self":[{"href":"https:\/\/linguaresources.com\/index.php?rest_route=\/wp\/v2\/posts\/35647","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/linguaresources.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/linguaresources.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/linguaresources.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/linguaresources.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=35647"}],"version-history":[{"count":0,"href":"https:\/\/linguaresources.com\/index.php?rest_route=\/wp\/v2\/posts\/35647\/revisions"}],"wp:attachment":[{"href":"https:\/\/linguaresources.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=35647"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/linguaresources.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=35647"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/linguaresources.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=35647"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}