{"id":35647,"date":"2024-10-31T11:45:39","date_gmt":"2024-10-31T03:45:39","guid":{"rendered":"https:\/\/linguaresources.com\/?p=35647"},"modified":"2024-10-31T11:45:39","modified_gmt":"2024-10-31T03:45:39","slug":"ai-translated-benchmarks-can-reliably-assess-llm-performance-study-finds","status":"publish","type":"post","link":"https:\/\/linguaresources.com\/?p=35647","title":{"rendered":"AI Translated Benchmarks Can Reliably Assess LLM Performance, Study Finds"},"content":{"rendered":"\n
On October 17, 2024, the OpenGPT-X<\/a> Team released<\/a> machine-translated versions of five well-known benchmarks in 20 European languages, enabling consistent and comparable evaluation of large language models<\/a> (LLMs).\u00a0<\/p>\n

Using these benchmarks, the team evaluated 40 state-of-the-art models<\/a> across the languages, providing valuable insights into their performance.<\/p>\n

The OpenGPT-X Team highlighted the challenges of evaluating LLM<\/a> performance consistently across languages. According to the researchers, \u201cevaluating LLM performance in a consistent and meaningful way [\u2026] remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks.\u201d<\/p>\n

They also noted the high costs and time required to create custom benchmarks for each language, which have led to a \u201cfragmented understanding of model performance\u201d across different languages. \u201cWithout comprehensive multilingual evaluations, comparisons between languages are often constrained,\u201d they explained, particularly for languages beyond the widely supported English, German, and French.<\/p>\n

To tackle this, the team employed machine-translated<\/a> versions of widely used datasets, aiming to assess whether such translations could provide scalable and uniform evaluation results.<\/p>\n

Machine-Translated Benchmarks as a Reliable Proxy<\/h3>\n

Specifically, they translated five well-known datasets \u2014 ARC<\/a> for scientific reasoning, HellaSwag<\/a> for commonsense reasoning, TruthfulQA<\/a> for factual accuracy, GSM8K<\/a> for mathematical reasoning and problem-solving abilities, and MMLU<\/a> for general knowledge and language understanding \u2014 from English into 20 European languages using DeepL<\/a>.<\/p>\n

\u201cOur goal is to determine the effectiveness of these translated benchmarks and assess whether they can substitute manually generated ones,\u201d the team stated.\u00a0<\/p>\n

Their findings suggest that machine-translated benchmarks can serve as a \u201creliable proxy\u201d for human evaluation in various languages.<\/p>\n