Samsung is seeking to overcome the limitations of existing benchmarks to better assess the real-world productivity of AI models in enterprise settings. Developed by Samsung Research, the new system, named TrueBench, aims to address the growing gap between theoretical AI performance and actual utility in the workplace.
As businesses around the world accelerate the adoption of large language models (LLMs) to improve operations, a challenge has emerged: how to accurately assess their effectiveness. Many existing benchmarks focus on academic or general-knowledge tests, are often limited to English, and rely on simple question-and-answer formats. This leaves companies without a reliable way to assess how AI models perform on complex, multilingual, and context-rich business tasks.
Samsung’s TrueBench, which stands for reliable real-world use assessment benchmark, was developed to fill this void. It offers a comprehensive suite of metrics for evaluating LLMs on scenarios and tasks drawn directly from real corporate environments. The benchmark is grounded in Samsung’s extensive internal enterprise use of its own AI models, ensuring the evaluation criteria reflect genuine workplace requirements.
The framework evaluates common enterprise functions such as content creation, data analysis, long-document summarisation, and translation. These are divided into 10 categories and 46 subcategories, providing a detailed view of an AI model’s productivity capabilities.
“Samsung Research brings deep expertise and competitiveness through real-world AI experiences,” said Paul (Kyungwhoon) Cheun, CTO of Samsung Electronics’ DX division and head of Samsung Research. “We hope that TrueBench will establish a productivity assessment criterion.”
To tackle the limitations of older benchmarks, TrueBench is built on a diverse set of 2,485 tests spanning 12 languages and cross-lingual scenarios. This multilingual approach is important for global companies where information flows across regions. The test material itself reflects a wide range of workplace requirements, from simple instructions of just eight characters to complex analyses of documents exceeding 20,000 characters.
Samsung recognised that in real business contexts, a user’s full intent is not always explicitly stated in the initial prompt. The benchmark is therefore designed to assess an AI model’s ability to understand and meet these implicit needs, moving beyond simple accuracy to a more nuanced measure of utility and relevance.
To achieve this, Samsung Research developed a collaborative process between human experts and AI to create its productivity scoring standards. First, human annotators establish the evaluation criteria for a given task. The AI then reviews these criteria for potential errors, internal conflicts, or unnecessary constraints that may not reflect realistic user expectations. Following the AI’s feedback, the human annotators refine the criteria. This iterative loop ensures that the final evaluation criteria are accurate and of high quality.
This cross-validated process feeds an automated rating system that scores LLM performance. By using AI to apply these refined criteria, the system minimises the subjective bias that can occur in human-only scoring, ensuring consistency and reliability across all tests. TrueBench also employs a strict scoring model: an AI model must meet every criterion associated with a test to receive a passing mark. This all-or-nothing approach to individual conditions allows for a more detailed and rigorous assessment of AI model performance across a variety of enterprise tasks.
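The all-or-nothing rule described above can be sketched in a few lines. This is an illustrative approximation only: the criterion names and data structures below are assumptions for demonstration, not Samsung's actual implementation.

```python
# Hypothetical sketch of TrueBench-style all-or-nothing scoring.
# Criterion names and data structures are illustrative assumptions.

def score_response(criteria_results: dict) -> int:
    """A response passes (1) only if every criterion is met, else it fails (0)."""
    return 1 if all(criteria_results.values()) else 0

def benchmark_score(per_test_results: list) -> float:
    """Overall score: the fraction of tests where all criteria were satisfied."""
    passed = sum(score_response(r) for r in per_test_results)
    return passed / len(per_test_results)

# Example: two tests; the second fails one criterion, so it scores 0 overall.
results = [
    {"follows_instructions": True, "correct_language": True, "format_ok": True},
    {"follows_instructions": True, "correct_language": False, "format_ok": True},
]
print(benchmark_score(results))  # 0.5
```

Under this scheme, a response that satisfies nine of ten criteria still fails the test, which is what makes the scoring stricter than partial-credit benchmarks.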
To increase transparency and promote wider adoption, Samsung is publicly releasing TrueBench data samples and a leaderboard on a global open-source platform. This allows developers, researchers, and companies to directly compare the productivity performance of up to five AI models at once. The platform provides a clear overview of how different AI models stack up against each other on real-world tasks.
At the time of writing, the leaderboard ranks the top 20 models based on their overall scores on Samsung’s benchmark.
The full published data also includes the average length of each AI-generated response, allowing performance and efficiency to be compared side by side. This is an important consideration for businesses weighing operating costs and speed.
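A quality-versus-verbosity comparison of the kind the leaderboard enables might look like the following. The model names and numbers here are invented for illustration; only the published leaderboard contains real scores and response lengths.

```python
# Illustrative sketch: weighing benchmark score against average response
# length. All model names and figures below are made up for demonstration.

models = [
    {"name": "model-a", "score": 0.82, "avg_response_chars": 1400},
    {"name": "model-b", "score": 0.80, "avg_response_chars": 600},
]

# A shorter average response at similar quality generally means lower
# token costs and faster turnaround in enterprise deployments.
for m in sorted(models, key=lambda m: m["avg_response_chars"]):
    print(f"{m['name']}: score={m['score']}, "
          f"avg length={m['avg_response_chars']} chars")
```

In this made-up example, model-b trails model-a by two points but produces responses less than half as long, a trade-off a cost-conscious buyer might accept.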
With the launch of TrueBench, Samsung aims not merely to release another tool but to change how the industry thinks about AI performance. By moving the goalposts from abstract knowledge to concrete productivity, Samsung’s benchmark could help organisations make better-informed decisions about which enterprise AI models to integrate into their workflows, bridging the gap between AI’s potential and its proven value.
AI News is powered by TechForge Media.