Tencent improves testing of creative AI models with new benchmark

Tencent has introduced a new benchmark, ArtifactsBench, which aims to fix current problems with testing creative AI models.

Have you ever asked an AI to build something like a simple web page or a chart and received something that works but offers a poor user experience? The buttons might be in the wrong place, the colors might clash, or the animations might feel clunky. It is a common problem, and it highlights a major challenge in AI development: how do you teach a machine to have good taste?

For a long time, we have been testing AI models on their ability to write functionally correct code. These tests can confirm that the code runs, but they are “blind to the visual fidelity and interactive integrity that define modern user experiences.”

This is the exact problem that ArtifactsBench was designed to solve. It is less a test and more an automated art critic for AI-generated code.

Excited to introduce #ArtifactsBench! It bridges the visual-interactive gap in code generation evaluation.

Our benchmark uses a new automated, multimodal pipeline to evaluate LLMs on 1,825 diverse tasks. An MLLM-as-Judge evaluates visual artifacts and achieves a ranking of 94.4%… pic.twitter.com/84xclcnnys

– Hunyuan (@tencenthunyuan) July 9, 2025

Getting it right, like a human should

So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalog of over 1,800 challenges, ranging from building data visualizations and web apps to creating interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
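To make the "build and run" step concrete, here is a minimal sketch of running generated code in a separate process with a time limit. It is an illustration only, not Tencent's implementation: the article does not describe the sandbox, the function name `run_generated_code` is hypothetical, and a child process with a timeout is far weaker than a real sandbox, which would also isolate the file system and network.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(source: str, timeout_s: float = 10.0) -> dict:
    """Run Python source in a child process inside a throwaway directory.

    Assumption: the artifact is a Python script. A timeout plus a temporary
    working directory is only a stand-in for a real sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "artifact.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child is killed when the timeout expires.
            return {"ok": False, "timed_out": True, "stdout": "", "stderr": ""}
        return {
            "ok": proc.returncode == 0,
            "timed_out": False,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
```

A caller would inspect `ok` and `timed_out` before moving on to the screenshot stage, so that code which crashes or hangs is scored as a failure rather than silently skipped.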

To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.

Finally, it hands over all of this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.

This MLLM judge does not just give a vague opinion. It uses a detailed, per-task checklist to score the result across 10 different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This ensures that scoring is fair, consistent, and thorough.
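The hand-off to the judge can be pictured as bundling three kinds of evidence into one request. The sketch below is a guess at the shape of that bundle, not the benchmark's actual format: the `Evidence` class, the `build_judge_parts` function, and the generic `{"type": ...}` message parts are all hypothetical and do not correspond to any particular vendor's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Everything the judge sees for one task (hypothetical structure)."""
    task_prompt: str
    generated_code: str
    screenshot_paths: list[str] = field(default_factory=list)


def build_judge_parts(evidence: Evidence, checklist: list[str]) -> list[dict]:
    """Return one text part followed by one image part per screenshot."""
    checklist_text = "\n".join(f"- {item}" for item in checklist)
    text = (
        "Task:\n" + evidence.task_prompt + "\n\n"
        "Generated code:\n" + evidence.generated_code + "\n\n"
        "Checklist:\n" + checklist_text
    )
    parts: list[dict] = [{"type": "text", "text": text}]
    # Screenshots keep their capture order so the judge can follow
    # animations and state changes over time.
    parts.extend({"type": "image", "path": p} for p in evidence.screenshot_paths)
    return parts
```

Keeping the screenshots in capture order matters here, because the judge is meant to assess dynamic behavior and not just a single frame.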
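As a rough illustration of checklist-based scoring, the sketch below combines per-metric scores into one number. The article says there are 10 metrics but names only three areas, so the example uses just those three. The 0 to 10 scale, the unweighted mean, and the `aggregate_scores` name are assumptions made for illustration; the article does not say how the benchmark combines its metrics.

```python
def aggregate_scores(scores: dict[str, float], metrics: list[str]) -> float:
    """Average per-metric scores after checking that none is missing.

    Assumptions: each score lies on a 0-10 scale and all metrics
    carry equal weight.
    """
    missing = [m for m in metrics if m not in scores]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    for name in metrics:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"score for {name!r} is outside the 0-10 range")
    return sum(scores[name] for name in metrics) / len(metrics)


# The three areas named in the article; the remaining metrics are not listed.
metrics = ["functionality", "user_experience", "aesthetics"]
judge_output = {"functionality": 8, "user_experience": 6, "aesthetics": 7}
overall = aggregate_scores(judge_output, metrics)
```

Rejecting incomplete or out-of-range judge output, instead of averaging whatever is present, is one way to keep scores comparable across tasks.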

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched with 94.4% consistency. This is a major leap from older automated benchmarks, which managed only about 69.4% consistency.

In addition, the framework’s judgments showed over 90% agreement with professional human developers.
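The article does not say how ranking consistency is calculated. One common way to compare two leaderboards is pairwise agreement: the share of model pairs that both rankings place in the same order. The sketch below shows that approach as an assumption, not as the benchmark's actual metric, and the model names are placeholders.

```python
from itertools import combinations


def pairwise_agreement(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of shared model pairs ordered the same way in both rankings."""
    pos_a = {model: i for i, model in enumerate(rank_a)}
    pos_b = {model: i for i, model in enumerate(rank_b)}
    shared = [model for model in rank_a if model in pos_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two models present in both rankings")
    # A pair agrees when both rankings put the same model ahead.
    agree = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0 for x, y in pairs
    )
    return agree / len(pairs)


benchmark_rank = ["model_a", "model_b", "model_c"]
human_rank = ["model_a", "model_c", "model_b"]
score = pairwise_agreement(benchmark_rank, human_rank)
```

In this toy example the two rankings disagree on only one of the three pairs (`model_b` versus `model_c`), so two thirds of the pairs agree.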

Tencent evaluates the creativity of top AI models with its new benchmark

When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. Top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, but the tests also unearthed a fascinating insight.

You might think that an AI specialized in writing code would be the best at these tasks. But the opposite was true. The study found that “the overall capabilities of generalist models often outweigh the capabilities of specialized models.”

A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialized siblings, Qwen-2.5-Coder (a code-specific model) and Qwen2.5-VL (a vision-specialized model).

The researchers believe this is because creating a great visual application is not just about coding or visual understanding in isolation. It requires a blend of skills.

“Robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics,” the researchers highlight as examples of vital skills. These are the kind of well-rounded, almost human-like abilities that the best generalist models are beginning to develop.

Tencent hopes its ArtifactsBench benchmark can reliably evaluate these qualities and so measure future progress in the ability of AI to create things that are not just functional but that users actually want to use.

