Data Center News

Tencent improves testing creative AI models with new benchmark

Last updated: July 9, 2025 4:37 pm
Published July 9, 2025

Tencent has introduced a new benchmark, ArtifactsBench, that aims to fix current problems with testing creative AI models.

Ever asked an AI to build something like a simple webpage or a chart and received something that works but has a poor user experience? The buttons might be in the wrong place, the colours might clash, or the animations feel clunky. It’s a common problem, and it highlights a huge challenge in the world of AI development: how do you teach a machine to have good taste?

For a long time, we’ve been testing AI models on their ability to write code that is functionally correct. These tests could confirm the code would run, but they were completely “blind to the visual fidelity and interactive integrity that define modern user experiences.”

This is the exact problem ArtifactsBench has been designed to solve. It’s less of a test and more of an automated art critic for AI-generated code.

🚀 Thrilled to introduce #ArtifactsBench! We’re bridging the visual-interactive gap in code generation evaluation.

Our benchmark uses a novel automated, multimodal pipeline to assess LLMs on 1,825 diverse tasks. An MLLM-as-Judge evaluates visual artifacts, achieving 94.4% ranking… pic.twitter.com/84xClcnNyS

— Hunyuan (@TencentHunyuan) July 9, 2025

Getting it right, like a human would

So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
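The article does not describe how the sandbox is built, so the following is only a minimal sketch of the general idea: run untrusted, generated code in a throwaway directory, in a separate process, with a hard time limit. The function name `run_isolated` and the use of Python source as the "artifact" are assumptions for illustration, and a subprocess with a timeout is not a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(source: str, timeout_s: float = 5.0) -> dict:
    """Run untrusted Python source in a temporary directory with a time limit.

    NOTE: this is a stand-in for a real sandbox (container, VM, seccomp).
    It does not restrict file system or network access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "artifact.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode (ignores env vars and user site).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stdout": "", "stderr": ""}
        return {
            "ok": proc.returncode == 0,
            "reason": "exited",
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }


if __name__ == "__main__":
    print(run_isolated("print('hello from the artifact')"))
```

A production harness would swap the subprocess for a properly isolated runtime and, for web artifacts, a headless browser; the time limit and the disposable working directory are the parts this sketch is meant to show.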


To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
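To illustrate why a sequence of screenshots carries information that a single image cannot, here is a small sketch that compares consecutive frames and reports where the screen visibly changed. This is not how ArtifactsBench evaluates screenshots (it passes them to an MLLM judge); the frame format (raw grayscale bytes), the function names, and the threshold are all assumptions.

```python
from typing import Sequence


def changed_fraction(a: bytes, b: bytes) -> float:
    """Fraction of pixel bytes that differ between two same-sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def detect_transitions(frames: Sequence[bytes], threshold: float = 0.01) -> list[int]:
    """Return each index i where frame i differs visibly from frame i - 1."""
    return [
        i
        for i in range(1, len(frames))
        if changed_fraction(frames[i - 1], frames[i]) > threshold
    ]


# Three 4x4 grayscale frames: idle, idle, then a "button pressed" state.
idle = bytes([0] * 16)
pressed = bytes([0] * 12 + [255] * 4)
print(detect_transitions([idle, idle, pressed]))  # [2]
```

An empty result for a task that asks for an animation or a click response would be a hint that nothing happened on screen, which a single static screenshot could not reveal.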

Finally, it hands over all this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM) that acts as a judge.

This MLLM judge doesn’t just give a vague opinion. Instead, it uses a detailed, per-task checklist to score the result across ten different metrics, including functionality, user experience, and even aesthetic quality. The checklist is intended to keep the scoring fair, consistent, and thorough.
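Checklist-based scoring of this kind can be sketched as follows. The article does not list all ten metrics, the score range, or how scores are combined, so the dimension names, the 0 to 10 scale, the averaging, and the `Judge` signature below are assumptions; the judge is a stub standing in for a call to a multimodal model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class ChecklistItem:
    dimension: str  # e.g. "functionality"; names here are illustrative only
    question: str   # what the judge should verify


# Hypothetical judge signature: (item, evidence) -> score in [0, 10].
Judge = Callable[[ChecklistItem, dict], float]


def score_artifact(checklist: list[ChecklistItem], evidence: dict, judge: Judge) -> dict:
    """Score every checklist item, then average per dimension and overall."""
    per_dimension: dict[str, list[float]] = {}
    for item in checklist:
        score = judge(item, evidence)
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range for {item.question!r}")
        per_dimension.setdefault(item.dimension, []).append(score)
    summary = {dim: mean(scores) for dim, scores in per_dimension.items()}
    summary["overall"] = mean(summary.values())
    return summary


checklist = [
    ChecklistItem("functionality", "Does clicking the button update the counter?"),
    ChecklistItem("functionality", "Does the page load without errors?"),
    ChecklistItem("aesthetics", "Is the colour palette consistent?"),
]
# Evidence bundle: the original request, the generated code, and the screenshots.
evidence = {"request": "...", "code": "...", "screenshots": []}


def stub_judge(item: ChecklistItem, evidence: dict) -> float:
    """Placeholder for an MLLM call; returns fixed scores per dimension."""
    return 8.0 if item.dimension == "functionality" else 6.0


print(score_artifact(checklist, evidence, stub_judge))
```

Scoring item by item against a fixed checklist, rather than asking for one overall impression, is what makes results comparable across models and repeat runs.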

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
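The article does not define how "consistency" between two leaderboards is computed. One common way to compare rankings, assumed here purely for illustration, is pairwise agreement: the share of model pairs that both rankings place in the same order. The model names below are made up.

```python
from itertools import combinations


def pairwise_agreement(rank_a: list[str], rank_b: list[str]) -> float:
    """Share of model pairs that both rankings order the same way."""
    in_b = set(rank_b)
    shared = [m for m in rank_a if m in in_b]
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two models in common")
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree / len(pairs)


benchmark = ["model_a", "model_b", "model_c", "model_d"]
human_votes = ["model_a", "model_c", "model_b", "model_d"]
print(pairwise_agreement(benchmark, human_votes))  # 5 of 6 pairs agree
```

Under this definition, a figure of 94.4% would mean that for roughly 94 of every 100 model pairs, the automated benchmark and the human votes agree on which model is better.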

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.

Tencent evaluates the creativity of top AI models with its new benchmark

When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. While top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, the tests unearthed a fascinating insight.

You might think that an AI specialised in writing code would be the best at these tasks. But the opposite was true. The research found that “the holistic capabilities of generalist models often surpass those of specialized ones.”


A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialised siblings, Qwen-2.5-coder (a code-specific model) and Qwen2.5-VL (a vision-specialised model).

The researchers believe this is because creating a great visual application isn’t just about coding or visual understanding in isolation; it requires a blend of skills.

“Robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics,” the researchers highlight as examples of vital skills. These are the kinds of well-rounded, almost human-like abilities that the best generalist models are beginning to develop.

Tencent hopes its ArtifactsBench benchmark can reliably evaluate these qualities and thus measure future progress in the ability of AI to create things that aren’t just functional but that users actually want to use.

