Saturday, 13 Dec 2025
Data Center News

Tencent improves testing creative AI models with new benchmark

Last updated: July 9, 2025 4:37 pm
Published July 9, 2025

Tencent has launched a new benchmark, ArtifactsBench, that aims to fix current problems with testing creative AI models.

Ever asked an AI to build something like a simple webpage or a chart and received something that works but has a poor user experience? The buttons might be in the wrong place, the colours might clash, or the animations might feel clunky. It’s a common problem, and it highlights a huge challenge in the world of AI development: how do you teach a machine to have good taste?

For a long time, we’ve been testing AI models on their ability to write code that’s functionally correct. These tests could confirm the code would run, but they were completely “blind to the visual fidelity and interactive integrity that define modern user experiences.”

This is the exact problem ArtifactsBench has been designed to solve. It’s less of a test and more of an automated art critic for AI-generated code.

🚀Thrilled to introduce #ArtifactsBench! We’re bridging the visual-interactive gap in code generation evaluation.

Our benchmark uses a novel automated, multimodal pipeline to assess LLMs on 1,825 diverse tasks. An MLLM-as-Judge evaluates visual artifacts, achieving 94.4% ranking… pic.twitter.com/84xClcnNyS

— Hunyuan (@TencentHunyuan) July 9, 2025

Getting it right, like a human would

So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe and sandboxed environment.
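
The article does not describe how the sandbox is built, so the following is only a minimal sketch of the general idea: run untrusted, generated code in a throwaway directory with a time limit. The function name `run_generated_code` and its return fields are hypothetical, and a temporary directory plus a timeout is not real isolation; a production harness would typically add a container or virtual machine on top.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(source: str, timeout_s: float = 5.0) -> dict:
    """Run generated Python source in a temporary directory with a time limit.

    Hypothetical helper for illustration only. It limits run time and keeps
    files in a throwaway directory, but it does NOT restrict network or
    filesystem access the way a real sandbox would.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "artifact.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs the interpreter in isolated mode (ignores user
                # site-packages and PYTHON* environment variables).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child process is killed when the timeout expires.
            return {"ok": False, "timed_out": True, "stdout": "", "stderr": ""}
        return {
            "ok": proc.returncode == 0,
            "timed_out": False,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }


if __name__ == "__main__":
    print(run_generated_code("print('hello')"))
```

The timeout matters because generated code may loop forever; treating a timeout as a failed run keeps one bad artifact from stalling the whole evaluation.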


To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.

Finally, it hands all this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.

This MLLM judge doesn’t just give a vague opinion. Instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This is intended to keep the scoring fair, consistent, and thorough.
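
To make the checklist idea concrete, here is a rough sketch of how per-item judge scores could be rolled up into per-metric and overall scores. Everything in it is an assumption for illustration: the `ChecklistItem` type, the 0 to 10 scale, the example questions, and the unweighted averaging are not taken from ArtifactsBench, and only three of the ten metrics are shown.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ChecklistItem:
    """One checklist question answered by the judge (hypothetical structure)."""
    metric: str    # e.g. "functionality", "user experience", "aesthetics"
    question: str  # the per-task check the judge was asked to score
    score: float   # judge's score, assumed here to lie in [0, 10]


def aggregate(items: list[ChecklistItem]) -> tuple[dict[str, float], float]:
    """Average item scores per metric, then average the metrics equally."""
    by_metric: dict[str, list[float]] = {}
    for item in items:
        if not 0 <= item.score <= 10:
            raise ValueError(f"score out of range: {item.score}")
        by_metric.setdefault(item.metric, []).append(item.score)
    if not by_metric:
        raise ValueError("no checklist items to aggregate")
    per_metric = {m: mean(scores) for m, scores in by_metric.items()}
    overall = mean(list(per_metric.values()))
    return per_metric, overall


if __name__ == "__main__":
    items = [
        ChecklistItem("functionality", "Does the chart render the data?", 8.0),
        ChecklistItem("functionality", "Does the reset button work?", 6.0),
        ChecklistItem("user experience", "Is the layout easy to follow?", 7.0),
        ChecklistItem("aesthetics", "Are the colours harmonious?", 5.0),
    ]
    per_metric, overall = aggregate(items)
    print(per_metric, round(overall, 2))
```

Averaging within each metric first means a metric with many checklist items does not outweigh one with few; a real benchmark might weight metrics differently.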

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
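
The article does not say how ranking consistency is computed. One common way to compare two leaderboards, shown here purely as an assumed illustration, is pairwise agreement: the fraction of model pairs that both rankings place in the same relative order. The function name and the model names in the example are hypothetical.

```python
from itertools import combinations


def pairwise_consistency(ranking_a: list[str], ranking_b: list[str]) -> float:
    """Fraction of model pairs ordered the same way in both rankings.

    Each ranking lists model names from best to worst. Only models that
    appear in both rankings are compared.
    """
    pos_a = {model: i for i, model in enumerate(ranking_a)}
    pos_b = {model: i for i, model in enumerate(ranking_b)}
    shared = [model for model in ranking_a if model in pos_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two models common to both rankings")
    agreeing = sum(
        1
        for x, y in pairs
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
    )
    return agreeing / len(pairs)


if __name__ == "__main__":
    automated = ["model-1", "model-2", "model-3", "model-4"]
    human = ["model-1", "model-3", "model-2", "model-4"]
    # Six pairs in total; only (model-2, model-3) is ordered differently.
    print(f"{pairwise_consistency(automated, human):.1%}")  # prints 83.3%
```

Under this definition, 100% means the two leaderboards agree on every head-to-head ordering, while two unrelated rankings would agree on about half of the pairs.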

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.

Tencent evaluates the creativity of top AI models with its new benchmark

When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. While top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, the tests unearthed a fascinating insight.

You might think that an AI specialised in writing code would be the best at these tasks. But the opposite was true. The research found that “the holistic capabilities of generalist models often surpass those of specialised ones.”


A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialised siblings, Qwen-2.5-coder (a code-specific model) and Qwen2.5-VL (a vision-specialised model).

The researchers believe this is because creating a great visual application isn’t just about coding or visual understanding in isolation; it requires a blend of skills.

The researchers highlight “robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics” as examples of vital skills. These are the kinds of well-rounded, almost human-like abilities that the best generalist models are beginning to develop.

Tencent hopes its ArtifactsBench benchmark can reliably evaluate these qualities and thus measure future progress in the ability of AI to create things that aren’t just functional but that users actually want to use.
