
Tencent improves testing creative AI models with new benchmark

Last updated: July 9, 2025 4:37 pm
Published July 9, 2025

Tencent has launched a new benchmark, ArtifactsBench, that aims to fix current problems with testing creative AI models.

Ever asked an AI to build something like a simple webpage or a chart and received something that works but has a poor user experience? The buttons might be in the wrong place, the colours might clash, or the animations might feel clunky. It’s a common problem, and it highlights a huge challenge in the world of AI development: how do you teach a machine to have good taste?

For a long time, we’ve been testing AI models on their ability to write code that’s functionally correct. These tests could confirm the code would run, but they were completely “blind to the visual fidelity and interactive integrity that define modern user experiences.”

This is the exact problem ArtifactsBench has been designed to solve. It’s less of a test and more of an automated art critic for AI-generated code.

🚀Thrilled to introduce #ArtifactsBench! We’re bridging the visual-interactive gap in code generation evaluation.

Our benchmark uses a novel automated, multimodal pipeline to assess LLMs on 1,825 diverse tasks. An MLLM-as-Judge evaluates visual artifacts, achieving 94.4% ranking… pic.twitter.com/84xClcnNyS

— Hunyuan (@TencentHunyuan) July 9, 2025

Getting it right, like a human would

So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe and sandboxed environment.


To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
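As a rough illustration of why a series of screenshots matters, the sketch below flags the points in a capture sequence where the screen content changed. It is not ArtifactsBench's method: the function name `changed_frames` is hypothetical, frames are stand-in byte strings, and exact hash comparison is a crude assumption (real screenshots would likely need a perceptual comparison).

```python
import hashlib
from typing import List


def changed_frames(frames: List[bytes]) -> List[int]:
    """Return indices of frames whose bytes differ from the previous frame."""
    # Hashing keeps the comparison cheap; it only detects exact byte changes.
    digests = [hashlib.sha256(frame).hexdigest() for frame in frames]
    return [i for i in range(1, len(digests)) if digests[i] != digests[i - 1]]


# Stand-in frames: the page sits idle, reacts to a click, then finishes.
frames = [b"idle", b"idle", b"clicked", b"clicked", b"done"]
print(changed_frames(frames))  # [2, 4]
```

A single screenshot would miss both transitions; the sequence makes them visible.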

Finally, it hands over all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM), to act as a judge.
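The three kinds of evidence described above can be pictured as one bundle passed to the judge. The sketch below is only an assumption about how such a bundle might be organised; the `Evidence` class, its field names, and `build_judge_request` are hypothetical and do not come from Tencent's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """Everything handed to the judge for one task (hypothetical structure)."""
    task_prompt: str      # the original request given to the AI
    generated_code: str   # the code the AI produced
    screenshots: List[bytes] = field(default_factory=list)  # frames captured over time


def build_judge_request(evidence: Evidence) -> dict:
    """Pack the three kinds of evidence into one payload for a multimodal judge."""
    if not evidence.screenshots:
        raise ValueError("at least one screenshot is needed to judge visual output")
    return {
        "prompt": evidence.task_prompt,
        "code": evidence.generated_code,
        "frame_count": len(evidence.screenshots),
        "frames": list(evidence.screenshots),
    }


request = build_judge_request(
    Evidence("Build a bar chart", "<html>...</html>", [b"frame0", b"frame1", b"frame2"])
)
print(request["frame_count"])  # 3
```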

This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
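To make the idea of checklist scoring concrete, here is a minimal sketch that combines per-metric scores into a single number. The article does not say how ArtifactsBench aggregates its ten metrics, so the plain average, the 0 to 10 scale, and the metric keys below are all assumptions for illustration.

```python
from statistics import mean
from typing import Dict


def overall_score(checklist_scores: Dict[str, float], max_per_metric: float = 10.0) -> float:
    """Average per-metric scores into one number on a 0-100 scale (assumed scheme)."""
    if not checklist_scores:
        raise ValueError("no metric scores supplied")
    for name, score in checklist_scores.items():
        if not 0 <= score <= max_per_metric:
            raise ValueError(f"score for {name!r} out of range: {score}")
    return 100.0 * mean(list(checklist_scores.values())) / max_per_metric


# Only three of the ten metrics are named in the article; values are made up.
scores = {"functionality": 8.0, "user_experience": 6.0, "aesthetic_quality": 7.0}
print(overall_score(scores))  # 70.0
```

Scoring each metric separately, rather than asking for one overall impression, is what lets a weak aesthetic result show up even when the code functions correctly.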

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with a 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
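One common way to quantify how well two leaderboards agree is to count the model pairs they order the same way. The article does not state how the 94.4% figure was calculated, so the sketch below is an assumed measure rather than Tencent's formula; `pairwise_agreement` and the model names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Fraction of shared model pairs that both rankings place in the same order."""
    in_b = set(ranking_b)
    common = [model for model in ranking_a if model in in_b]
    if len(common) < 2:
        raise ValueError("need at least two shared models to compare")
    pos_a = {model: i for i, model in enumerate(ranking_a)}
    pos_b = {model: i for i, model in enumerate(ranking_b)}
    pairs = list(combinations(common, 2))
    agree = sum(
        1 for x, y in pairs
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
    )
    return agree / len(pairs)


# Made-up rankings, best model first; only model_b and model_c swap places.
benchmark = ["model_a", "model_b", "model_c", "model_d"]
human_votes = ["model_a", "model_c", "model_b", "model_d"]
print(round(pairwise_agreement(benchmark, human_votes), 3))  # 0.833
```

Here five of the six possible pairs are ordered identically, giving roughly 83% agreement.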

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.

Tencent evaluates the creativity of top AI models with its new benchmark

When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. While top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, the tests unearthed a fascinating insight.

You might think that an AI specialised in writing code would be the best at these tasks. But the opposite was true. The research found that “the holistic capabilities of generalist models often surpass those of specialised ones.”


A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialised siblings, Qwen-2.5-coder (a code-specific model) and Qwen2.5-VL (a vision-specialised model).

The researchers believe this is because creating a great visual application isn’t just about coding or visual understanding in isolation; it requires a blend of skills.

“Robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics,” the researchers cite as examples of the vital skills. These are the kinds of well-rounded, almost human-like abilities that the best generalist models are beginning to develop.

Tencent hopes its ArtifactsBench benchmark can reliably evaluate these qualities and thus measure future progress in AI’s ability to create things that aren’t just functional but that users actually want to use.

