Thursday, 29 Jan 2026
Data Center News

Tencent improves testing creative AI models with new benchmark

Last updated: July 9, 2025 4:37 pm
Published July 9, 2025

Tencent has introduced a new benchmark, ArtifactsBench, that aims to fix current problems with testing creative AI models.

Ever asked an AI to build something like a simple webpage or a chart and received something that works but has a poor user experience? The buttons might be in the wrong place, the colours might clash, or the animations might feel clunky. It’s a common problem, and it highlights a huge challenge in the world of AI development: how do you teach a machine to have good taste?

For a long time, we’ve been testing AI models on their ability to write code that is functionally correct. These tests could confirm that the code would run, but they were completely “blind to the visual fidelity and interactive integrity that define modern user experiences.”

This is the exact problem ArtifactsBench has been designed to solve. It’s less of a test and more of an automated art critic for AI-generated code.

🚀Thrilled to introduce #ArtifactsBench! We’re bridging the visual-interactive gap in code generation evaluation.

Our benchmark uses a novel automated, multimodal pipeline to assess LLMs on 1,825 diverse tasks. An MLLM-as-Judge evaluates visual artifacts, achieving 94.4% ranking… pic.twitter.com/84xClcnNyS

— Hunyuan (@TencentHunyuan) July 9, 2025

Getting it right, like a human would

So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.


To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.

Finally, it hands over all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM), which acts as a judge.
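
The hand-off described above can be sketched in a few lines of Python. This is only an illustration, not ArtifactsBench's actual code: the Evidence class, its field names, the render callable, and the payload keys are all hypothetical, and a real pipeline would take its screenshots from the sandboxed application rather than from a placeholder function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Everything the judge sees for one task (field names are hypothetical)."""
    task_prompt: str
    generated_code: str
    screenshots: List[bytes] = field(default_factory=list)


def capture_over_time(render: Callable[[float], bytes],
                      timestamps: List[float]) -> List[bytes]:
    """Call a caller-supplied render function once per timestamp, in time order.

    `render` stands in for whatever drives the sandboxed app and returns an image.
    """
    return [render(t) for t in sorted(timestamps)]


def build_judge_request(evidence: Evidence) -> dict:
    """Bundle the prompt, the code, and the screenshots into one payload."""
    return {
        "task": evidence.task_prompt,
        "code": evidence.generated_code,
        "images": list(evidence.screenshots),
    }


def fake_render(t: float) -> bytes:
    # Placeholder: a real pipeline would return screenshot bytes here.
    return f"frame@{t:.1f}s".encode()


shots = capture_over_time(fake_render, [0.0, 2.0, 1.0])
request = build_judge_request(
    Evidence("Build a bar chart page", "<html>...</html>", shots)
)
print(len(request["images"]))
```

Capturing several frames in time order, rather than a single image, is what would let a judge notice an animation or a state change after a click.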

This MLLM judge doesn’t just give a vague opinion. Instead, it uses a detailed, per-task checklist to score the result across ten different metrics, including functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
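
To make the idea of checklist scoring concrete, here is a minimal sketch. It rests on assumptions: the article does not say how the ten metrics are scaled or combined, so the 0 to 10 scale and the plain average below are illustrative choices, and the function name score_artifact is hypothetical. Only the three metric names mentioned in the article are used.

```python
from statistics import mean
from typing import Dict


def score_artifact(checklist_scores: Dict[str, float],
                   max_score: float = 10.0) -> float:
    """Average the per-metric scores after checking each lies in [0, max_score].

    The scale and the unweighted mean are assumptions for illustration only.
    """
    if not checklist_scores:
        raise ValueError("checklist is empty")
    for name, value in checklist_scores.items():
        if not 0.0 <= value <= max_score:
            raise ValueError(f"{name} score {value} is outside 0..{max_score}")
    return mean(list(checklist_scores.values()))


# Scores a judge might return for one generated artifact (made-up values).
example = {
    "functionality": 8.0,
    "user_experience": 6.0,
    "aesthetic_quality": 7.0,
}
overall = score_artifact(example)
print(overall)
```

Validating every score before averaging keeps a malformed judge response from silently skewing the result.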

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
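
One common way to compare two leaderboards is to count how many pairs of models both rankings place in the same order. The article does not say how the consistency figure was computed, so the sketch below is an assumption offered for intuition, not Tencent's method; the function name and the model names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_consistency(ranking_a: Sequence[str],
                         ranking_b: Sequence[str]) -> float:
    """Return the fraction of item pairs that both rankings order the same way.

    Both rankings must list the same items, best first, with no repeats.
    """
    if (len(ranking_a) != len(ranking_b)
            or len(set(ranking_a)) != len(ranking_a)
            or set(ranking_a) != set(ranking_b)):
        raise ValueError("rankings must contain the same unique items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        raise ValueError("need at least two items to compare")
    agree = sum(
        1 for x, y in pairs
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
    )
    return agree / len(pairs)


# Made-up leaderboards: the two rankings disagree only on model_b vs model_c.
automated = ["model_a", "model_b", "model_c", "model_d"]
human = ["model_a", "model_c", "model_b", "model_d"]
score = pairwise_consistency(automated, human)
print(f"{score:.3f}")
```

With four models there are six pairs, so a single swapped pair leaves five of six in agreement.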

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.

Tencent evaluates the creativity of top AI models with its new benchmark

When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. While top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, the tests unearthed a fascinating insight.

You might think that an AI specialised in writing code would be the best at these tasks. But the opposite was true. The research found that “the holistic capabilities of generalist models often surpass those of specialized ones.”


A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialised siblings, Qwen-2.5-coder (a code-specific model) and Qwen2.5-VL (a vision-specialised model).

The researchers believe this is because creating a great visual application isn’t just about coding or visual understanding in isolation. It requires a blend of skills.

The researchers highlight “robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics” as examples of vital skills. These are the kinds of well-rounded, almost human-like abilities that the best generalist models are beginning to develop.

Tencent hopes its ArtifactsBench benchmark can reliably evaluate these qualities and thus measure future progress in the ability of AI to create things that aren’t just functional but that users actually want to use.
