Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Last updated: November 8, 2025 1:41 am
Published November 8, 2025

The creators of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual launch aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers, and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating agent performance in the growing field of AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. Several tasks were flagged by the community as poorly specified, or as unstable due to external service changes.

Version 2.0 addresses these issues directly. The updated suite contains 89 tasks, each subjected to multiple hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.
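To make the "clearly specified" criterion concrete, a terminal task can be thought of as an instruction, an environment, and a deterministic check. The sketch below is purely illustrative; Terminal-Bench's actual task format is defined in its own repository and may differ in every detail, and the field names here are hypothetical.

```python
# Hypothetical sketch of a terminal-benchmark task specification.
# Field names are illustrative, not Terminal-Bench's real schema.
from dataclasses import dataclass

@dataclass
class TerminalTask:
    name: str             # unique task identifier
    instruction: str      # natural-language goal handed to the agent
    container_image: str  # environment the agent is dropped into
    check_command: str    # deterministic command that exits 0 iff solved

    def is_self_contained(self) -> bool:
        # Reproducible tasks should avoid live external services,
        # which can change or disappear and destabilize the benchmark.
        external_markers = ("http://", "https://", "youtube.com")
        return not any(m in self.instruction for m in external_markers)

task = TerminalTask(
    name="compress-logs",
    instruction="Compress every .log file under /var/logs into logs.tar.gz",
    container_image="ubuntu:24.04",
    check_command="test -f /var/logs/logs.tar.gz",
)
print(task.is_self_contained())  # True: no external-service dependency
```

The point of the deterministic `check_command` is that task success is decided by the environment, not by a judge model, which is what makes multi-run results comparable.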

A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team released Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0
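The core pattern Harbor scales is the rollout: launch many isolated environments in parallel, run one agent attempt in each, and aggregate pass/fail results. The toy Python sketch below shows only that fan-out/collect shape; it is not Harbor's API (which is documented at harborframework.com), and the `rollout` stand-in rule is invented for illustration.

```python
# Illustrative fan-out/collect rollout harness. In a real system each
# rollout would launch a container, run the agent, then execute the
# task's check command; here a stand-in rule decides pass/fail.
import concurrent.futures as cf

def rollout(task_id: int) -> bool:
    # Stand-in for "agent solved this task": passes even-numbered tasks.
    return task_id % 2 == 0

def run_benchmark(n_tasks: int, workers: int = 8) -> float:
    # Fan out one attempt per task across a worker pool, then
    # collect results into a single success rate.
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(rollout, range(n_tasks)))
    return sum(results) / n_tasks

print(f"success rate: {run_benchmark(89):.3f}")  # 45 of 89 toy tasks pass
```

Scaling this pattern from threads to thousands of cloud containers (and feeding the transcripts into SFT/RL pipelines) is the infrastructure problem frameworks like Harbor aim to solve.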

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command-line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
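The clustering claim is easy to verify from the numbers above: the top five agents are separated by under seven percentage points, and none clears 50%.

```python
# Top-5 Terminal-Bench 2.0 scores as reported on the leaderboard (percent).
scores = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

# Spread between the best and worst of the top five.
spread = max(scores.values()) - min(scores.values())
print(f"spread: {spread:.1f} points")  # spread: 6.8 points

# No agent solves more than half the tasks.
print(all(v < 50.0 for v in scores.values()))  # True
```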

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
