Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Last updated: November 8, 2025 1:41 am
Published November 8, 2025

The creators of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual launch aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating agent performance across the field of AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the graphical user interface.
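
To make the evaluation loop concrete, here is a minimal sketch of how a harness like this can drive and grade a terminal task: run shell commands inside a sandboxed environment, then decide pass/fail with a verification command. This is an illustration of the general pattern, not Terminal-Bench's actual implementation; the function names and the use of `subprocess` in place of a real container runtime are assumptions.

```python
import subprocess

def run_in_sandbox(command: str, timeout: int = 30) -> str:
    """Run one shell command and capture its output, as a harness
    might do inside a task container (illustrative sketch only)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def verify_task(check_command: str) -> bool:
    """A task counts as solved when its verification command exits 0,
    e.g. a file the agent was asked to create now has the right contents."""
    return subprocess.run(check_command, shell=True).returncode == 0

output = run_in_sandbox("echo hello > /tmp/greeting.txt && cat /tmp/greeting.txt")
print(output.strip())                                   # hello
print(verify_task("grep -q hello /tmp/greeting.txt"))   # True
```

The key design point the benchmark relies on is that grading is a command exit code rather than a string match, which keeps tasks objective and reproducible.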

However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

Version 2.0 addresses these issues directly. The updated suite includes 89 tasks, each subjected to multiple hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team released Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0
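
The custom-benchmark capability above pairs an instruction with an environment image and a verification check. The article does not show Harbor's actual task format, so as a rough illustration only, a hypothetical containerized task record might bundle those pieces like this (every field name here is invented for the sketch):

```python
# Hypothetical task record -- field names are illustrative, not Harbor's API.
task = {
    "id": "compress-logs",
    "instruction": "Gzip every .log file under /var/logs and delete the originals.",
    "image": "ubuntu:24.04",   # container the agent runs in
    "timeout_seconds": 300,
    # The task passes only if this command exits 0 inside the container
    # after the agent finishes.
    "verify": "test -z \"$(find /var/logs -name '*.log')\"",
}

def is_well_formed(t: dict) -> bool:
    """Minimal sanity check before deploying a custom task."""
    required = {"id", "instruction", "image", "verify"}
    return required <= t.keys()

print(is_well_formed(task))  # True
```

Whatever the real schema looks like, the point is that a task is fully self-describing: anyone with the record and the image can reproduce the evaluation.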

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
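
The clustering claim is easy to check directly from the published numbers: the top five agents sit within a few percentage points of one another, and none crosses 50%.

```python
# Published Terminal-Bench 2.0 top-5 success rates (%), from the leaderboard above.
results = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

spread = max(results.values()) - min(results.values())
print(f"Top-5 spread: {spread:.1f} points")           # Top-5 spread: 6.8 points
print(all(score < 50 for score in results.values()))  # True: no agent solves half
```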

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

Terminal-Bench 2.0 is already being integrated into evaluation workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
