Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Last updated: November 8, 2025 1:41 am
Published November 8, 2025

The creators of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual launch aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating the performance of AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. Several tasks were flagged by the community as poorly specified or unstable due to external service changes.

Version 2.0 addresses these issues directly. The updated suite contains 89 tasks, each subjected to multiple hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.


A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%


The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
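As a rough sanity check (the leaderboard reports percentages, not raw counts), the success rates above can be converted into approximate solved-task counts over the 89-task suite:

```python
# Convert reported Terminal-Bench 2.0 success rates into approximate
# solved-task counts over the 89-task suite.
TASK_COUNT = 89

leaderboard = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

for agent, rate in leaderboard.items():
    solved = round(rate / 100 * TASK_COUNT)
    print(f"{agent}: ~{solved}/{TASK_COUNT} tasks")
```

Under this arithmetic the leading agent solves roughly 44 of 89 tasks, consistent with the observation that no agent clears half the suite.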

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined launch of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
