Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Last updated: November 8, 2025 1:41 am
Published November 8, 2025

The creators of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual launch aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers, and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating agent performance in the growing field of AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. Several tasks were flagged by the community as poorly specified, or as unstable due to external service changes.

Version 2.0 addresses these issues directly. The updated suite contains 89 tasks, each subjected to multiple hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.
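To make the "clearly specified" criterion concrete, a terminal task can be thought of as an instruction, an environment, and a deterministic check. The sketch below is purely illustrative; Terminal-Bench's actual task format is defined in its own repository and may differ in every detail, and the field names here are hypothetical.

```python
# Hypothetical sketch of a terminal-benchmark task specification.
# Field names are illustrative, not Terminal-Bench's real schema.
from dataclasses import dataclass

@dataclass
class TerminalTask:
    name: str             # unique task identifier
    instruction: str      # natural-language goal handed to the agent
    container_image: str  # environment the agent is dropped into
    check_command: str    # deterministic command that exits 0 iff solved

    def is_self_contained(self) -> bool:
        # Reproducible tasks should avoid live external services,
        # which can change or disappear and destabilize the benchmark.
        external_markers = ("http://", "https://", "youtube.com")
        return not any(m in self.instruction for m in external_markers)

task = TerminalTask(
    name="compress-logs",
    instruction="Compress every .log file under /var/logs into logs.tar.gz",
    container_image="ubuntu:24.04",
    check_command="test -f /var/logs/logs.tar.gz",
)
print(task.is_self_contained())  # True: no external-service dependency
```

The point of the deterministic `check_command` is that task success is decided by the environment, not by a judge model, which is what makes multi-run results comparable.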

A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team released Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0
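The core pattern Harbor scales is the rollout: launch many isolated environments in parallel, run one agent attempt in each, and aggregate pass/fail results. The toy Python sketch below shows only that fan-out/collect shape; it is not Harbor's API (which is documented at harborframework.com), and the `rollout` stand-in rule is invented for illustration.

```python
# Illustrative fan-out/collect rollout harness. In a real system each
# rollout would launch a container, run the agent, then execute the
# task's check command; here a stand-in rule decides pass/fail.
import concurrent.futures as cf

def rollout(task_id: int) -> bool:
    # Stand-in for "agent solved this task": passes even-numbered tasks.
    return task_id % 2 == 0

def run_benchmark(n_tasks: int, workers: int = 8) -> float:
    # Fan out one attempt per task across a worker pool, then
    # collect results into a single success rate.
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(rollout, range(n_tasks)))
    return sum(results) / n_tasks

print(f"success rate: {run_benchmark(89):.3f}")  # 45 of 89 toy tasks pass
```

Scaling this pattern from threads to thousands of cloud containers (and feeding the transcripts into SFT/RL pipelines) is the infrastructure problem frameworks like Harbor aim to solve.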

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command-line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
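The clustering claim is easy to verify from the numbers above: the top five agents are separated by under seven percentage points, and none clears 50%.

```python
# Top-5 Terminal-Bench 2.0 scores as reported on the leaderboard (percent).
scores = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

# Spread between the best and worst of the top five.
spread = max(scores.values()) - min(scores.values())
print(f"spread: {spread:.1f} points")  # spread: 6.8 points

# No agent solves more than half the tasks.
print(all(v < 50.0 for v in scores.values()))  # True
```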

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
