Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Last updated: November 8, 2025 1:41 am
Published November 8, 2025

The creators of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual launch aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, lets developers and researchers scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating the performance of AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.
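
For illustration only (the task, file names, and commands below are hypothetical and not drawn from the benchmark), a terminal-based task of this kind might ask an agent to diagnose and fix a failing test suite entirely through shell commands:

    # Hypothetical agent session: reproduce the failure, locate the cause, patch it, re-verify
    cd /app
    python -m pytest -x tests/        # observe which test fails
    grep -rn "read_config" src/       # find the function the failing test exercises
    sed -i 's/port = int(raw)/port = int(raw or 8080)/' src/config.py   # apply a candidate fix
    python -m pytest -x tests/        # confirm the suite now passes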

However, its broad scope came with inconsistencies. Several tasks were flagged by the community as poorly specified or unstable due to changes in external services.

Version 2.0 addresses these issues directly. The updated suite contains 89 tasks, each subjected to multiple hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 because of its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers such as Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment (a rough sketch of what such a task might look like follows this list)

  • Full integration with Terminal-Bench 2.0
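
Since the article does not spell out Harbor’s task format, the following is only a rough sketch under assumptions: Terminal-Bench-style tasks pair a containerized environment with a scripted check that decides success, so a custom task might bundle a Dockerfile, an instruction file, and a verification script along these lines (all file names and the layout are illustrative, not Harbor’s actual schema):

    my-task/
      Dockerfile          # environment the agent is dropped into (illustrative)
      instruction.md      # natural-language description of what the agent must do (illustrative)
      verify.sh           # run after the agent finishes; exits 0 only on success (illustrative)

    # verify.sh (illustrative contents)
    #!/bin/sh
    set -e
    test -f /app/report.csv                  # the task asks the agent to produce this file
    [ "$(wc -l < /app/report.csv)" -gt 1 ]   # and it must contain more than a header row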

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with the job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
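
The flags map onto the workflow described above (the interpretations below follow the article’s description of the submission process; confirm the exact semantics against the Harbor documentation):

    # -d terminal-bench@2.0 : dataset to run, here the Terminal-Bench 2.0 task suite
    # -m "<model>"          : the model backing the agent
    # -a "<agent>"          : the agent harness to evaluate
    # --n-attempts 5        : five runs, the minimum required for a leaderboard submission
    # --jobs-dir <path>     : output job directories, emailed to the maintainers for validation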

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined launch of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
