OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds

Last updated: March 22, 2025 6:24 pm
Published March 22, 2025

OpenAI’s voice AI models have gotten it into trouble before with actor Scarlett Johansson, but that isn’t stopping the company from continuing to advance its offerings in this category.

Today, the ChatGPT maker has unveiled three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. These models will initially be available through the company’s application programming interface (API) for third-party software developers to build their own apps. They will also be available on a custom demo site, OpenAI.fm, that individual users can access for limited testing and fun.

Moreover, the gpt-4o-mini-tts model voices can be customized from several pre-sets via text prompt to change their accents, pitch, tone and other vocal qualities, including conveying whatever emotions the user asks them to. That should go a long way toward addressing any concerns that OpenAI is deliberately imitating any particular person’s voice (the company previously denied that was the case with Johansson, but pulled down the ostensibly imitative voice option anyway). Now, it’s up to the user to decide how they want their AI voice to sound when speaking back.

In a demo with VentureBeat delivered over a video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, a user could get the same voice to sound like a cackling mad scientist or a zen, calm yoga teacher.
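
The text-prompt steering described above might look roughly like the following request to OpenAI's speech endpoint. This is a minimal sketch, not the code behind the demo site: the endpoint path, the `instructions` field and the `coral` voice name are assumptions about the public API that the article does not confirm, and `build_speech_request` and `synthesize` are hypothetical helpers.

```python
import json
import os
import urllib.request

# Assumed endpoint for OpenAI's text-to-speech API; not stated in the article.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, style, voice="coral"):
    """Build the JSON body for one speech request.

    `style` is a plain-text description of how the voice should sound
    (accent, pitch, tone, emotion). The `instructions` field and the
    default voice name are assumptions, not confirmed by the article.
    """
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": style,
    }


def synthesize(payload, out_path):
    """Send the request and write the returned audio bytes to out_path.

    Needs an OPENAI_API_KEY environment variable and network access.
    """
    request = urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


# Same voice and same text, two different deliveries.
line = "Your order has shipped and should arrive on Thursday."
mad_scientist = build_speech_request(line, "Cackling mad scientist, manic and gleeful.")
yoga_teacher = build_speech_request(line, "Calm, zen yoga teacher, slow and soothing.")
```

Only the style description differs between the two payloads, which mirrors the demo: one preset voice, steered by text alone.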

Discovering and refining new capabilities inside GPT-4o base

The models are variants of the existing GPT-4o model OpenAI launched back in May 2024, which currently powers the ChatGPT text and voice experience for many users. The company took that base model and post-trained it with additional data to make it excel at transcription and speech. The company didn’t specify when the models might come to ChatGPT.

“ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,” Harris said.

The gpt-4o-transcribe family is meant to supersede OpenAI’s two-year-old open-source Whisper speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents, and at varying speech speeds across 100+ languages.

The company posted a chart on its website showing just how much lower the gpt-4o-transcribe models’ error rates are at identifying words across 33 languages compared to Whisper, with an impressively low 2.46% in English.
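
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below computes it that way. The article does not say how the benchmark normalizes text, so the lowercasing and whitespace splitting here are assumptions, and the sample sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "tell me about my last orders"
hypothesis = "tell me about my last order"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2%}")
```

One wrong word in a six-word sentence already gives a WER near 17%, which puts a reported 2.46% in perspective: roughly one error in every 40 words.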

“These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” said Harris.

Harris told VentureBeat that the new gpt-4o-transcribe model family isn’t designed to offer “diarization,” or the capability to label and differentiate between different speakers. Instead, it’s designed primarily to receive one voice (or possibly several) as a single input channel and respond to all inputs with a single output voice in that interaction, however long it takes.

The company is also hosting a competition for the general public to find the most creative examples of using its demo voice site OpenAI.fm and share them online by tagging the @openAI account on X. The winner will receive a custom Teenage Engineering radio with the OpenAI logo, which OpenAI Head of Product, Platform Olivier Godement said is one of only three in the world.

An audio applications gold mine

The improvements make the models particularly well-suited for applications such as customer call centers, meeting note transcription, and AI-powered assistants.

Impressively, the company’s newly launched Agents SDK from last week also allows developers who have already built apps atop its text-based large language models, like the regular GPT-4o, to add fluid voice interactions with only about “nine lines of code,” according to a presenter during an OpenAI YouTube livestream announcing the new models.

For example, an e-commerce app built atop GPT-4o could now respond to turn-based user questions like “Tell me about my last orders” in speech with just seconds of tweaking the code by adding these new models.
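
The turn-based pattern in that example (speech in, existing text app in the middle, speech out) can be sketched as below. This is not the Agents SDK code shown on the livestream. Every name is hypothetical, and the two speech steps are offline stand-ins so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurnPipeline:
    """Wrap an existing text-in, text-out app with speech on both ends."""

    transcribe: Callable[[bytes], str]   # speech-to-text, e.g. gpt-4o-transcribe
    respond: Callable[[str], str]        # the existing text app, unchanged
    synthesize: Callable[[str], bytes]   # text-to-speech, e.g. gpt-4o-mini-tts

    def run_turn(self, audio_in: bytes) -> bytes:
        """Handle one user turn: audio in, audio out."""
        question = self.transcribe(audio_in)
        answer = self.respond(question)
        return self.synthesize(answer)


# Stand-ins so the sketch runs offline; a real app would call the
# speech models here instead.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def order_bot(question: str) -> str:
    if "last orders" in question.lower():
        return "Your last order was a pair of headphones."
    return "Sorry, I can only help with orders."


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoiceTurnPipeline(fake_transcribe, order_bot, fake_synthesize)
reply_audio = pipeline.run_turn(b"Tell me about my last orders")
```

The point of the design is that `order_bot` never changes. Voice is added by wrapping it, which is why the retrofit can be small.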

“For the first time, we’re introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,” Harris said.
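
On the client side, consuming a real-time text stream usually means appending fragments as they arrive and refreshing the display each time. The sketch below shows that accumulation loop against a simulated stream. The event shape (`type` of `delta` or `done`, plus a `text` field) is hypothetical and chosen for illustration. It is not taken from OpenAI's API reference.

```python
from typing import Iterable, Iterator


def live_transcript(events: Iterable[dict]) -> Iterator[str]:
    """Yield the transcript so far each time a new text fragment arrives.

    Assumed event shape (hypothetical, for illustration):
      {"type": "delta", "text": "..."}  a new fragment of recognized text
      {"type": "done",  "text": "..."}  the final, complete transcript
    """
    partial = ""
    for event in events:
        if event["type"] == "delta":
            partial += event["text"]
            yield partial
        elif event["type"] == "done":
            # Prefer the final text over the accumulated fragments.
            yield event["text"]
            return


# Simulated stream; a real one would arrive over the network while
# the user is still speaking.
simulated = [
    {"type": "delta", "text": "Tell me"},
    {"type": "delta", "text": " about my"},
    {"type": "delta", "text": " last orders"},
    {"type": "done", "text": "Tell me about my last orders."},
]
updates = list(live_transcript(simulated))
```

Each entry in `updates` is what a caption area would show at that moment, so the user sees text grow while still talking instead of waiting for the whole clip to finish.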

Still, for those devs looking for low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.

Pricing and availability

The new models are available immediately via OpenAI’s API, with pricing as follows:

• gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)

• gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)

• gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)

However, they arrive at a time of fiercer-than-ever competition in the AI transcription and speech space, with dedicated speech AI firms such as ElevenLabs offering their new Scribe model, which supports diarization and boasts a similarly reduced (but not as low) error rate of 3.3% in English. It’s priced at $0.40 per hour of input audio (about $0.007 per minute, roughly equivalent to gpt-4o-transcribe’s rate).
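
The per-minute and per-hour figures quoted above are easier to compare once they are put on the same basis. The sketch below does that arithmetic using only the prices in the article. The helper names are hypothetical, and the tokens-per-minute figure is merely what the two quoted prices imply, not a number OpenAI states here.

```python
# Per-minute prices as quoted in the article (USD).
PRICE_PER_MINUTE = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-mini-tts": 0.015,
}

# ElevenLabs Scribe is quoted per hour of input audio.
SCRIBE_PRICE_PER_HOUR = 0.40


def cost_usd(model, minutes):
    """Estimated cost of processing `minutes` of audio with `model`."""
    return PRICE_PER_MINUTE[model] * minutes


def implied_tokens_per_minute(price_per_million_tokens, price_per_minute):
    """Audio tokens per minute implied by a token price and a minute price."""
    return price_per_minute / price_per_million_tokens * 1_000_000


scribe_per_minute = SCRIBE_PRICE_PER_HOUR / 60
hourly_transcribe = cost_usd("gpt-4o-transcribe", 60)
tokens_per_minute = implied_tokens_per_minute(6.00, 0.006)

print(f"Scribe: ${scribe_per_minute:.4f} per minute")
print(f"gpt-4o-transcribe: ${hourly_transcribe:.2f} per hour")
print(f"Implied audio tokens per minute: {tokens_per_minute:.0f}")
```

On these figures, an hour of audio costs about $0.36 with gpt-4o-transcribe against $0.40 with Scribe, and the $6.00 per 1M tokens rate matches the per-minute estimate at roughly 1,000 audio tokens per minute.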

Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection, based entirely on the user’s instructions, not any pre-set voices. The pricing of Octave TTS isn’t directly comparable, but there’s a free tier offering 10 minutes of audio and costs increase from there between

Meanwhile, more advanced audio and speech models are also coming to the open source community, including one called Orpheus 3B, which is available with a permissive Apache 2.0 license, meaning developers don’t have to pay any costs to run it, provided they have the right hardware or cloud servers.

Business adoption and early outcomes

According to testimonials shared by OpenAI with VentureBeat, several companies have already integrated OpenAI’s new audio models into their platforms, reporting significant improvements in voice AI performance.

EliseAI, a company focused on property management automation, found that OpenAI’s text-to-speech model enabled more natural and emotionally rich interactions with tenants.

The improved voices made AI-powered leasing, maintenance, and tour scheduling more engaging, leading to higher tenant satisfaction and improved call resolution rates.

Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI’s speech recognition model.

This increase in accuracy has allowed Decagon’s AI agents to perform more reliably in real-world scenarios, even in noisy environments. The integration process was quick, with Decagon incorporating the new model into its system within a day.

Not all reactions to OpenAI’s latest release have been warm. Dawn AI app analytics software co-founder Ben Hylak (@benhylak), a former Apple human interfaces designer, posted on X that while the models seem promising, the announcement “feels like a retreat from real-time voice,” suggesting a shift away from OpenAI’s previous focus on low-latency conversational AI via ChatGPT.

Additionally, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details on the new models several minutes before the official announcement, listing the names of gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe. The leak was credited to @StivenTheDev, and the post quickly gained traction.

Looking ahead, OpenAI plans to continue refining its audio models and exploring custom voice capabilities while ensuring safety and responsible AI use. Beyond audio, OpenAI is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.

