Recently, spoken language models (SLMs) have been highlighted as a next-generation technology that surpasses the limitations of text-based language models by learning human speech without text, allowing them to understand and generate both linguistic and non-linguistic information.
However, existing models show significant limitations in generating the long-form content required for podcasts, audiobooks, and voice assistants.
Ph.D. candidate Se Jin Park, from Professor Yong Man Ro’s research team at the Korea Advanced Institute of Science and Technology’s (KAIST) School of Electrical Engineering, has succeeded in overcoming these limitations by developing “SpeechSSM,” which enables consistent and natural speech generation without time constraints.
The work has been published on the arXiv preprint server and is set to be presented at ICML (International Conference on Machine Learning) 2025.
A major advantage of SLMs is their ability to process speech directly without intermediate text conversion, leveraging the unique acoustic characteristics of human speakers and allowing high-quality speech to be generated rapidly even in large-scale models.
However, existing models have had difficulty maintaining semantic and speaker consistency in long-duration speech: capturing very detailed information means breaking speech down into fine fragments, which drives up “speech token resolution” and memory consumption.
SpeechSSM employs a “hybrid structure” that alternates “attention layers,” which focus on recent information, with “recurrent layers,” which remember the overall narrative flow (long-term context). This allows the story to flow smoothly without losing coherence even when speech is generated for a long time.
Moreover, memory usage and computational load do not increase sharply with input length, enabling stable, efficient training and the generation of long-duration speech.
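To picture the alternating design, here is a minimal sketch in PyTorch of one hybrid block. It is a sketch under stated assumptions, not the published architecture: the windowed attention span, the layer sizes, and the use of a GRU as a stand-in for the recurrent (state-space) layer are all illustrative choices.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One attention layer for recent detail plus one recurrent layer for long-range context."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 128):
        super().__init__()
        self.window = window  # local attention span (illustrative size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for the recurrent/state-space layer
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention looks only at the most recent window, capturing local detail.
        h = self.norm1(x)
        recent = h[:, -self.window:, :]
        attn_out, _ = self.attn(h, recent, recent, need_weights=False)
        x = x + attn_out
        # The recurrent layer carries long-range context in a fixed-size hidden
        # state, so memory does not grow with the length of the sequence.
        rnn_out, _ = self.rnn(self.norm2(x))
        return x + rnn_out

# Example usage:
# out = HybridBlock(512)(torch.randn(2, 1024, 512))  # -> shape (2, 1024, 512)
```

Because the recurrent half summarizes everything seen so far in a fixed-size state, stacking such blocks keeps per-token cost roughly constant as the sequence grows, which is what the article's memory claim refers to.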
SpeechSSM efficiently processes unbounded speech sequences by dividing speech data into short, fixed units (windows), processing each unit independently, and then stitching them together to create long speech, as in the loop sketched below.
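A minimal sketch of that windowed generation loop, under stated assumptions: `generate_window` is a hypothetical stand-in for the model's per-window decoding step, and the window size and window count are illustrative parameters.

```python
from typing import Callable, List

def generate_long_speech(
    prompt: List[int],
    generate_window: Callable[[List[int]], List[int]],
    window_size: int = 1024,
    num_windows: int = 60,
) -> List[int]:
    """Generate a long token sequence one fixed-size window at a time."""
    tokens: List[int] = list(prompt)
    for _ in range(num_windows):
        # Each new window is conditioned only on the most recent tokens, so
        # per-window cost stays constant no matter how long the output grows.
        context = tokens[-window_size:]
        tokens.extend(generate_window(context))
    return tokens
```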
Moreover, in the speech generation phase, it uses a “non-autoregressive” audio synthesis model (SoundStorm), which rapidly generates multiple parts at once instead of slowly creating one character or one word at a time, enabling fast generation of high-quality speech.
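SoundStorm follows a MaskGIT-style idea: start from fully masked tokens and commit the most confident predictions in parallel over a few refinement steps. The sketch below is a simplified single-sequence version; `predict_logits` is a hypothetical model call, and SoundStorm itself decodes across multiple residual codebooks.

```python
import torch

def parallel_decode(predict_logits, seq_len: int, mask_id: int, steps: int = 8) -> torch.Tensor:
    """Fill a fully masked sequence over a few parallel refinement steps."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        probs = predict_logits(tokens).softmax(dim=-1)  # (seq_len, vocab_size)
        conf, pred = probs.max(dim=-1)
        # Commit the most confident of the still-masked positions in parallel,
        # many tokens per step, rather than decoding one token at a time.
        k = max(1, int(still_masked.sum().item()) // (steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens
```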
Whereas existing models were typically evaluated on short speech of about 10 seconds, Se Jin Park created new evaluation tasks for speech generation based on the team’s self-built benchmark dataset, “LibriSpeech-Long,” for generating up to 16 minutes of speech.
In contrast to PPL (perplexity), an existing speech model evaluation metric that mainly indicates grammatical correctness, she proposed new evaluation metrics such as “SC-L (semantic coherence over time)” to assess content coherence over time and “N-MOS-T (naturalness mean opinion score over time)” to evaluate naturalness over time, enabling more effective and precise evaluation.
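As a rough illustration of what measuring semantic coherence over time can look like, in the spirit of SC-L (the paper's exact definition may differ), one can embed transcripts of successive speech windows and track their similarity to the opening prompt. Here `embed` is a hypothetical text-embedding function, and the formula is an illustrative assumption rather than the paper's metric.

```python
import numpy as np

def semantic_coherence_over_time(window_transcripts, embed):
    """Mean cosine similarity between each later window and the prompt window."""
    vecs = [np.asarray(embed(t), dtype=float) for t in window_transcripts]
    unit = [v / np.linalg.norm(v) for v in vecs]
    return float(np.mean([unit[0] @ u for u in unit[1:]]))
```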
Through these new evaluations, it was confirmed that speech generated by the SpeechSSM spoken language model consistently featured the specific people mentioned in the initial prompt, and that new characters and events unfolded naturally and in a contextually consistent way, despite the long generation time.
This contrasts sharply with existing models, which tended to lose track of their topic and lapse into repetition during long-duration generation.
Se Jin Park explained, “Existing spoken language models had limitations in long-duration generation, so our goal was to develop a spoken language model capable of generating long-duration speech for actual human use.”
She added, “This research achievement is expected to greatly contribute to various kinds of voice content creation and voice AI fields like voice assistants, by maintaining consistent content in long contexts and responding more efficiently and quickly in real time than existing methods.”
This research, with Se Jin Park as the first author, was conducted in collaboration with Google DeepMind.
More information:
Se Jin Park et al, Long-Form Speech Generation with Spoken Language Models, arXiv (2024). DOI: 10.48550/arxiv.2412.18603
Accompanying demo: SpeechSSM Publications.
Citation:
Researcher develops ‘SpeechSSM,’ opening up possibilities for a 24-hour AI voice assistant (2025, July 4)
retrieved 5 July 2025
from https://techxplore.com/news/2025-07-speechssm-possibilities-hour-ai-voice.html
