
Amazon Nova Sonic Brings Unified Speech Understanding and Generation to Amazon Bedrock
SEATTLE, April 8, 2025 — Amazon.com Inc. today introduced Amazon Nova Sonic, a new foundation model that unifies speech understanding and speech generation in a single model, enabling more human-like voice conversations in artificial intelligence (AI) applications.
Available in Amazon Bedrock via a new bi-directional streaming API, the model simplifies the development of voice applications such as customer service call automation and AI agents across a broad range of industries, including travel, education, healthcare, and entertainment.
“From the invention of the world’s best personal AI assistant with Alexa, to developing AWS services like Connect, Lex, and Polly that are used across a wide range of industries, Amazon has long believed that voice-powered applications can make all of our customers’ lives better and easier,” said Rohit Prasad, SVP of Amazon Artificial General Intelligence. “With Amazon Nova Sonic, we are releasing a new foundation model in Amazon Bedrock that makes it simpler for developers to build voice-powered applications that can complete tasks for customers with higher accuracy, while being more natural and engaging.”
Traditional approaches to building voice-enabled applications involve complex orchestration of multiple models, such as speech recognition to convert speech to text, large language models (LLMs) to understand and generate responses, and text-to-speech to convert text back to audio. This fragmented approach not only increases development complexity but also fails to preserve crucial acoustic context and nuances like tone, prosody, and speaking style that are essential for natural conversations.
Nova Sonic solves these challenges through a unified model architecture that delivers speech understanding and generation without requiring a separate model for each of these steps. This unification enables the model to adapt the generated voice response to the acoustic context (e.g., tone, style) and the spoken input, resulting in more natural dialog.
Nova Sonic even understands the nuances of human conversation, including the speaker’s natural pauses and hesitations, waiting to speak until the appropriate time, and gracefully handling barge-ins. It also generates a text transcript of the user’s speech, enabling developers to use that text to call specific tools and APIs for building voice-enabled AI agents (e.g., an AI-powered travel agent that can book flights by retrieving up-to-date flight information). These capabilities, along with its lightning-fast inference, make voice applications powered by Nova Sonic more natural and useful.
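To give a sense of how an application might interact with the bi-directional streaming API, the sketch below (in Python) shows the general shape of the exchange: audio is streamed up as it is captured, and synthesized speech and transcript events stream back as they are generated. The event names, fields, and helper functions are illustrative assumptions, not Amazon’s published schema; refer to the Amazon Bedrock documentation for the actual API.

# Illustrative sketch only: the event names and fields below are hypothetical
# placeholders rather than the published Amazon Bedrock schema. They show the
# general shape of a bidirectional streaming session: audio goes up as it is
# captured, and audio plus transcript events come back as they are generated.
import base64
import json

def build_audio_event(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM audio chunk as a JSON event for the outbound stream."""
    return json.dumps({
        "event": {
            "audioInput": {  # hypothetical event name
                "content": base64.b64encode(pcm_chunk).decode("ascii"),
            }
        }
    })

def handle_response_event(raw_event: str) -> None:
    """Dispatch an inbound event: play audio, or hand a transcript to app logic."""
    event = json.loads(raw_event).get("event", {})
    if "audioOutput" in event:  # hypothetical: a chunk of synthesized speech
        audio = base64.b64decode(event["audioOutput"]["content"])
        print(f"received {len(audio)} bytes of speech to play back")
    elif "textOutput" in event:  # hypothetical: transcript of the user's speech
        print("transcript:", event["textOutput"]["content"])

# Usage sketch: in a real application these events would flow over the
# bi-directional streaming connection rather than being handled locally.
outbound = build_audio_event(b"\x00\x01" * 160)
handle_response_event(json.dumps(
    {"event": {"textOutput": {"content": "I'd like to book a flight to Boston."}}}
))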
State-of-the-Art Accuracy and Quality
Nova Sonic has been rigorously tested against a wide range of industry standard benchmarks for speech understanding and generation, demonstrating exceptional quality and accuracy for human-like, real-time voice conversations.
The model excels in natural dialog handling, seamlessly understanding and adapting to pauses, hesitations, and interruptions while maintaining conversational context throughout the interaction. This capability contributed to strong performance on overall quality and accuracy in turn-taking tests.
Nova Sonic demonstrates strong performance on overall conversation quality compared to other models in the industry, which at this time include a select few with similar real-time conversational speech capabilities, such as OpenAI’s GPT-4o (Realtime) and Google Gemini Flash 2.0 (available via Gemini’s experimental live API).
For example, in single-turn dialogs, Nova Sonic’s American English masculine-sounding voice achieved win-rates of 51.0% and 69.7% against OpenAI’s GPT-4o (Realtime) and Google’s Gemini Flash 2.0, respectively, on the Common Eval data set. Likewise, Nova Sonic’s American English feminine-sounding voice scored win-rates of 50.9% and 66.3% against OpenAI’s GPT-4o (Realtime) and Google’s Gemini Flash 2.0, respectively, on the same data set. Nova Sonic’s British English feminine-sounding voice also performed strongly, scoring a 58.3% win-rate against OpenAI’s GPT-4o (Realtime).
Since correctly recognizing spoken words is essential to generating accurate responses, it is also important to measure Nova Sonic’s speech recognition accuracy in terms of word error rate (WER) across a wide range of languages, dialects, and accents. On the Multilingual LibriSpeech (MLS) benchmark, Nova Sonic achieved a WER of 4.2% averaged across English, French, Italian, German, and Spanish, a 36.4% relative reduction compared to OpenAI’s GPT-4o Transcribe model.
On the English utterances of the MLS data set, Nova Sonic achieves a 24.2% relative reduction in WER compared to OpenAI’s GPT-4o Transcribe model.
Nova Sonic is also robust to noisy conditions, with a 46.7% relative reduction in WER for English compared to OpenAI’s GPT-4o Transcribe model, measured on the Augmented Multi-Party Interaction (AMI) meeting benchmark, which consists of real-world noisy, multi-speaker interactions.
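For readers less familiar with relative WER comparisons, the short calculation below (in Python) shows how a relative reduction relates to absolute error rates, using the 4.2% and 36.4% figures quoted above. The competitor’s absolute WER is back-calculated purely for illustration and is not a figure reported in this announcement.

# Relative WER reduction is (baseline_wer - model_wer) / baseline_wer.
# The baseline's absolute WER is back-calculated here for illustration only,
# from the 4.2% WER and 36.4% relative reduction quoted above.
nova_sonic_wer = 0.042        # averaged across English, French, Italian, German, Spanish
relative_reduction = 0.364    # "36.4% relative lower"

implied_baseline_wer = nova_sonic_wer / (1 - relative_reduction)
print(f"implied baseline WER: {implied_baseline_wer:.1%}")  # ~6.6%

# Sanity check: plugging the implied baseline back into the definition
# reproduces the stated relative reduction.
check = (implied_baseline_wer - nova_sonic_wer) / implied_baseline_wer
print(f"relative reduction: {check:.1%}")  # ~36.4%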
Tool-Use for Function Calling and Agentic Workflows
Nova Sonic also supports tool-use for applications—like customer service call automation—that require responses to be factually grounded in enterprise data, such as pricing plans, available inventory, and schedule availability. Nova Sonic’s native tool-use also enables the model to resolve complex customer queries and complete tasks on behalf of customers, for example, “make a reservation” or “find alternate flights.”
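As a sketch of what declaring such a tool might look like for a developer, the example below (in Python) defines a hypothetical findAlternateFlights function in a Bedrock Converse-style tool specification. The tool name, parameters, and the way a specification is attached to a Nova Sonic streaming session are assumptions for illustration, not the published schema.

# Hypothetical tool declaration for function calling. The structure loosely
# follows the Amazon Bedrock Converse-style toolSpec format; the tool name,
# parameters, and how the specification is attached to a Nova Sonic streaming
# session are illustrative assumptions, not the published schema.
find_alternate_flights_tool = {
    "toolSpec": {
        "name": "findAlternateFlights",
        "description": "Look up alternate flights between two airports on a given date.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string", "description": "IATA code, e.g. SEA"},
                    "destination": {"type": "string", "description": "IATA code, e.g. BOS"},
                    "date": {"type": "string", "description": "Travel date, YYYY-MM-DD"},
                },
                "required": ["origin", "destination", "date"],
            }
        },
    }
}

def find_alternate_flights(origin: str, destination: str, date: str) -> list:
    """Stand-in for an enterprise flight-search backend; returns canned data."""
    return [{"flight": "XY123", "origin": origin, "destination": destination, "date": date}]

# When the model emits a tool-use request with these arguments, the application
# runs the function and streams the result back, so the spoken answer is
# grounded in real enterprise data.
print(find_alternate_flights("SEA", "BOS", "2025-05-01"))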
Multiple Native Voices and Speaking Styles
Nova Sonic offers three expressive voices, both masculine-sounding and feminine-sounding, now generally available in English, and generates speech in both American and British English accents. Support for additional languages and accents is coming soon.
Industry-Leading Speed and Price Performance
Nova Sonic delivers an average customer-perceived latency of 1.09 seconds from the time the customer finishes speaking to the time the system generates its first speech response, compared with 1.18 seconds for OpenAI’s GPT-4o (Realtime) and 1.41 seconds for Google’s Gemini Flash 2.0 (available via Gemini’s experimental live API), per benchmarking by Artificial Analysis.
Nova Sonic is the most cost-efficient model in the industry among models with similar real-time conversational speech capabilities and publicly available pricing. For example, it is nearly 80% less expensive than OpenAI’s GPT-4o (Realtime).
Amazon Is Committed to the Responsible Development of Artificial Intelligence
Amazon Nova models are built with integrated safety measures and protections. The company has launched AWS AI Service Cards for Nova models, offering transparent information on use cases, limitations, and responsible AI practices.
To get started with Amazon Nova models, visit: https://aws.amazon.com/nova.
Source: Amazon