OpenAI Unveils Realtime API, Enabling Low-Latency, Multimodal Experiences for Developers

OpenAI has launched the public beta of its Realtime API, which is designed for developers who want to incorporate natural, low-latency, multimodal interactions into their applications. Now accessible to all paid developers, the API enables real-time speech-to-speech conversations with minimal lag, providing a more seamless and interactive experience, the company said.

The Realtime API supports natural speech-based conversations, similar to OpenAI's ChatGPT Advanced Voice Mode, using a set of six preset voices. The company said this addition is poised to transform use cases where smoother, faster communication is paramount, from language-learning applications to customer service chatbots.

In addition to the Realtime API, OpenAI is expanding the capabilities of its Chat Completions API, which will now support audio input and output for use cases that don't require the low-latency benefits of real-time streaming. This allows developers to feed both text and audio into the GPT-4o model and receive responses in text, audio, or both.
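Based on the announcement, a mixed text-and-audio Chat Completions request might be assembled roughly as follows. The model name, `modalities` field, and message layout here are assumptions for illustration, since the feature had not yet shipped at the time of the announcement:

```python
import base64


def build_audio_chat_request(prompt_text: str, wav_bytes: bytes) -> dict:
    """Assemble a hypothetical Chat Completions request body that pairs a
    text prompt with an audio clip and asks for both text and audio back.
    Field names are illustrative, not confirmed API details."""
    return {
        "model": "gpt-4o-audio-preview",  # assumed preview model name
        "modalities": ["text", "audio"],  # request both text and audio output
        "audio": {"voice": "alloy", "format": "wav"},  # assumed output options
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_text},
                {
                    "type": "input_audio",
                    "input_audio": {
                        # raw audio is base64-encoded into the JSON body
                        "data": base64.b64encode(wav_bytes).decode("ascii"),
                        "format": "wav",
                    },
                },
            ],
        }],
    }
```

The body would then be POSTed to the Chat Completions endpoint with the usual authorization header.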

"Developers no longer have to stitch together multiple models to create conversational AI experiences," OpenAI said in its announcement. "Now, they can build with just a single API call."

Before the release of the Realtime API, developers looking to build sophisticated voice assistants had to rely on a series of separate models. Audio had to be transcribed by an automatic speech recognition system like Whisper, then processed by a text model, and finally rendered into speech using a text-to-speech engine. This complex approach often resulted in delayed interactions and loss of nuance, such as emotional tone or accent.
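The three-stage chain described above can be sketched as a simple pipeline; the stage functions are stand-ins for whatever ASR, text, and TTS services a developer wired together:

```python
from typing import Callable


def voice_pipeline(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # e.g. a Whisper-style ASR system
    respond: Callable[[str], str],        # e.g. a GPT text model
    synthesize: Callable[[str], bytes],   # e.g. a text-to-speech engine
) -> bytes:
    """The pre-Realtime voice-assistant chain. Each stage blocks on the
    previous one, so latency accumulates, and emotional tone or accent in
    the input audio is discarded at the transcription step."""
    text = transcribe(audio_in)   # speech -> text (nuance is lost here)
    reply = respond(text)         # text -> text
    return synthesize(reply)      # text -> speech
```

The Realtime API collapses these three round trips into a single streaming session.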

The Realtime API was developed to simplify this process by streaming audio inputs and outputs directly. Developers can now create conversational agents that not only speak but also handle interruptions naturally, functionality reminiscent of ChatGPT's Advanced Voice Mode. Additionally, the API supports function calling, which allows voice assistants to respond to user requests by performing actions, such as placing orders or retrieving personalized information.
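In a WebSocket session with the Realtime API, function calling works by registering tools with the session and later sending the tool's result back as a conversation item. The event shapes below are a sketch based on the announcement; the event names, the `place_order` tool, and its parameters are assumptions for illustration:

```python
import json


def session_update_event() -> dict:
    """A session.update event registering one callable tool. The tool name
    and schema here (place_order) are hypothetical examples."""
    return {
        "type": "session.update",
        "session": {
            "voice": "alloy",  # one of the preset voices
            "tools": [{
                "type": "function",
                "name": "place_order",
                "description": "Place a product order for the caller.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "sku": {"type": "string"},
                        "quantity": {"type": "integer"},
                    },
                    "required": ["sku", "quantity"],
                },
            }],
        },
    }


def function_result_event(call_id: str, result: dict) -> dict:
    """After executing the tool locally, send its output back so the model
    can speak the outcome to the user."""
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    }
```

In practice these JSON events would be sent over the session's WebSocket connection, with the model's audio streamed back as it is generated.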

OpenAI has already tested the Realtime API with a select group of partners. One of the early adopters, Healthify, a fitness and nutrition app, uses the API to enable natural, conversational interactions between users and its AI coach, Ria. Another partner, Speak, a language-learning app, uses the API to facilitate role-playing conversations, encouraging users to practice new languages in real time.

The Realtime API is now available in public beta and is powered by OpenAI's new GPT-4o model. Pricing starts at $5 per 1 million text input tokens and $100 per 1 million audio input tokens, with audio output tokens priced at $200 per million, which equates to approximately $0.24 per minute of audio output, the company said.
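Those rates can be sanity-checked with simple arithmetic; notably, the quoted $0.24 per minute implies roughly 1,200 audio output tokens per minute of generated speech:

```python
def realtime_cost_usd(text_in: int, audio_in: int, audio_out: int) -> float:
    """Cost of one exchange at the announced beta rates (USD per 1M tokens):
    text input $5, audio input $100, audio output $200."""
    return (text_in * 5 + audio_in * 100 + audio_out * 200) / 1_000_000


# $0.24 per minute divided by $200 per 1M tokens ~= 1,200 tokens per minute
TOKENS_PER_OUTPUT_MINUTE = 0.24 / (200 / 1_000_000)
```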

The expanded audio capabilities in the Chat Completions API are scheduled for release in the coming weeks. Both the Realtime API and the Chat Completions API will use the GPT-4o model, with pricing aligned across both services.

The Realtime API includes robust safety protections, including automated monitoring and human review of flagged interactions. OpenAI emphasized that these features are built on the same safety infrastructure used in ChatGPT’s voice capabilities and that developers are prohibited from using the API for malicious purposes.

OpenAI says it plans to introduce additional functionalities, including support for more input modalities like vision and video, increased rate limits, and integration with official SDKs for Python and Node.js. Developers can also expect expanded model support, with the inclusion of GPT-4o mini in future releases, the company said.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at jwaters@converge360.com.
