OpenAI introduces GPT-Realtime-2. New audio models will revolutionise the voice agent market

OpenAI has taken a significant step towards a future in which talking to a machine stops resembling an exchange over a walkie-talkie and starts resembling natural human dialogue. The San Francisco-based giant has released three new audio models to developers, models intended to finally shed the ‘AI as a simple transcriptionist’ label. This strategic shift in emphasis from text to live voice paves the way for building agents capable of listening, translating and taking action in real time.

A key component of this offensive, the GPT-Realtime-2 model, is designed to deal with the challenges that have so far undermined most voice assistants: handling interruptions and keeping track of context during long conversations that span several topics. The model’s ability to ‘call tools’ means that the AI not only answers questions but can also, for example, book an appointment in a calendar or check the status of an order mid-conversation, reacting in real time to what the caller says.
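For readers unfamiliar with the idea, tool calling can be sketched as a simple dispatch step: the model names a function and supplies arguments, and the application runs it and returns the result. The sketch below is only an illustration under stated assumptions. The tool names (`book_appointment`, `check_order_status`), the shape of the `call` dictionary and the canned return values are all hypothetical and do not represent OpenAI's actual event format.

```python
import json

# Hypothetical tools; names, parameters and return values are illustrative only.
def book_appointment(date: str, time: str) -> dict:
    # A real implementation would write to a calendar system.
    return {"status": "booked", "date": date, "time": time}

def check_order_status(order_id: str) -> dict:
    # A real implementation would query an order database.
    return {"order_id": order_id, "status": "shipped"}

# Registry mapping the tool names the model may request to local functions.
TOOLS = {
    "book_appointment": book_appointment,
    "check_order_status": check_order_status,
}

def dispatch_tool_call(call: dict) -> dict:
    """Run the tool named in `call`, passing its JSON-encoded arguments."""
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"error": f"unknown tool: {call['name']}"}
    args = json.loads(call["arguments"])
    return handler(**args)

# Assumed request shape: a tool name plus a JSON string of arguments.
result = dispatch_tool_call(
    {"name": "check_order_status", "arguments": '{"order_id": "A-1001"}'}
)
print(result)
```

In a live voice agent the returned dictionary would be handed back to the model so that it can phrase the answer for the caller.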

In parallel, OpenAI is introducing solutions dedicated to specific market needs. The GPT-Realtime-Translate model, which supports more than 70 languages, targets global industries such as tourism and education, offering almost instant translation of conversations. GPT-Realtime-Whisper, on the other hand, redefines the concept of meeting notes by providing precise live speech-to-text transcription, allowing project updates to be generated as decisions are made.

Businesses are already testing these solutions in practice. Companies such as Priceline and Zillow see them as an opportunity to revolutionise customer service and sales processes, where speed of response directly affects the bottom line. Deutsche Telekom has also joined the ranks of testers, suggesting that the telecoms sector sees OpenAI technology as an opportunity to deeply automate call centres.

OpenAI’s pricing strategy reflects its ambition for widespread adoption of these tools. While the GPT-Realtime-2 model costs $32 per million audio tokens, simpler services such as Whisper are available for a fraction of this amount, allowing companies to scale solutions flexibly. This sends a clear signal to the market: OpenAI no longer wants to be just the creator of the most popular chatbot, but the foundation on which the next generation of intelligent, talking operating systems for business will be built.
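To put the quoted rate in perspective, the arithmetic can be sketched as follows. Only the $32 per million audio tokens figure comes from the text above. The function name and the example token count are assumptions, and the sketch ignores any difference between input and output pricing because none is given here.

```python
# Rate quoted in the article for GPT-Realtime-2.
PRICE_PER_MILLION_AUDIO_TOKENS_USD = 32.0

def audio_cost_usd(audio_tokens: int) -> float:
    """Estimate cost in USD for a given number of audio tokens."""
    return audio_tokens / 1_000_000 * PRICE_PER_MILLION_AUDIO_TOKENS_USD

# Hypothetical workload of 250,000 audio tokens.
print(audio_cost_usd(250_000))  # 8.0
```

How many tokens a minute of speech consumes is not stated, so converting this into a per-call cost would need a measured figure from real traffic.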