TL;DR
OpenAI’s new Realtime API, available in public beta, enables developers to build fast, multimodal, low-latency interactions, including real-time speech-to-speech conversations, without any intermediary text processing. It integrates natural voice capabilities, supports function calling, and can trigger actions or pull context dynamically. This API is set to transform voice assistants, customer support, and language learning tools by providing smoother, more lifelike AI experiences.
Existing Challenges in Building Text and Voice Bots
Developers have long faced issues with latency and disjointed experiences when building voice-based AI tools. Prior solutions often involved chaining multiple models - one for speech-to-text conversion and another for text-to-speech generation, with a text-based language model in between. In practice, the stages looked like this: the user’s audio was transcribed to text, the text was sent to a language model to generate a reply, and the reply was synthesized back into speech.
This chaining added delays, affected conversation quality, and resulted in less natural interactions. Bots that required complex understanding, emotional nuance, or the ability to engage in real time had to trade off speed for quality, often leaving users frustrated by noticeable delays or robotic, monotone outputs.
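To make the latency problem concrete, here is a rough sketch of that cascaded approach using the openai Node SDK. The model names, voice, and file paths are illustrative and not from the original project; the point is that every stage is a separate network round trip the user has to wait for.

import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function cascadedTurn(inputAudioPath) {
  // 1. Speech-to-text: transcribe the user's audio.
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(inputAudioPath),
    model: "whisper-1",
  });

  // 2. Text reasoning: generate a reply from the transcript.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: transcription.text }],
  });
  const reply = completion.choices[0].message.content;

  // 3. Text-to-speech: synthesize the reply back into audio.
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: reply,
  });
  fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
}

Three round trips per turn is what makes the conversation feel laggy; the Realtime API collapses them into a single streaming connection.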
What Difference Does the OpenAI Realtime API Make?
OpenAI’s Realtime API addresses these long-standing challenges by enabling real-time, multimodal AI interactions that feel natural. Here are key innovations:
• Speech-to-Speech Communication: The API allows direct conversion between spoken inputs and outputs, bypassing the traditional text intermediary, significantly reducing latency and creating a more fluid conversational experience.
• Natural and Expressive Voices: The API can modify the AI’s tone, inflection, and even respond to emotional cues (e.g., laughing or whispering). This makes interactions feel more human-like, especially useful for customer service, virtual assistants, and language learning apps.
• Multimodal Flexibility: With support for both text and audio inputs and outputs, the Realtime API offers versatility. Text can be used for moderation, while audio enables smooth, faster-than-real-time responses (see the session configuration sketch below).
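For reference, voice, modalities, and turn detection are all configured through a session.update client event. The sketch below assumes an open WebSocket connection (ws) like the one we set up later in this post, and the specific values are just examples:

// Sketch: configure the session for both text and audio output.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    modalities: ["text", "audio"],         // return a transcript as well as audio
    voice: "alloy",                        // one of the built-in voices
    instructions: "You are a friendly, concise assistant.",
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    turn_detection: { type: "server_vad" } // let the server detect when the user stops speaking
  }
}));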
Moreover, the Realtime API integrates with tools like Twilio for seamless voice call capabilities, allowing developers to create applications that act on behalf of users, such as placing orders or retrieving personalized data.
Despite its advantages, the API’s pricing may be a hurdle for small developers. At $0.06 per minute of audio input and $0.24 per minute of audio output, it is an investment, but one that brings significant improvements to the quality and speed of voice-enabled AI systems.
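As a rough worked example at those rates, a ten-minute call split evenly between user and assistant audio comes to about 5 × $0.06 + 5 × $0.24 = $1.50, before counting any text tokens.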
Here’s how you can get started with some simple Node.js code. You can find a working project on GitHub here: https://github.com/AwaisKamran/openai-realtime-api
1. Establishing a WebSocket connection with OpenAI
The Realtime API runs over WebSockets, so you have to establish a secure connection; for that I’m using the ws package: https://www.npmjs.com/package/ws. You’ll need to create a Node.js project and add a .env file containing your OPENAI_API_KEY (and, since the code below sends an OpenAI-Project header, a PROJECT value), which is required to authorize the WebSocket connection.
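Your .env file would look something like this (placeholder values, not real credentials):

OPENAI_API_KEY=sk-your-api-key
PROJECT=proj_your-project-id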
import dotenv from 'dotenv';
import WebSocket from "ws";

dotenv.config();

// The Realtime API is reached over a secure WebSocket, scoped to a specific model snapshot.
const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01";

const ws = new WebSocket(url, {
  headers: {
    "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
    "OpenAI-Beta": "realtime=v1",
    "OpenAI-Project": process.env.PROJECT
  },
});

// Surface connection-level failures (bad key, network issues, etc.).
ws.onerror = function (error) {
  console.error('WebSocket Error: ', error.message);
};

// Once the socket is open, start prompting the user.
ws.on("open", function open() {
  console.log("Connected to server.");
  startConversation();
});

// Every server event arrives as a message on this socket.
ws.on("message", incomingMessage);
2. Setting up event constants & the readline interface
Next, you need to set up a few event constants that identify where the streamed result is in its lifecycle. The details around these events are listed in the OpenAI documentation here: https://platform.openai.com/docs/api-reference/realtime-client-events/session-update.
The response.text.delta event indicates that your response is being generated (streamed), with each delta carrying a chunk of text, while response.text.done indicates that the response is complete. Additionally, the readline interface is just a way to read user prompts from the command line.
import readline from 'readline';

// Server events marking a streamed text chunk and the end of a response.
const RESPONSE_TYPE_DELTA = "response.text.delta";
const RESPONSE_TYPE_DONE = "response.text.done";

// Accumulates the streamed text for the current response.
let data = "";

// Simple command-line interface for reading the user's prompts.
const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});
3. Setting up your WebSocket callback methods
// Ask the user for a prompt, then request a text-only response from the model.
const startConversation = () => {
  rl.question('Enter your prompt here: ', (prompt) => {
    /* Send Client Event */
    ws.send(JSON.stringify({
      type: "response.create",
      response: {
        modalities: ["text"],
        instructions: `You are a high school math tutor,
          help the user with this question - ${prompt}.`,
      }
    }));
  });
}
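Note that this example passes the user’s prompt through the instructions field of response.create, which works fine for a single-question tutor. The Realtime client events also let you add the prompt as a proper user message first and then request a response; here is a rough sketch of that variant, reusing the same ws and prompt from above:

// Sketch: add the prompt to the conversation, then ask for a response.
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [{ type: "input_text", text: prompt }]
  }
}));
ws.send(JSON.stringify({
  type: "response.create",
  response: { modalities: ["text"] }
}));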
// Handle incoming server events: accumulate text deltas, print the full
// response once it is done, then prompt the user again.
const incomingMessage = (message) => {
  try {
    const response = JSON.parse(message.toString());
    if (response.type === RESPONSE_TYPE_DELTA) {
      const { delta } = response;
      data += delta;
    }
    else if (response.type === RESPONSE_TYPE_DONE) {
      console.log(data);
      console.log("\n");
      data = "";
      startConversation();
    }
  }
  catch (ex) {
    console.error(ex.toString());
  }
}
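One small gotcha: these snippets use ES module import syntax, so your package.json needs "type": "module" (or the file needs a .mjs extension). A minimal package.json with illustrative version ranges is shown below; then run the script with node (e.g., node index.js, assuming that is your entry file).

{
  "type": "module",
  "dependencies": {
    "dotenv": "^16.4.5",
    "ws": "^8.18.0"
  }
}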
Final Thoughts
This API is set to be a game-changer, allowing developers to build apps where real-time, natural conversations with AI can transform user experiences across industries like customer service, health, education, and more.