Use voice mode

You can use Langflow's voice mode to interact with your flows verbally through a microphone and speakers.

Prerequisites

Voice mode requires the following:

  • A flow with Chat Input, Language Model, and Chat Output components.

    If your flow has an Agent component, make sure the tools in your flow have accurate names and descriptions to help the agent choose which tools to use.

    Additionally, be aware that voice mode overrides typed instructions in the Agent component's Agent Instructions field.

  • An OpenAI account and an OpenAI API key. Langflow uses the OpenAI API to process voice input and generate responses.

  • Optional: An ElevenLabs API key to enable voice options for the LLM's response.

  • A microphone and speakers.

    A high-quality microphone and minimal background noise are recommended for optimal voice comprehension.

Test voice mode in the Playground

In the Playground, click the Microphone icon to enable voice mode, and then interact with your flows verbally through your microphone and speakers.

The following steps use the Simple Agent template to demonstrate how to enable voice mode:

  1. Create a flow based on the Simple Agent template.

  2. Add your OpenAI API key credentials to the Agent component.

  3. Click Playground.

  4. Click the Microphone icon to open the Voice mode dialog.

  5. Enter your OpenAI API key, and then click Save. Langflow saves the key as a global variable.

  6. If prompted, allow microphone access. Voice mode can't receive verbal input while microphone access is blocked.

  7. For Audio Input, select the input device to use with voice mode.

  8. Optional: Add an ElevenLabs API key to enable more voices for the LLM's response. Langflow saves this key as a global variable.

  9. For Preferred Language, select the language you want to use for your conversations with the LLM. This option changes both the expected input language and the response language.

  10. Speak into your microphone to start the chat.

    If configured correctly, the waveform in the voice mode dialog registers your input, and then the agent's reasoning and response are spoken aloud and printed in the Playground.

Develop applications with websocket endpoints

Langflow exposes two OpenAI Realtime API-compatible websocket endpoints for your flows. You can build applications against these endpoints the same way you would build against OpenAI Realtime API websockets.

The Langflow API's websocket endpoints require an OpenAI API key for authentication, and they optionally support ElevenLabs voices when you provide an ElevenLabs API key.

Additionally, both endpoints require that you provide the flow ID in the endpoint path.
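For example, the following Python sketch opens one of these endpoints and logs incoming events. It assumes a Langflow server on the default local port (7860) and the third-party websockets package; the flow ID is a hypothetical placeholder, and any path prefix or authentication details depend on your deployment.

```python
# A minimal connection sketch, assuming a local Langflow server on the
# default port (7860). Adjust the host, port, and any path prefix to match
# your deployment; the flow ID below is a hypothetical placeholder.
import asyncio

import websockets  # pip install websockets

FLOW_ID = "your-flow-id"
URL = f"ws://localhost:7860/ws/flow_as_tool/{FLOW_ID}"


async def main() -> None:
    async with websockets.connect(URL) as ws:
        print("Connected to", URL)
        # Events are exchanged as JSON text frames that follow the
        # OpenAI Realtime API event shapes.
        async for message in ws:
            print("Event:", message)


asyncio.run(main())
```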

Voice-to-voice audio streaming

The /ws/flow_as_tool/$FLOW_ID endpoint establishes a connection to OpenAI Realtime voice, and then invokes the specified flow as a tool when the OpenAI Realtime model decides to call it.

This approach is ideal for low-latency applications, but it is less deterministic because the OpenAI voice-to-voice model determines when to call your flow.
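As an illustration of how a client might drive this endpoint, the following sketch sends one base64-encoded PCM16 audio chunk and collects the spoken reply. The event names (input_audio_buffer.append, response.audio.delta, response.done) follow OpenAI Realtime API conventions; treat the exact event vocabulary your Langflow version emits as an assumption to verify against its API reference.

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

FLOW_ID = "your-flow-id"  # hypothetical placeholder
URL = f"ws://localhost:7860/ws/flow_as_tool/{FLOW_ID}"


async def stream_audio(pcm_chunk: bytes) -> bytes:
    """Send one audio chunk and collect the spoken response."""
    async with websockets.connect(URL) as ws:
        # Append raw PCM16 audio to the input buffer, base64-encoded,
        # following the OpenAI Realtime API event shape. Whether you must
        # also commit the buffer depends on the session's turn-detection
        # settings.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_chunk).decode("ascii"),
        }))

        # The Realtime model decides when to invoke the flow as a tool;
        # the spoken reply arrives as incremental audio delta events.
        response_audio = bytearray()
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.audio.delta":
                response_audio.extend(base64.b64decode(event["delta"]))
            elif event.get("type") == "response.done":
                break
        return bytes(response_audio)


# 100 ms of silence at 24 kHz mono PCM16, purely to illustrate the call.
asyncio.run(stream_audio(b"\x00\x00" * 2400))
```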

Speech-to-text audio transcription

The /ws/flow_tts/$FLOW_ID endpoint converts audio to text using OpenAI Realtime voice transcription, and then directly invokes the specified flow for each transcript.

This approach is more deterministic, but it has higher latency. It is the mode used in the Langflow Playground.
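If you build against this endpoint, your client mainly watches for completed transcripts and the flow's responses. The sketch below prints transcripts as they complete; the transcription event name mirrors the OpenAI Realtime API and is an assumption to verify for your Langflow version.

```python
import asyncio
import json

import websockets  # pip install websockets

FLOW_ID = "your-flow-id"  # hypothetical placeholder
URL = f"ws://localhost:7860/ws/flow_tts/{FLOW_ID}"


async def watch_transcripts() -> None:
    async with websockets.connect(URL) as ws:
        async for raw in ws:
            event = json.loads(raw)
            etype = event.get("type", "")
            if etype == "conversation.item.input_audio_transcription.completed":
                # Each completed transcript triggers one flow invocation.
                print("Transcript:", event.get("transcript"))
            else:
                # Log everything else while developing to learn the event
                # vocabulary your server actually emits.
                print("Event:", etype)


asyncio.run(watch_transcripts())
```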

Session IDs for websocket endpoints

Both endpoints accept an optional /$SESSION_ID path parameter to provide a unique ID for the conversation. If omitted, Langflow uses the flow ID as the session ID.

However, be aware that voice mode only maintains context within the current conversation instance. When you close the Playground or end a chat, verbal chat history is discarded and not available for future chat sessions.
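To supply your own session ID, append it as the final path segment, as in this sketch; all values here are hypothetical.

```python
FLOW_ID = "your-flow-id"        # hypothetical placeholder
SESSION_ID = "support-chat-42"  # hypothetical session identifier

# Both endpoints accept the same optional trailing session ID segment.
url = f"ws://localhost:7860/ws/flow_tts/{FLOW_ID}/{SESSION_ID}"
```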
