Voice Comes to the Backboard API

Backboard now supports voice in the API, letting developers add STT, TTS, and full audio conversations to existing assistants and threads using OpenAI or ElevenLabs.

Backboard has been a clean way to orchestrate LLMs across providers. Until now, that has meant text.

Today we are adding first-class voice support to the Backboard API:

Speech to Text (STT) with OpenAI and ElevenLabs
Text to Speech (TTS) with OpenAI and ElevenLabs
Three simple modes:
1. Audio in → text out
2. Text in → audio out
3. Audio in → audio out

There is no new API to wire up. You keep using add_message / addMessage, add a voice object, and (for STT) pass an audio_file / audioFile. The SDKs handle multipart uploads, streaming, and provider quirks for you.

Voice is just another message capability

The core design choice: voice is part of the same assistants and threads model you already use.

You still:

Create an assistant
Create a thread
Call add_message / addMessage

Now you can include:

voice.stt to transcribe user audio
voice.tts to speak the LLM reply
audio_file / audioFile pointing at a local file for STT

Backboard then runs the pipeline:

Runs STT with your chosen provider and model
Routes the transcript into your configured LLM
Optionally runs TTS on the reply
Returns text plus a voice_records object with transcripts, audio URLs, durations, and token usage

You can even mix providers, like ElevenLabs STT with OpenAI TTS, in a single request.
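As a rough illustration of what a mixed-provider voice object could look like, here is a minimal sketch. The stt and tts keys mirror voice.stt and voice.tts from this post; the provider and model key names, and the two helper functions, are assumptions for illustration and may not match the SDK exactly.

```python
from typing import Optional


def direction(provider: str, model: str) -> dict:
    """One direction (STT or TTS) of the voice object.

    The "provider" and "model" key names are assumed, not confirmed.
    """
    return {"provider": provider, "model": model}


def build_voice(stt: Optional[dict] = None, tts: Optional[dict] = None) -> dict:
    """Assemble a voice object containing only the directions requested."""
    voice = {}
    if stt is not None:
        voice["stt"] = stt
    if tts is not None:
        voice["tts"] = tts
    if not voice:
        raise ValueError("voice needs at least one of stt or tts")
    return voice


# ElevenLabs transcribes and OpenAI speaks, in the same request.
voice = build_voice(
    stt=direction("elevenlabs", "scribe_v2"),
    tts=direction("openai", "gpt-4o-mini-tts"),
)
```

You would pass a dict like this as the voice argument of add_message, together with audio_file for the STT side. Check the SDK reference for the exact field names.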

Providers, models, and modes

Providers and models

Speech to Text (STT)

OpenAI: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize
ElevenLabs: scribe_v1, scribe_v2

Text to Speech (TTS)

OpenAI: tts-1, tts-1-hd, gpt-4o-mini-tts
ElevenLabs: eleven_v3, eleven_multilingual_v2, eleven_flash_v2_5, eleven_turbo_v2_5, and more

You select provider and model per message, per direction.

Three modes in a single API

Mode 1: STT + LLM

Audio in, text out

Send an audio file, get a transcript and the LLM reply.

Use this for call summaries, voice notes to tasks, or “talk instead of type” UX.

Mode 2: LLM + TTS

Text in, audio out

Send text, get the LLM reply plus a presigned audio URL.

Great for voice-enabled agents, read-aloud summaries, and accessibility.

Mode 3: STT + LLM + TTS

Audio in, audio out

Full voice conversation in a single call.

This is the fastest path from “text assistant” to “phone-like conversation.”
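One way to think about the three modes is that the contents of the voice object pick the mode. The sketch below illustrates that mapping. It assumes the mode is implied by which of voice.stt and voice.tts is present; pipeline_mode is a hypothetical helper, not a Backboard function.

```python
def pipeline_mode(voice: dict, has_audio_file: bool) -> str:
    """Name the pipeline implied by a voice object (assumed mapping)."""
    wants_stt = "stt" in voice
    wants_tts = "tts" in voice
    if wants_stt and not has_audio_file:
        # STT has nothing to transcribe without an audio file.
        raise ValueError("voice.stt needs an audio file to transcribe")
    if wants_stt and wants_tts:
        return "STT + LLM + TTS"  # Mode 3: audio in, audio out
    if wants_stt:
        return "STT + LLM"        # Mode 1: audio in, text out
    if wants_tts:
        return "LLM + TTS"        # Mode 2: text in, audio out
    return "LLM only"             # no voice settings: plain text message


mode = pipeline_mode({"stt": {}, "tts": {}}, has_audio_file=True)
```

The same add_message call covers all three cases; only the voice object and the presence of an audio file change.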

Streaming without rebuilding your stack

You can stream voice pipelines over the same API by setting stream=True.

Backboard emits:

STT events: stt_stream_start, stt_text_delta, stt_stream_end
LLM events: content_streaming
TTS events: tts_stream_start, tts_audio_chunk, tts_stream_end

A typical client shows STT and LLM deltas as they arrive and plays TTS audio chunk by chunk.
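Here is a minimal sketch of such a consumer, run against a simulated event list rather than a live Backboard stream. The event names come from the list above. The event shape is an assumption: dicts with a type key, text and content fields for deltas, and base64-encoded audio in an audio field. The real SDK may deliver events differently.

```python
import base64
from typing import Callable, Iterable


def consume_voice_stream(events: Iterable[dict],
                         play_chunk: Callable[[bytes], None]) -> dict:
    """Collect STT and LLM deltas and hand audio chunks to a player."""
    transcript, reply = [], []
    audio_bytes = 0
    for event in events:
        kind = event.get("type")
        if kind == "stt_text_delta":
            transcript.append(event["text"])       # assumed field name
        elif kind == "content_streaming":
            reply.append(event["content"])         # assumed field name
        elif kind == "tts_audio_chunk":
            chunk = base64.b64decode(event["audio"])  # assumed encoding
            audio_bytes += len(chunk)
            play_chunk(chunk)
        # *_stream_start and *_stream_end markers are ignored here.
    return {
        "transcript": "".join(transcript),
        "reply": "".join(reply),
        "audio_bytes": audio_bytes,
    }


def b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


# Simulated events standing in for a real stream=True response.
fake_events = [
    {"type": "stt_stream_start"},
    {"type": "stt_text_delta", "text": "What is "},
    {"type": "stt_text_delta", "text": "the weather?"},
    {"type": "stt_stream_end"},
    {"type": "content_streaming", "content": "Sunny, "},
    {"type": "content_streaming", "content": "22 degrees."},
    {"type": "tts_stream_start"},
    {"type": "tts_audio_chunk", "audio": b64(b"\x01\x02\x03")},
    {"type": "tts_audio_chunk", "audio": b64(b"\x04\x05")},
    {"type": "tts_stream_end"},
]

played = []  # stands in for an audio player
result = consume_voice_stream(fake_events, played.append)
```

In a real client, play_chunk would write to an audio output buffer, and the deltas would update the UI instead of being joined at the end.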

For the special case of real-time microphone audio to ElevenLabs scribe_v2_realtime, we also expose a dedicated Real Time Voice WebSocket API. For everything else, add_message streaming is enough.

Provider options without vendor lock-in

You still have access to provider specific features through provider_options, without learning each raw API.

Examples:

OpenAI STT: response_format, timestamp_granularities, temperature, prompt
ElevenLabs STT: diarize, num_speakers, and event tagging
OpenAI TTS: speed, instructions for tone
ElevenLabs TTS: voice_settings, language_code, and more

Backboard normalizes everything into voice_records.stt and voice_records.tts, with provider_output available if you need the raw response.
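To make this concrete, the sketch below attaches provider_options to each direction of a voice object and flags any option that is not named in this post. Nesting provider_options inside each direction is an assumption, as are the provider and model key names. Because the lists above are not exhaustive, an unlisted option is something to double-check, not necessarily an error. The pitch option is deliberately made up to show the check.

```python
# Option names taken from the examples in this post (not exhaustive).
KNOWN_OPTIONS = {
    ("openai", "stt"): {"response_format", "timestamp_granularities",
                        "temperature", "prompt"},
    ("elevenlabs", "stt"): {"diarize", "num_speakers"},
    ("openai", "tts"): {"speed", "instructions"},
    ("elevenlabs", "tts"): {"voice_settings", "language_code"},
}


def unlisted_options(provider: str, direction: str, options: dict) -> list:
    """Return option names that this post does not mention."""
    known = KNOWN_OPTIONS.get((provider, direction), set())
    return sorted(set(options) - known)


voice = {
    "stt": {
        "provider": "elevenlabs",
        "model": "scribe_v2",
        "provider_options": {"diarize": True, "num_speakers": 2},
    },
    "tts": {
        "provider": "openai",
        "model": "gpt-4o-mini-tts",
        # "pitch" is a hypothetical option used to demonstrate the check.
        "provider_options": {"speed": 1.1,
                             "instructions": "Warm, concise tone.",
                             "pitch": 2},
    },
}

to_review = {
    name: unlisted_options(cfg["provider"], name, cfg["provider_options"])
    for name, cfg in voice.items()
}
```

Here to_review maps each direction to the options worth checking against the provider documentation before you send the request.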

Audio formats, languages, and limits

STT input formats:

OpenAI: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm, up to 25 MB
ElevenLabs: all major audio and video formats, up to 3 GB

TTS output formats:

OpenAI: mp3 (default), opus, aac, flac, wav, pcm
ElevenLabs: mp3_*, pcm_*, opus_*, wav_*, ulaw_*, alaw_* variants

Languages use ISO 639-1 codes like en, es, fr. Pass language to force one, or omit it to let providers auto-detect.
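Since the OpenAI STT limits are strict, it can help to check a file before uploading it. The following is a minimal sketch of such a preflight check, using the format list and the 25 MB limit above. It assumes that 25 MB means 25 * 1024 * 1024 bytes and that the file extension reflects the real format; check_openai_stt_input is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path

OPENAI_STT_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga",
                         "m4a", "ogg", "wav", "webm"}
# Assumption: "25 MB" is treated as 25 MiB.
OPENAI_STT_MAX_BYTES = 25 * 1024 * 1024


def check_openai_stt_input(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks fine."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in OPENAI_STT_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > OPENAI_STT_MAX_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, limit is {OPENAI_STT_MAX_BYTES}"
        )
    return problems


ok = check_openai_stt_input("call.mp3", 1_000_000)
bad_format = check_openai_stt_input("call.aiff", 10)
too_big = check_openai_stt_input("meeting.wav", 26 * 1024 * 1024)
```

For a file on disk, Path(filename).stat().st_size gives the size to pass in. Files that fail the check could be routed to ElevenLabs STT, which accepts larger inputs.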

Note: OpenAI requires you to clearly tell users that OpenAI TTS voices are AI generated, not real people. Make sure your UX and terms reflect that.

Why we built it this way

This release follows the same principles as the rest of Backboard:

One API, many providers: you can swap or mix OpenAI and ElevenLabs without rewriting your application.
Pipelines are first class: STT, LLM, TTS, and streaming events are modeled as a single pipeline, not three separate integrations.
Cost and token visibility: voice_records includes transcript length, durations, token counts, and audio tokens, so you can reason about usage instead of guessing.
Same data posture across modalities: voice data follows the same rules as text. We do not train on your content or your users’ content.

Getting started

If you are already on the Backboard SDKs:

Upgrade to the latest Python or JS/TS SDK.
Add a voice object to your next add_message / addMessage call.
Include audio_file / audioFile for STT.
Turn on stream=True if you want deltas and audio chunks.

If you are not using Backboard yet and want to experiment with voice, reach out for access and sample projects. You can start with a simple “audio in, text out” assistant and grow into full voice conversations once it proves useful.


