Grok Speech to Text and Text to Speech APIs: What Changed

If you have been searching for the new Grok speech to text or Grok text to speech APIs, the short answer is this:

On April 17, 2026, xAI turned Grok’s audio stack into two standalone APIs for developers: one for speech-to-text and one for text-to-speech.

That matters more than the product copy suggests.

xAI already had a Voice Agent API for real-time speech conversations. The April 17 release matters because it breaks out the stack into narrower building blocks that are easier to use when you do not want a full live agent loop. (xAI announcement, xAI Voice APIs docs)

The practical read is simple:

xAI is no longer only selling “talk to Grok.” It is selling separate transcription and synthesis primitives that developers can plug into support tools, accessibility features, audio workflows, and voice-enabled agents.

What xAI actually shipped on April 17

According to xAI’s April 17, 2026 announcement, the company launched:

Grok Speech to Text (STT)
Grok Text to Speech (TTS)

These are presented as standalone audio APIs built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. (xAI announcement)

The release is not a vague “audio support” update. The docs and announcement describe concrete product surfaces:

STT supports batch and streaming transcription
TTS supports REST and WebSocket generation
STT is exposed at /v1/stt
TTS is exposed at /v1/tts
xAI positions both as part of a broader Voice APIs family alongside the pre-existing Voice Agent API at /v1/realtime (xAI Voice APIs docs, xAI Text to Speech docs, xAI Speech to Text docs)
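As a rough illustration of what calling the standalone TTS surface might look like, here is a minimal sketch that only builds the HTTP request and does not send it. The /v1/tts path comes from the docs cited above; the https://api.x.ai host, the Bearer-token header, and the JSON field names in the example payload are assumptions made for illustration, so check the official reference before relying on any of them.

```python
import json
import urllib.request

BASE_URL = "https://api.x.ai"  # assumption: host is not stated in this article
TTS_PATH = "/v1/tts"           # path as listed in the xAI docs cited above


def build_tts_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the TTS endpoint."""
    return urllib.request.Request(
        BASE_URL + TTS_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: bearer auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical payload: the field names "text" and "voice" are placeholders,
# not confirmed parameter names.
example = build_tts_request({"text": "Hello from Grok", "voice": "eve"}, "YOUR_API_KEY")
print(example.full_url)
```

Keeping request construction separate from sending makes it easy to swap in the real field names once you have confirmed them.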

That separation matters because a lot of product teams do not need a fully conversational voice agent. They need one of these instead:

reliable transcription for meetings, calls, or uploads
speech generation for product narration or accessibility
audio I/O as a component inside an existing app workflow

Why the release matters for developers

The interesting part is not that xAI now has audio. Most serious AI platforms already do.

The more important shift is that xAI is making its voice stack modular.

Before this, the cleaner way to think about xAI audio was “realtime voice agent.” After April 17, the better framing is “separate APIs for speech input, speech output, and full-duplex voice agents.” That is a more useful product structure for actual software teams.

If you are building:

support and call-assist tools
voice notes or meeting capture
accessibility features
narration or podcast workflows
voice-enabled AI agents

this release is easier to adopt than a single monolithic voice endpoint.

It is the same broader shift we have been seeing across developer AI tooling: the products that win are increasingly the ones that expose smaller, composable pieces instead of forcing every workflow through one flagship interface. That is also why workflow scaffolding matters in the tools covered in our guides How to Use OpenAI Agents SDK with MCP and Approvals and How to Use LiteLLM with OpenAI, Claude, and Gemini.

What Grok Speech to Text includes

xAI’s docs describe the Speech to Text API as an audio-to-text service with both REST and streaming modes. The model/pricing page lists:

REST pricing: $0.10 per hour
Streaming pricing: $0.20 per hour
support for WAV, MP3, WebM, OGG, and M4A
multiple languages
real-time interim results for streaming
100 concurrent streaming sessions per team (xAI Speech to Text docs)
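The 100-session streaming cap is the kind of limit that is easier to respect on the client side than to discover through rejected connections. A minimal sketch, assuming each streaming session is wrapped in your own async function (the `jobs` callables here are hypothetical stand-ins, not part of any xAI SDK):

```python
import asyncio

MAX_STREAMS = 100  # per-team streaming limit quoted from the docs above


async def run_with_limit(jobs, limit=MAX_STREAMS):
    """Run async job factories with at most `limit` running at the same time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot before starting this session.
        async with sem:
            return await job()

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

If several services share one team account, each would need a smaller share of the limit, since the semaphore only counts sessions inside a single process.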

The announcement adds a few details that matter in real workflows:

25+ language support
multichannel transcription
diarization with word-level speaker IDs
formatting that can normalize spoken entities such as phone numbers and amounts into cleaner structured text (xAI announcement)

That makes the product more interesting than a bare transcript API.

For developer workflows, the useful cases are not just “transcribe audio.” They are:

turning support calls into structured notes
extracting cleaner data from spoken business workflows
powering live transcription in voice apps
separating speakers without building that layer yourself
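To make the diarization point concrete, here is a small sketch of how word-level speaker IDs could be collapsed into readable speaker turns. The input shape (a list of dicts with "text" and "speaker" keys) is an assumption for illustration only; the real response format may use different field names.

```python
from itertools import groupby


def words_to_turns(words):
    """Collapse consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        text = " ".join(w["text"] for w in group)
        turns.append({"speaker": speaker, "text": text})
    return turns


# Hypothetical word-level output from a diarized transcript.
sample = [
    {"text": "Hi", "speaker": 0},
    {"text": "there", "speaker": 0},
    {"text": "Hello", "speaker": 1},
    {"text": "again", "speaker": 0},
]

for turn in words_to_turns(sample):
    print(f"Speaker {turn['speaker']}: {turn['text']}")
```

Because `groupby` only merges adjacent items, a speaker who talks twice with an interruption in between stays as two separate turns, which is usually what call notes need.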

What Grok Text to Speech includes

xAI’s Text to Speech docs describe a separate API for turning text into spoken audio with:

5 voices
REST and WebSocket output
formats including MP3, WAV, PCM, mu-law, and A-law
a 15,000-character limit per unary request
support for 20 listed languages via BCP-47 codes
speech tags for delivery control such as pauses, emphasis, whispering, laughter, and pacing changes (xAI Text to Speech docs)

The available voices are named eve, ara, rex, sal, and leo, with xAI positioning them for different tones such as upbeat, conversational, professional, balanced, and authoritative. (xAI Text to Speech docs)
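The 15,000-character limit means longer scripts have to be split before synthesis. One possible approach, sketched below, packs whole sentences into chunks and hard-splits only when a single sentence exceeds the limit. It assumes the limit is counted the way Python's `len` counts characters, which the docs cited here do not confirm.

```python
import re

TTS_CHAR_LIMIT = 15_000  # per unary request, per the docs cited above


def chunk_text(text, limit=TTS_CHAR_LIMIT):
    """Split text into chunks of at most `limit` characters, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the audio segments concatenated in order. Splitting on sentence boundaries keeps pauses natural at the joins.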

Pricing is also clear:

$4.20 per 1 million characters for TTS (xAI announcement, xAI Models and Pricing docs)

That part matters because voice pricing often gets murky fast. Here, the pricing shape is unusually legible: pay by audio hour for transcription, pay by character for synthesis.
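The listed rates are simple enough to turn into a back-of-the-envelope estimator. This sketch uses only the prices quoted in this article and assumes linear billing with no minimums, rounding, or volume tiers, none of which are confirmed here.

```python
STT_REST_PER_HOUR = 0.10      # USD per audio hour, batch transcription
STT_STREAM_PER_HOUR = 0.20    # USD per audio hour, streaming transcription
TTS_PER_MILLION_CHARS = 4.20  # USD per 1 million characters


def stt_cost(audio_seconds, streaming=False):
    """Estimated transcription cost in USD for a given audio duration."""
    rate = STT_STREAM_PER_HOUR if streaming else STT_REST_PER_HOUR
    return audio_seconds / 3600 * rate


def tts_cost(characters):
    """Estimated synthesis cost in USD for a given character count."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


print(f"500 h batch STT:     ${stt_cost(500 * 3600):.2f}")
print(f"500 h streaming STT: ${stt_cost(500 * 3600, streaming=True):.2f}")
print(f"250k chars TTS:      ${tts_cost(250_000):.2f}")
```

At these rates, 500 hours of batch audio works out to $50, the same audio streamed to $100, and 250,000 characters of speech to $1.05.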

The real differentiation is workflow fit, not just benchmarks

xAI’s announcement includes benchmark claims for STT against vendors such as ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, video or podcasts, and telephony. Those are worth noting, but they should not be the main reason you evaluate the product. (xAI announcement)

The stronger reason to care is workflow fit.

The April 17 release gives developers three separate ways to use xAI audio:

Speech to Text when input audio is the main problem
Text to Speech when output voice is the main problem
Voice Agent API when you need a live speech-to-speech loop with tools like web search and function calling (xAI Voice APIs docs, xAI Models and Pricing docs)

That is a more credible platform shape than “one voice demo that does everything.”

It also means teams can adopt xAI audio incrementally. You can start with transcription or synthesis first, then add a realtime agent layer later if you actually need it.
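The three-way split can be written down as a small decision helper. This is only a sketch of the reasoning above: the function and parameter names are hypothetical, and the endpoint paths are the ones listed in the xAI docs cited earlier.

```python
def pick_endpoints(needs_transcription, needs_synthesis, live_conversation=False):
    """Return the xAI audio endpoints that match a workload's requirements."""
    if live_conversation:
        # A live speech-to-speech loop is the Voice Agent API's job.
        return ["/v1/realtime"]
    endpoints = []
    if needs_transcription:
        endpoints.append("/v1/stt")
    if needs_synthesis:
        endpoints.append("/v1/tts")
    return endpoints


print(pick_endpoints(needs_transcription=True, needs_synthesis=False))
print(pick_endpoints(True, True, live_conversation=True))
```

A workload that needs both transcription and synthesis but not a live loop gets both standalone endpoints, which matches the incremental adoption path: start with one, add the other, and move to the realtime API only when the conversation itself has to be live.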

One useful nuance: this is not the same thing as the Voice Agent API

This point is easy to miss.

xAI already had a Voice Agent API available before this release. The docs describe it as a realtime voice interface with tool use and pricing of $0.05 per minute. (xAI Models and Pricing docs, xAI Voice APIs docs)

The April 17 story is different:

the new APIs are standalone
they target single-purpose audio workloads
they avoid forcing every use case through a conversational session model

That makes them easier to slot into existing products where you already have your own orchestration layer.

Where these APIs fit in real workflows

The best use cases look fairly practical.

Grok STT makes sense for:

customer support transcription
meeting or interview capture
speaker-separated call analytics
voice note ingestion
pre-processing audio before sending text into an agent framework

Grok TTS makes sense for:

accessibility and read-aloud features
narrated product flows
dynamic voice responses inside apps
podcast or video narration tooling
outbound voice layers for agents

The full xAI voice stack makes more sense when:

you need speech in and speech out
tool invocation matters during the conversation
you want a realtime assistant rather than a one-way media API

This is the distinction developers should focus on. Not every voice feature needs a voice agent. Sometimes you just need solid transcription or controllable output audio.

Should developers switch immediately?

Not automatically.

If you already have a stable audio stack with strong domain tuning, migration costs may outweigh the upside. The better question is whether xAI now deserves a place on your shortlist.

For many teams, the answer is yes, because the April 17 release finally gives xAI a more standard developer-facing audio shape:

standalone endpoints
straightforward pricing
batch and streaming modes
multilingual support
controllable speech output
a clear path up to a full voice agent product

That is enough to make it worth testing, especially if you are already evaluating xAI elsewhere in your stack.

Bottom line

The April 17, 2026 Grok audio release matters because xAI stopped treating voice as only a bundled agent experience.

By splitting the stack into Speech to Text, Text to Speech, and the existing Voice Agent API, xAI now has a more usable audio platform for developers building real products.

That does not make Grok the default choice for every voice workload. But it does make xAI materially more relevant for teams building:

voice-enabled agents
support tooling
accessibility features
transcription pipelines
audio-first product experiences

The right takeaway is not “xAI added another feature.”

It is that xAI now has a more credible developer audio stack than it did before April 17.

Charles Jasthyn De La Cueva / Founder of Open-TechStack
