If you have been searching for the new Grok speech to text or Grok text to speech APIs, the short answer is this:
On April 17, 2026, xAI turned Grok’s audio stack into two standalone APIs for developers: one for speech-to-text and one for text-to-speech.
That matters more than the product copy suggests.
xAI already had a Voice Agent API for real-time speech conversations. The April 17 release matters because it breaks out the stack into narrower building blocks that are easier to use when you do not want a full live agent loop. (xAI announcement, xAI Voice APIs docs)
The practical read is simple:
xAI is no longer only selling “talk to Grok.” It is selling separate transcription and synthesis primitives that developers can plug into support tools, accessibility features, audio workflows, and voice-enabled agents.
What xAI actually shipped on April 17
According to xAI’s April 17, 2026 announcement, the company launched:
- Grok Speech to Text (STT)
- Grok Text to Speech (TTS)
These are presented as standalone audio APIs built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. (xAI announcement)
The release is not a vague “audio support” update. The docs and announcement describe concrete product surfaces:
- STT supports batch and streaming transcription
- TTS supports REST and WebSocket generation
- STT is exposed at /v1/stt
- TTS is exposed at /v1/tts
- xAI positions both as part of a broader Voice APIs family alongside the pre-existing Voice Agent API at /v1/realtime (xAI Voice APIs docs, xAI Text to Speech docs, xAI Speech to Text docs)
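The three-endpoint split can be sketched in a few lines. The endpoint paths come from xAI's docs; everything else here, including the base URL, bearer-token auth shape, and the `XAI_API_KEY` environment variable name, is an assumption for illustration rather than official SDK usage.

```python
import os

# Endpoint paths per xAI's Voice APIs docs; base URL and auth header
# shape are assumptions for this sketch.
BASE_URL = "https://api.x.ai"

ENDPOINTS = {
    "stt": "/v1/stt",         # speech-to-text: audio in, text out
    "tts": "/v1/tts",         # text-to-speech: text in, audio out
    "agent": "/v1/realtime",  # pre-existing full-duplex Voice Agent API
}

def build_request(kind: str, api_key: str) -> dict:
    """Assemble the URL and headers for one of the three voice endpoints."""
    if kind not in ENDPOINTS:
        raise ValueError(f"unknown endpoint kind: {kind}")
    return {
        "url": BASE_URL + ENDPOINTS[kind],
        "headers": {"Authorization": f"Bearer {api_key}"},
    }

req = build_request("stt", os.environ.get("XAI_API_KEY", "test-key"))
print(req["url"])  # → https://api.x.ai/v1/stt
```

The point of the sketch is the shape: three narrow endpoints you can call independently, rather than one conversational session you have to thread everything through.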
That separation matters because a lot of product teams do not need a fully conversational voice agent. They need one of these instead:
- reliable transcription for meetings, calls, or uploads
- speech generation for product narration or accessibility
- audio I/O as a component inside an existing app workflow
Why the release matters for developers
The interesting part is not that xAI now has audio. Most serious AI platforms already do.
The more important shift is that xAI is making its voice stack modular.
Before this, the cleaner way to think about xAI audio was “realtime voice agent.” After April 17, the better framing is “separate APIs for speech input, speech output, and full duplex voice agents.” That is a more useful product structure for actual software teams.
If you are building:
- support and call-assist tools
- voice notes or meeting capture
- accessibility features
- narration or podcast workflows
- voice-enabled AI agents
this release is easier to adopt than a single monolithic voice endpoint.
It is the same broader shift we have been seeing across developer AI tooling: the products that win are increasingly the ones that expose smaller, composable pieces instead of forcing every workflow through one flagship interface. That is also why workflow scaffolding matters in guides like How to Use OpenAI Agents SDK with MCP and Approvals and How to Use LiteLLM with OpenAI, Claude, and Gemini.
What Grok Speech to Text includes
xAI’s docs describe the Speech to Text API as an audio-to-text service with both REST and streaming modes. The Models and Pricing page lists:
- REST pricing: $0.10 per hour
- Streaming pricing: $0.20 per hour
- support for WAV, MP3, WebM, OGG, and M4A
- multiple languages
- real-time interim results for streaming
- 100 concurrent streaming sessions per team (xAI Speech to Text docs)
The announcement adds a few details that matter in real workflows:
- 25+ language support
- multichannel transcription
- diarization with word-level speaker IDs
- formatting that can normalize spoken entities such as phone numbers and amounts into cleaner structured text (xAI announcement)
That makes the product more interesting than a bare transcript API.
For developer workflows, the useful cases are not just “transcribe audio.” They are:
- turning support calls into structured notes
- extracting cleaner data from spoken business workflows
- powering live transcription in voice apps
- separating speakers without building that layer yourself
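The value of word-level speaker IDs is that speaker-turn grouping becomes trivial on your side. The docs confirm diarization with word-level speaker IDs; the exact response shape below (a `words` list with `speaker` and `text` fields) is a hypothetical stand-in for whatever the API actually returns, so treat it as a sketch of the post-processing, not the real schema.

```python
# Collapse word-level speaker IDs into speaker turns. The input shape
# ("speaker" and "text" fields per word) is a hypothetical example of a
# diarized STT response, not the documented schema.
def words_to_turns(words: list[dict]) -> list[dict]:
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + w["text"]
        else:
            # Speaker changed: start a new turn.
            turns.append({"speaker": w["speaker"], "text": w["text"]})
    return turns

sample = [
    {"speaker": 0, "text": "Hi,"},
    {"speaker": 0, "text": "thanks"},
    {"speaker": 1, "text": "No"},
    {"speaker": 1, "text": "problem."},
]
print(words_to_turns(sample))
# → [{'speaker': 0, 'text': 'Hi, thanks'}, {'speaker': 1, 'text': 'No problem.'}]
```

This is the layer you would otherwise have to build yourself on top of a bare transcript API.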
What Grok Text to Speech includes
xAI’s Text to Speech docs describe a separate API for turning text into spoken audio with:
- 5 voices
- REST and WebSocket output
- formats including MP3, WAV, PCM, mu-law, and A-law
- a 15,000-character limit per unary request
- support for 20 listed languages via BCP-47 codes
- speech tags for delivery control such as pauses, emphasis, whispering, laughter, and pacing changes (xAI Text to Speech docs)
The available voices are named eve, ara, rex, sal, and leo, with xAI positioning them for different tones such as upbeat, conversational, professional, balanced, and authoritative. (xAI Text to Speech docs)
Pricing is also clear:
- $4.20 per 1 million characters for TTS (xAI announcement, xAI Models and Pricing docs)
That part matters because voice pricing often gets murky fast. Here, the API shape is unusually legible: pay by audio hour for transcription, pay by character for synthesis.
The real differentiation is workflow fit, not just benchmarks
xAI’s announcement includes benchmark claims for STT against vendors such as ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, video or podcasts, and telephony. Those are worth noting, but they should not be the main reason you evaluate the product. (xAI announcement)
The stronger reason to care is workflow fit.
The April 17 release gives developers three separate ways to use xAI audio:
- Speech to Text when input audio is the main problem
- Text to Speech when output voice is the main problem
- Voice Agent API when you need a live speech-to-speech loop with tools like web search and function calling (xAI Voice APIs docs, xAI Models and Pricing docs)
That is a more credible platform shape than “one voice demo that does everything.”
It also means teams can adopt xAI audio incrementally. You can start with transcription or synthesis first, then add a realtime agent layer later if you actually need it.
One useful nuance: this is not the same thing as the Voice Agent API
This point is easy to miss.
xAI already had a Voice Agent API available before this release. The docs describe it as a realtime voice interface with tool use and pricing of $0.05 per minute. (xAI Models and Pricing docs, xAI Voice APIs docs)
The April 17 story is different:
- the new APIs are standalone
- they target single-purpose audio workloads
- they avoid forcing every use case through a conversational session model
That makes them easier to slot into existing products where you already have your own orchestration layer.
Where these APIs fit in real workflows
The best use cases look fairly practical.
Grok STT makes sense for:
- customer support transcription
- meeting or interview capture
- speaker-separated call analytics
- voice note ingestion
- pre-processing audio before sending text into an agent framework
Grok TTS makes sense for:
- accessibility and read-aloud features
- narrated product flows
- dynamic voice responses inside apps
- podcast or video narration tooling
- outbound voice layers for agents
The full xAI voice stack makes more sense when:
- you need speech in and speech out
- tool invocation matters during the conversation
- you want a realtime assistant rather than a one-way media API
This is the distinction developers should focus on. Not every voice feature needs a voice agent. Sometimes you just need solid transcription or controllable output audio.
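The decision above reduces to a very small selector. The endpoint paths are from xAI's docs; the routing logic simply encodes the article's own rule of thumb and is illustrative, not an official recommendation.

```python
def pick_endpoint(needs_speech_in: bool, needs_speech_out: bool,
                  live_conversation: bool = False) -> str:
    """Map workflow needs to one of the three xAI voice endpoints,
    following the rule of thumb: only reach for the full agent when you
    actually need a live speech-to-speech loop."""
    if live_conversation and needs_speech_in and needs_speech_out:
        return "/v1/realtime"  # full-duplex Voice Agent API
    if needs_speech_in:
        return "/v1/stt"       # transcription only
    if needs_speech_out:
        return "/v1/tts"       # synthesis only
    raise ValueError("no audio requirement; a voice endpoint is not needed")

# A meeting-capture tool needs audio in, nothing spoken back:
print(pick_endpoint(needs_speech_in=True, needs_speech_out=False))  # → /v1/stt
```

A tool that both transcribes and narrates but never holds a live conversation would still call /v1/stt and /v1/tts separately rather than paying for a realtime session.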
Should developers switch immediately?
Not automatically.
If you already have a stable audio stack with strong domain tuning, migration costs may outweigh the upside. The better question is whether xAI now deserves a place on your shortlist.
For many teams, the answer is yes, because the April 17 release finally gives xAI a more standard developer-facing audio shape:
- standalone endpoints
- straightforward pricing
- batch and streaming modes
- multilingual support
- controllable speech output
- a clear path up to a full voice agent product
That is enough to make it worth testing, especially if you are already evaluating xAI elsewhere in your stack.
Bottom line
The April 17, 2026 Grok audio release matters because xAI stopped treating voice as only a bundled agent experience.
By splitting the stack into Speech to Text, Text to Speech, and the existing Voice Agent API, xAI now has a more usable audio platform for developers building real products.
That does not make Grok the default choice for every voice workload. But it does make xAI materially more relevant for teams building:
- voice-enabled agents
- support tooling
- accessibility features
- transcription pipelines
- audio-first product experiences
The right takeaway is not “xAI added another feature.”
It is that xAI now has a more credible developer audio stack than it did a week ago.