TL;DR

If you have been searching for the new Grok speech to text or Grok text to speech APIs, the short answer is this:

On April 17, 2026, xAI turned Grok’s audio stack into two standalone APIs for developers: one for speech-to-text and one for text-to-speech.

That matters more than the product copy suggests.

xAI already had a Voice Agent API for real-time speech conversations. The April 17 release breaks that stack into narrower building blocks that are easier to use when you do not want a full live agent loop. (xAI announcement, xAI Voice APIs docs)

The practical read is simple:

xAI is no longer only selling “talk to Grok.” It is selling separate transcription and synthesis primitives that developers can plug into support tools, accessibility features, audio workflows, and voice-enabled agents.

What xAI actually shipped on April 17

According to xAI’s April 17, 2026 announcement, the company launched:

  • Grok Speech to Text (STT)
  • Grok Text to Speech (TTS)

These are presented as standalone audio APIs built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. (xAI announcement)

The release is not a vague “audio support” update. The docs and announcement describe concrete product surfaces:

  • STT supports batch and streaming transcription
  • TTS supports REST and WebSocket generation
  • STT is exposed at /v1/stt
  • TTS is exposed at /v1/tts
  • xAI positions both as part of a broader Voice APIs family alongside the pre-existing Voice Agent API at /v1/realtime (xAI Voice APIs docs, xAI Text to Speech docs, xAI Speech to Text docs)
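
As a rough sketch of how that endpoint layout might be kept in one place in client code, the snippet below maps the three product surfaces to the paths listed above. The base URL is a placeholder and an assumption, not something the article states; take the real host from xAI's docs. The names `ENDPOINTS` and `endpoint_url` are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: placeholder host. The article does not state the base URL,
# so substitute the value from xAI's documentation.
BASE_URL = "https://voice-api.example"

# Paths as listed in the article.
ENDPOINTS = {
    "stt": "/v1/stt",               # Speech to Text
    "tts": "/v1/tts",               # Text to Speech
    "voice_agent": "/v1/realtime",  # pre-existing Voice Agent API
}


def endpoint_url(product: str, base_url: str = BASE_URL) -> str:
    """Return the full URL for one of the three voice products."""
    if product not in ENDPOINTS:
        raise ValueError(
            f"unknown product {product!r}; expected one of {sorted(ENDPOINTS)}"
        )
    return urljoin(base_url, ENDPOINTS[product])
```

Keeping the paths in a single table makes it easier to swap the host or add a product later without touching call sites.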

That separation matters because a lot of product teams do not need a fully conversational voice agent. They need one of these instead:

  • reliable transcription for meetings, calls, or uploads
  • speech generation for product narration or accessibility
  • audio I/O as a component inside an existing app workflow

[Figure: Grok audio API decision map showing when to use speech-to-text, text-to-speech, or the realtime voice agent API]

Why the release matters for developers

The interesting part is not that xAI now has audio. Most serious AI platforms already do.

The more important shift is that xAI is making its voice stack modular.

Before this, the cleanest way to think about xAI audio was “realtime voice agent.” After April 17, the better framing is “separate APIs for speech input, speech output, and full-duplex voice agents.” That is a more useful product structure for actual software teams.

If you are building:

  • support and call-assist tools
  • voice notes or meeting capture
  • accessibility features
  • narration or podcast workflows
  • voice-enabled AI agents

this release is easier to adopt than a single monolithic voice endpoint.

It is the same broader shift we have been seeing across developer AI tooling: the products that win are increasingly the ones that expose smaller, composable pieces instead of forcing every workflow through one flagship interface. That is also why workflow scaffolding matters, as covered in “How to Use OpenAI Agents SDK with MCP and Approvals” and “How to Use LiteLLM with OpenAI, Claude, and Gemini.”

What Grok Speech to Text includes

xAI’s docs describe the Speech to Text API as an audio-to-text service with both REST and streaming modes. The model/pricing page lists:

  • REST pricing: $0.10 per hour
  • Streaming pricing: $0.20 per hour
  • support for WAV, MP3, WebM, OGG, and M4A
  • multiple languages
  • real-time interim results for streaming
  • 100 concurrent streaming sessions per team (xAI Speech to Text docs)
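
To make the per-hour pricing concrete, here is a minimal cost estimator using the two rates quoted above. It is only arithmetic on the listed prices: it assumes billing is strictly proportional to audio duration and ignores any rounding, minimums, or discounts that xAI's actual billing may apply. The function name `stt_cost_usd` is hypothetical.

```python
from decimal import Decimal

# Per-hour rates as quoted in the article (USD).
STT_RATE_PER_HOUR = {
    "rest": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def stt_cost_usd(audio_seconds: float, mode: str = "rest") -> Decimal:
    """Estimate transcription cost for a given audio duration.

    Assumption: cost scales linearly with duration, with no rounding
    or minimum charge.
    """
    if mode not in STT_RATE_PER_HOUR:
        raise ValueError(f"mode must be one of {sorted(STT_RATE_PER_HOUR)}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return hours * STT_RATE_PER_HOUR[mode]
```

Under those assumptions, 500 hours of streamed audio comes to 500 × $0.20 = $100, and the same audio through the REST endpoint comes to $50.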

The announcement adds a few details that matter in real workflows:

  • 25+ language support
  • multichannel transcription
  • diarization with word-level speaker IDs
  • formatting that can normalize spoken entities such as phone numbers and amounts into cleaner structured text (xAI announcement)

That makes the product more interesting than a bare transcript API.

For developer workflows, the useful cases are not just “transcribe audio.” They are:

  • turning support calls into structured notes
  • extracting cleaner data from spoken business workflows
  • powering live transcription in voice apps
  • separating speakers without building that layer yourself
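
Word-level speaker IDs are most useful once they are collapsed into readable turns. The sketch below shows one way to do that. The input shape (a list of dicts with `text` and `speaker` keys) is a hypothetical stand-in, not the documented response format, so adapt the field names to whatever the API actually returns.

```python
from itertools import groupby


def speaker_turns(words):
    """Collapse consecutive words from the same speaker into turns.

    `words` is assumed to be a list of dicts with hypothetical keys
    "text" and "speaker", in spoken order.
    Returns a list of (speaker, utterance) tuples.
    """
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


sample = [
    {"text": "Hi", "speaker": 0},
    {"text": "there", "speaker": 0},
    {"text": "Hello", "speaker": 1},
    {"text": "again", "speaker": 0},
]
turns = speaker_turns(sample)
```

Because `groupby` only merges adjacent items, a speaker who talks twice with an interruption in between yields two separate turns, which is usually what call notes need.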

What Grok Text to Speech includes

xAI’s Text to Speech docs describe a separate API for turning text into spoken audio with:

  • 5 voices
  • REST and WebSocket output
  • formats including MP3, WAV, PCM, mu-law, and A-law
  • a 15,000-character limit per unary request
  • support for 20 listed languages via BCP-47 codes
  • speech tags for delivery control such as pauses, emphasis, whispering, laughter, and pacing changes (xAI Text to Speech docs)
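
The 15,000-character limit means longer documents have to be split before synthesis. A minimal sketch of one approach follows: cut at the last space that fits and fall back to a hard cut when a single run of text has no spaces. The helper name `chunk_text` is hypothetical, and a production version would probably prefer sentence boundaries so the audio does not break mid-phrase.

```python
MAX_CHARS = 15_000  # per-request limit quoted in the article


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Prefers to cut at a space; hard-splits only when no space is
    available within the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        # Last space at index <= limit, so the piece before it fits.
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard split
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio segments concatenated in order.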

The available voices are named eve, ara, rex, sal, and leo, with xAI positioning them for different tones such as upbeat, conversational, professional, balanced, and authoritative. (xAI Text to Speech docs)
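
A client can catch obviously bad requests before they reach the network by checking the voice name and text length locally. The sketch below does that and serializes a request body. The voice names and the character limit come from the docs as summarized above, but the JSON field names (`text`, `voice`, `language`) are assumptions for illustration and may not match the real request schema.

```python
import json

VOICES = ("eve", "ara", "rex", "sal", "leo")  # voice names from the article
MAX_CHARS = 15_000                            # per-request character limit


def build_tts_payload(text: str, voice: str = "eve", language: str = "en") -> str:
    """Validate inputs and return a JSON request body.

    Assumption: the field names below are hypothetical placeholders.
    `language` is expected to be a BCP-47 code such as "en".
    """
    if voice not in VOICES:
        raise ValueError(f"voice must be one of {VOICES}")
    if not text:
        raise ValueError("text must not be empty")
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters; split it first")
    return json.dumps({"text": text, "voice": voice, "language": language})
```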

Pricing is also clear. Voice pricing often gets murky fast, but here the API shape is unusually legible: pay by audio hour for transcription, pay by character for synthesis.

The real differentiation is workflow fit, not just benchmarks

xAI’s announcement includes benchmark claims for STT against vendors such as ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, video or podcasts, and telephony. Those are worth noting, but they should not be the main reason you evaluate the product. (xAI announcement)

The stronger reason to care is workflow fit.

The April 17 release gives developers three separate ways to use xAI audio:

  • Speech to Text when input audio is the main problem
  • Text to Speech when output voice is the main problem
  • Voice Agent API when you need a live speech-to-speech loop with tools like web search and function calling (xAI Voice APIs docs, xAI Models and Pricing docs)

That is a more credible platform shape than “one voice demo that does everything.”

It also means teams can adopt xAI audio incrementally. You can start with transcription or synthesis first, then add a realtime agent layer later if you actually need it.

One useful nuance: this is not the same thing as the Voice Agent API

This point is easy to miss.

xAI already had a Voice Agent API available before this release. The docs describe it as a realtime voice interface with tool use and pricing of $0.05 per minute. (xAI Models and Pricing docs, xAI Voice APIs docs)
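
For budgeting, the per-minute rate is easier to reason about as an hourly figure. The snippet below is plain arithmetic on the quoted $0.05 per minute; it assumes linear billing with no rounding or minimums, and the function name is hypothetical.

```python
from decimal import Decimal

# Rate as quoted in the article (USD per minute of Voice Agent usage).
VOICE_AGENT_RATE_PER_MINUTE = Decimal("0.05")


def voice_agent_cost_usd(minutes: float) -> Decimal:
    """Estimate Voice Agent cost, assuming strictly linear billing."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return Decimal(str(minutes)) * VOICE_AGENT_RATE_PER_MINUTE


hourly = voice_agent_cost_usd(60)  # 60 x 0.05 = 3.00 USD per hour
```

At that rate an hour of live conversation costs about $3, which is a different cost profile from per-hour transcription and another reason to check whether a workload really needs the agent loop.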

The April 17 story is different:

  • the new APIs are standalone
  • they target single-purpose audio workloads
  • they avoid forcing every use case through a conversational session model

That makes them easier to slot into existing products where you already have your own orchestration layer.

Where these APIs fit in real workflows

The best use cases look fairly practical.

Grok STT makes sense for:

  • customer support transcription
  • meeting or interview capture
  • speaker-separated call analytics
  • voice note ingestion
  • pre-processing audio before sending text into an agent framework

Grok TTS makes sense for:

  • accessibility and read-aloud features
  • narrated product flows
  • dynamic voice responses inside apps
  • podcast or video narration tooling
  • outbound voice layers for agents

The full xAI voice stack makes more sense when:

  • you need speech in and speech out
  • tool invocation matters during the conversation
  • you want a realtime assistant rather than a one-way media API

This is the distinction developers should focus on. Not every voice feature needs a voice agent. Sometimes you just need solid transcription or controllable output audio.
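
That distinction can be written down as a small routing helper. This is only a sketch of the decision logic described in this article, not an official selection guide, and the function and product names are hypothetical.

```python
def pick_apis(audio_in: bool, audio_out: bool, live_conversation: bool = False) -> list[str]:
    """Suggest which xAI voice products fit a feature.

    - A live speech-to-speech loop points at the Voice Agent API.
    - Otherwise, use STT for audio input and TTS for audio output,
      chaining both when a non-realtime feature needs each direction.
    """
    if live_conversation:
        return ["voice_agent"]
    apis = []
    if audio_in:
        apis.append("stt")
    if audio_out:
        apis.append("tts")
    return apis
```

For example, a meeting-capture feature maps to `pick_apis(True, False)`, a read-aloud feature to `pick_apis(False, True)`, and a realtime assistant to `pick_apis(True, True, live_conversation=True)`.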

Should developers switch immediately?

Not automatically.

If you already have a stable audio stack with strong domain tuning, migration costs may outweigh the upside. The better question is whether xAI now deserves a place on your shortlist.

For many teams, the answer is yes, because the April 17 release finally gives xAI a more standard developer-facing audio shape:

  • standalone endpoints
  • straightforward pricing
  • batch and streaming modes
  • multilingual support
  • controllable speech output
  • a clear path up to a full voice agent product

That is enough to make it worth testing, especially if you are already evaluating xAI elsewhere in your stack.

Bottom line

The April 17, 2026 Grok audio release matters because xAI stopped treating voice as only a bundled agent experience.

By splitting the stack into Speech to Text, Text to Speech, and the existing Voice Agent API, xAI now has a more usable audio platform for developers building real products.

That does not make Grok the default choice for every voice workload. But it does make xAI materially more relevant for teams building:

  • voice-enabled agents
  • support tooling
  • accessibility features
  • transcription pipelines
  • audio-first product experiences

The right takeaway is not “xAI added another feature.”

It is that xAI now has a more credible developer audio stack than it did a week ago.

FAQ

What are the Grok Speech to Text and Text to Speech APIs?

xAI launched two standalone APIs on April 17, 2026: one for speech-to-text (transcription) and one for text-to-speech (voice generation). These are separate endpoints from the main Grok chat API, designed for developers who need audio processing in their applications.

How do the Grok audio APIs compare to OpenAI Whisper and TTS?

Both OpenAI and xAI offer speech-to-text and text-to-speech APIs. For Grok STT, xAI’s announcement lists 25+ languages, multichannel transcription, and diarization with word-level speaker IDs, in both batch and streaming modes. OpenAI’s Whisper is widely adopted with strong community tooling. The choice depends on pricing, latency, language support, and integration with your existing stack.

What languages do the Grok speech APIs support?

xAI’s announcement cites support for 25+ languages in STT, and the Text to Speech docs list 20 languages selected via BCP-47 codes. Check xAI’s official documentation for the current lists, as support may expand over time and TTS coverage may differ from STT.

Can I use Grok speech APIs with the Grok chat API?

The speech APIs are separate endpoints from the Grok chat API. You can chain them together in your application logic — for example, transcribing audio with STT, processing the text through Grok chat, then converting the response back to speech with TTS — but they are billed and managed independently.
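
The chaining pattern described above can be sketched without committing to any particular client library. In the example below the three stages are passed in as plain callables, so the real STT, chat, and TTS calls (whose request formats are not shown in this article) can be swapped in later; the stub functions are hypothetical stand-ins used only to show the data flow.

```python
from typing import Callable


def voice_round_trip(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    chat: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run audio -> text -> reply text -> audio through three stages."""
    user_text = transcribe(audio)
    reply_text = chat(user_text)
    return synthesize(reply_text)


# Stand-in stages for illustration; replace with real API calls.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = voice_round_trip(b"hello", fake_transcribe, fake_chat, fake_synthesize)
```

Injecting the stages also makes the orchestration easy to unit test, and it keeps the three services independently replaceable.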

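Sources

  • xAI announcement (April 17, 2026)
  • xAI Voice APIs docs
  • xAI Speech to Text docs
  • xAI Text to Speech docs
  • xAI Models and Pricing docs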

When to Use

Use this article when evaluating xAI’s audio APIs for voice assistants, transcription workflows, call summarization, or multimodal product prototypes. It is most useful as a checklist for comparing STT/TTS coverage, latency, pricing, language support, voice quality, and integration fit against OpenAI, Google, or other audio providers.

When Not to Use

Do not use this article as a substitute for a live API evaluation. Audio quality depends heavily on language, accent, noise, microphone quality, latency target, and voice UX expectations. Test representative samples, review data-handling terms, and avoid sending sensitive recordings until retention and access controls are clear.
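
When testing representative samples, one common way to score transcripts is word error rate (WER): the word-level edit distance between a reference transcript and the API output, divided by the reference length. The sketch below is a plain implementation of that standard metric, not something taken from xAI's docs or benchmarks, and it normalizes only by lowercasing and splitting on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Running the same recordings through each candidate provider and comparing WER on your own audio gives a more relevant signal than vendor benchmark tables, though punctuation and number formatting should be normalized consistently first.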