Blockchain CouncilGlobal Technology Council

Sarvam Bulbul V3

Michael Willson

Sarvam Bulbul V3 is a text-to-speech model built for Indian languages with one job: produce natural, expressive speech that holds up in real products, not just polished demos. It is positioned as production-ready for voice agents and app experiences where latency, consistency, and language handling actually matter. If you are building or evaluating systems like this, an AI certification helps because voice AI is equal parts model quality, API design, and deployment discipline.

What is Bulbul V3

Bulbul V3 is the latest “Bulbul” generation of Indic TTS, announced in early February 2026. The emphasis is on naturalness and expressiveness, with reliability as a first-class goal. That last part is important. Many TTS systems can sound good in a controlled example, then fall apart when faced with messy inputs like code-mixed text, numbers, abbreviations, and long prompts.

Bulbul V3 is framed as a voice model intended for real-world voice agents, customer support flows, education apps, and any experience where users will notice even small unnatural pauses or robotic prosody. The positioning is not “studio voiceover.” It is “speech you can ship.”

How to access it

Bulbul V3 is available through a Text to Speech API and also via a dashboard UI for generating speech. This matters for two different user types.

Developers typically want an API-first workflow where TTS is generated on demand, streamed, cached, or pre-generated at scale. Product and content teams often need a dashboard experience to try voices, validate pronunciations, and quickly generate clips for prototypes without touching code.
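For the caching case, one common approach is to derive a cache key from the text plus every setting that changes the audio. The sketch below is illustrative only: the function name and the field names (speaker, pace, sample_rate) are assumptions made for this example, not names taken from the Sarvam API reference.

```python
import hashlib
import json


def tts_cache_key(text: str, language_code: str, speaker: str,
                  model: str = "bulbul:v3", pace: float = 1.0,
                  sample_rate: int = 24000) -> str:
    """Return a stable hex digest identifying one synthesis request.

    All field names here are hypothetical; adapt them to the real request schema.
    """
    settings = {
        "text": text,
        "language_code": language_code,
        "speaker": speaker,
        "model": model,
        "pace": pace,
        "sample_rate": sample_rate,
    }
    # Sorted keys make the serialization deterministic, so identical
    # settings always produce the same digest.
    serialized = json.dumps(settings, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

Two requests with the same text and settings map to the same key, so the audio can be served from storage instead of being synthesized again. Changing any setting, such as pace, produces a different key.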

A limited-time free access window was promoted publicly for “the rest of the month” at the time of announcement, which is a common adoption pattern for new voice models. It gives teams a chance to test realistic workloads before committing to paid volume.

Languages supported

Sarvam Bulbul V3 supports 11 languages via the TTS API:

Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Punjabi, Odia, plus English with an Indian accent.

This is a practical coverage set for voice agents in India because it spans major language families and a wide geographic footprint. Including Indian-accent English also matters, since many real scripts and user journeys mix English terms into otherwise Indic speech.

Voices and speech quality claims

Sarvam Bulbul V3 is described as offering 30+ voices and high-quality natural speech synthesis for Indian languages. A larger voice catalog is not just about variety. In production, it helps teams choose voices aligned to brand, region, and use case. A banking assistant voice and an education tutor voice should not sound identical unless the product wants that uncanny sameness.

The founder's public communications also reference an external blind listening study, run by Josh Talks, that compared Bulbul V3 with other TTS systems. The key point is not the marketing line. It is that subjective quality in TTS is often validated through blind tests, because raw metrics rarely capture human perception of naturalness.

Model identifiers

In the REST documentation for TTS conversion, two models are explicitly listed:

bulbul:v3 is the latest model with improved quality, 30+ voices, and temperature control.

bulbul:v2 is the legacy option with pitch and loudness controls.

This split is useful when you care about fine-grained acoustic knobs. V3 focuses on quality and expressiveness with a simpler control surface, while V2 keeps explicit pitch and loudness parameters for teams that built pipelines around those settings.

Controls and parameters

Bulbul V3 supports pace control with a range from 0.5 to 2.0. This is a practical range for production. Slower pacing improves intelligibility for educational or accessibility use cases. Faster pacing reduces audio duration for navigation prompts and routine confirmations.

Bulbul V3 also includes temperature control. In TTS, temperature usually affects variation in prosody and expression. In product terms, it can help you find a balance between “consistent” and “too monotone,” depending on whether your application needs strict repeatability or a more human feel.

The documentation also calls out an important limitation: pitch and loudness parameters are not supported in V3. If a product workflow depends on those, V2 remains relevant.

Automatic preprocessing is enabled by default. This is a subtle but crucial production feature. Preprocessing can normalize punctuation, spacing, special symbols, and other text artifacts that make models stumble. When a model is intended for voice agents, preprocessing usually reduces the number of weird mispronunciations that show up only after launch.
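The constraints in this section can be checked on the client before a request is sent. The following is a minimal sketch, not the official client: the payload field names (text, target_language_code, pace, temperature, enable_preprocessing) are assumptions for illustration and should be confirmed against the REST documentation. Only the pace range, the V3 model identifier, the preprocessing default, and the lack of pitch and loudness support come from the details above.

```python
from typing import Any, Dict, Optional

PACE_MIN, PACE_MAX = 0.5, 2.0  # pace range stated for Bulbul V3


def build_v3_payload(text: str, language_code: str, pace: float = 1.0,
                     temperature: Optional[float] = None,
                     enable_preprocessing: bool = True,
                     pitch: Optional[float] = None,
                     loudness: Optional[float] = None) -> Dict[str, Any]:
    """Build a request body for bulbul:v3 (field names are hypothetical)."""
    # V3 does not support pitch or loudness, so fail early instead of
    # sending a request that the service would reject or ignore.
    if pitch is not None or loudness is not None:
        raise ValueError("pitch and loudness are not supported in bulbul:v3")
    if not PACE_MIN <= pace <= PACE_MAX:
        raise ValueError(f"pace must be between {PACE_MIN} and {PACE_MAX}")
    payload: Dict[str, Any] = {
        "model": "bulbul:v3",
        "text": text,
        "target_language_code": language_code,
        "pace": pace,
        "enable_preprocessing": enable_preprocessing,
    }
    # The valid temperature range is not given here, so the value is
    # passed through unvalidated and only when the caller sets it.
    if temperature is not None:
        payload["temperature"] = temperature
    return payload
```

Failing early on unsupported parameters makes a V2-to-V3 migration easier to debug, because the error appears in the calling code rather than in the audio.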

Audio sample rates

Sarvam Bulbul V3 defaults to a sample rate of 24,000 Hz and supports 8,000, 16,000, 22,050, and 24,000 Hz. The REST documentation additionally lists higher sample rates of 32,000, 44,100, and 48,000 Hz.

This range covers most production needs:

8 kHz and 16 kHz are common for telephony, call centers, and voice bots over phone networks.

22,050 Hz and 24,000 Hz are typical for app experiences and general speech playback.

44,100 Hz and 48,000 Hz are standard for high-quality audio pipelines and media workflows.

Having multiple sample rates means teams can avoid unnecessary resampling, which can add artifacts and operational complexity.
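One way to avoid resampling is to pick the rate once per delivery channel. The sketch below assumes a set of hypothetical channel names (telephony, wideband_telephony, app, media). The rate values are the ones listed above, including the higher rates from the REST documentation.

```python
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000, 32000, 44100, 48000)
DEFAULT_SAMPLE_RATE = 24000

# Hypothetical channel names, mapped to the typical rates described above.
CHANNEL_SAMPLE_RATES = {
    "telephony": 8000,
    "wideband_telephony": 16000,
    "app": 24000,
    "media": 48000,
}


def sample_rate_for(channel: str) -> int:
    """Return the sample rate for a channel, or the default if unknown."""
    return CHANNEL_SAMPLE_RATES.get(channel, DEFAULT_SAMPLE_RATE)


def validate_sample_rate(rate: int) -> int:
    """Reject rates that are not in the listed set."""
    if rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {rate}")
    return rate
```

Requesting audio at the channel's native rate means the telephony stack or media pipeline can play it back without an extra conversion step.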

Input limits and long prompts

Bulbul V3 supports a maximum input length of 2,500 characters. The earlier V2 limit is 1,500 characters.

This change matters for real scripts. Voice agents often need to read longer policy statements, step-by-step instructions, or multi-sentence explanations. A higher character cap reduces the need to split text into chunks, which can introduce unnatural breaks if stitching is not handled carefully.
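When a script still exceeds the limit, splitting at sentence boundaries tends to keep the breaks less noticeable than cutting mid-sentence. The following is a rough sketch under two assumptions: that the limit is counted in Unicode code points (Python's len), which may differ from how the service counts characters, and that sentences end with ".", "!", "?", or the danda "।" followed by whitespace.

```python
import re
from typing import List

V3_MAX_CHARS = 2500  # maximum input length stated for Bulbul V3


def chunk_text(text: str, max_chars: int = V3_MAX_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio joined in order. How much silence to insert between clips is a product decision that this sketch does not address.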

Code-mixed text support

The documentation explicitly calls out code-mixed text support, meaning English mixed with Indic languages. This is not optional in Indian user experiences. People frequently mix English nouns, app names, product terms, and numbers into otherwise native-language sentences.

In production, code-mixing support reduces failures like awkward phonetics for English terms or incorrect handling of mixed punctuation and abbreviations.

Language parameter requirements

Every TTS request requires a target language code, using BCP-47 language tags. This requirement is not just formality. It enables language-specific handling such as:

Correct reading of numbers.

Handling abbreviations and special characters.

Applying language-appropriate normalization rules.

In voice products, incorrect language selection is one of the fastest ways to get embarrassing output. The mandatory language parameter forces the caller to be explicit, which improves reliability.
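A lookup table keeps the language choice explicit in application code. The tags in this sketch are assumptions: they are built from ISO 639-1 language codes plus the IN region subtag, not copied from the Sarvam API reference, which may spell some tags differently (Odia is worth checking in particular). The function name is hypothetical.

```python
# Assumed tags built as <ISO 639-1 language>-IN; verify each one against
# the API reference before use, since a vendor may use different spellings.
LANGUAGE_CODES = {
    "hindi": "hi-IN",
    "bengali": "bn-IN",
    "tamil": "ta-IN",
    "telugu": "te-IN",
    "gujarati": "gu-IN",
    "kannada": "kn-IN",
    "malayalam": "ml-IN",
    "marathi": "mr-IN",
    "punjabi": "pa-IN",
    "odia": "or-IN",
    "english": "en-IN",
}


def resolve_language_code(user_preference: str) -> str:
    """Map a user's language preference to a language tag, or raise."""
    key = user_preference.strip().lower()
    try:
        return LANGUAGE_CODES[key]
    except KeyError:
        # Refuse to guess: a wrong language produces wrong normalization.
        raise ValueError(f"unsupported language: {user_preference!r}") from None
```

Raising on an unknown language, rather than silently falling back to a default, keeps a bad preference value from reaching the synthesis request.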

Pricing

Public pricing lists Bulbul v3 Beta at ₹30 per 10,000 characters. Bulbul v2 is listed at ₹15 per 10,000 characters.

This pricing structure implies V3 is positioned as the higher-quality, higher-cost option, while V2 remains a lower-cost alternative for simpler workloads or legacy integrations. For teams, the decision becomes a straightforward trade-off: pay more for better speech quality and modern behavior, or pay less if the older control knobs and cheaper throughput are more important.
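The trade-off is easy to put into numbers. The sketch below uses the listed prices and assumes billing is strictly linear in characters, with no minimums, rounding rules, or volume discounts. That is a simplification, so check the pricing page for the actual terms.

```python
from decimal import Decimal

# Listed prices in rupees per 10,000 characters.
PRICE_PER_10K_INR = {
    "bulbul:v3": Decimal("30"),
    "bulbul:v2": Decimal("15"),
}


def estimate_cost_inr(total_characters: int, model: str) -> Decimal:
    """Estimate cost in rupees, assuming purely linear per-character billing."""
    return PRICE_PER_10K_INR[model] * total_characters / Decimal(10000)
```

For a workload of one million characters a month, this gives ₹3,000 on V3 and ₹1,500 on V2, so the quality upgrade costs ₹1,500 a month at that volume.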

A Tech certification can help teams evaluate those trade-offs realistically by tying model and API choices to deployment constraints like latency, caching strategy, and telephony compatibility.

Real-world examples

A multilingual customer support voice agent can use Sarvam Bulbul V3 to speak Hindi, Tamil, and Bengali in the same product while keeping accent and cadence natural per language. With required BCP-47 language codes, the application can route language selection based on user preference and avoid mixing rules accidentally.

An education app can slow pace to improve comprehension for language learners, while using higher sample rates for a cleaner listening experience on mobile speakers.

A commerce app can generate order confirmations and delivery instructions where code-mixed text is unavoidable, such as English product names inside Marathi sentences, without turning the output into a phonetic trainwreck.

On the business side, a Marketing certification helps teams design voice personas, scripts, and user journeys that sound credible and consistent rather than “generic bot reading text.” That perspective matters when voice is part of brand identity.

Conclusion

Bulbul V3 is a deliberate push toward Indic TTS that is built to be deployed, not admired. The combination of 11-language coverage, 30+ voices, pace and temperature controls, code-mixing support, larger input limits, and flexible sample rates targets the real problems teams hit after launch. Its pricing and “Beta” label also signal a clear product reality: you can adopt it now, but you should test it against your own scripts, channels, and latency budgets before scaling.

If you are building voice experiences for India, Bulbul V3 is not interesting because it exists. It is interesting because its API and feature choices are aimed at making voice agents sound consistent, natural, and usable in the messy conditions of real users and real text.
