Text-to-Speech

POST /v1/audio/speech

Follow the quickstart guide to get your own API key.

Replace $YOUR_API_KEY with the API key you generated in the previous step, $AUDIO_SAMPLE_URL with the URL of the voice sample, and $OUTPUT_PATH with the path of the output file.
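As a rough illustration, the request could be built and sent from Python with only the standard library. This is a sketch under assumptions: the base URL https://api.example.com, the Bearer-token Authorization header, the JSON request body, and the helper names build_speech_request and synthesize_to_file are placeholders that do not come from this page. Check the quickstart guide for the actual host and authentication scheme.

```python
import json
import urllib.request

# Assumption: placeholder host; the real base URL is not stated on this page.
BASE_URL = "https://api.example.com"


def build_speech_request(api_key, model, text, **options):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "text": text}
    payload.update(options)  # optional fields such as language, format, speed
    return urllib.request.Request(
        url=BASE_URL + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer-token authentication.
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request, output_path):
    """Send the request and write the binary audio response to output_path."""
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as out:
        out.write(response.read())
```

A call might look like synthesize_to_file(build_speech_request(api_key, model_id, "Hello", language="en"), output_path), where api_key, model_id, and output_path stand in for $YOUR_API_KEY, a model ID available to your account, and $OUTPUT_PATH.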

Request body

model string Required
Model ID used to generate the speech output.

This field accepts either of two model types:

・Voice Create models: convert plain text into natural speech.
・Voice Cloning models: generate speech that mimics a specific speaker's voice.

text string Required
The text you want to convert into speech. Recommended length: up to 200 characters for optimal performance.
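One way to stay within the recommended length is to split longer input into pieces of at most 200 characters and send one request per piece. The helper below is a minimal sketch; the name split_text and the whitespace-based splitting strategy are illustrative choices, not part of the API.

```python
def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking on whitespace where possible."""
    chunks, current = [], ""
    for word in text.split():
        # A single run longer than the limit (common in text written
        # without spaces, such as Chinese) is cut into hard slices.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Hard slicing can cut a sentence in the middle, so for text without spaces it may be preferable to split on punctuation first; that refinement is left out here to keep the sketch short.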

language string Optional Defaults to zh-cn

Language code of the input text, such as zh-cn (Chinese) or en (English). Defaults to zh-cn.

sample_rate integer Optional Defaults to 8000

Audio sample rate in Hz. Common values: 8000, 12000, 16000, 24000, 48000. Defaults to 8000.

format string Optional Defaults to wav

Output audio format. Typically supported formats: wav, mp3, ogg. Defaults to wav.

stream boolean Optional Defaults to false

When set to true, the API returns the generated audio as a stream, so the client can begin receiving and playing audio while it is still being synthesized, which reduces latency. When set to false or omitted, the API returns the full audio file only after synthesis is complete. Defaults to false.
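On the client side, a streamed response is usually consumed in chunks rather than read in one call. The sketch below assumes the response object exposes a file-like read(size) method, as the object returned by urllib.request.urlopen does; this page does not specify how the stream is framed, so treat the chunked reading as an assumption. The name copy_stream is illustrative.

```python
import io


def copy_stream(response, sink, chunk_size=4096):
    """Copy a file-like response to sink chunk by chunk; return bytes written."""
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # an empty read means the stream has ended
            break
        sink.write(chunk)  # a player could consume the chunk here instead
        total += len(chunk)
    return total


# Stand-in for a real HTTP response, so the sketch runs without a network call.
fake_response = io.BytesIO(b"\x00" * 10000)
sink = io.BytesIO()
written = copy_stream(fake_response, sink)
print(written)  # 10000
```

With a real request, sink would typically be a file opened in binary mode or the input of an audio player.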

speed float Optional Defaults to 1.0

Controls the speaking speed of the generated audio. Typical range: 0.0 to 2.0. Defaults to 1.0.
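The optional fields language, sample_rate, format, stream, and speed can be collected and checked locally before a request is sent. The following is a sketch only: the helper name build_options is hypothetical, and because the reference describes the sample rates as "common" and the formats and speed range as "typical", the service may accept values that this strict check rejects.

```python
COMMON_SAMPLE_RATES = (8000, 12000, 16000, 24000, 48000)
TYPICAL_FORMATS = ("wav", "mp3", "ogg")


def build_options(language="zh-cn", sample_rate=8000, audio_format="wav",
                  stream=False, speed=1.0):
    """Return the optional request fields, using the documented defaults."""
    # Local policy: reject anything outside the values listed in the reference.
    if sample_rate not in COMMON_SAMPLE_RATES:
        raise ValueError(f"uncommon sample_rate: {sample_rate}")
    if audio_format not in TYPICAL_FORMATS:
        raise ValueError(f"unlisted format: {audio_format}")
    if not 0.0 <= speed <= 2.0:
        raise ValueError(f"speed outside 0.0 to 2.0: {speed}")
    return {
        "language": language,
        "sample_rate": sample_rate,
        "format": audio_format,
        "stream": bool(stream),
        "speed": float(speed),
    }
```

For example, build_options(language="en", audio_format="mp3", speed=1.2) returns a dict that can be merged into the request body alongside model and text.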

instruct_text string Optional Defaults to ""

A natural language instruction that guides the model's tone, emotion, speaking rate, dialect, or role-playing style. To ensure optimal audio quality, please include at least one of the officially supported keywords listed below.

Supported Tags:

・"Emotion: Happy, Sad, Surprised, Angry, Fearful, Disgusted, Calm, Serious"

・"Speaking Rate: Fast, Very Fast, Slow, Very Slow"

・"Dialect: Cantonese, Sichuanese, Shanghainese, Zhengzhou Dialect, Changsha Dialect, Tianjin Dialect, etc."

・"Role-playing: Mysterious, Fierce, Curious, Elegant, Lonely, Robot, Peppa Pig, etc."

Example values:

・"Speak in a happy tone"

・"Use a very slow and calm voice"

・"Speak like Peppa Pig"

・"Use a Sichuan dialect with a curious tone"

Defaults to "" (no instruction).
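Since audio quality depends on the instruction containing at least one supported keyword, a local check before sending can be useful. The sketch below copies the keywords listed above into a lookup table and reports which ones appear in an instruction. The helper name find_supported_keywords is hypothetical, and the Dialect and Role-playing lists end with "etc." in this reference, so the table is not exhaustive and an empty result does not prove an instruction is unsupported.

```python
import re

SUPPORTED_KEYWORDS = {
    "Emotion": ["Happy", "Sad", "Surprised", "Angry", "Fearful",
                "Disgusted", "Calm", "Serious"],
    "Speaking Rate": ["Fast", "Very Fast", "Slow", "Very Slow"],
    # Open-ended in the reference ("etc."), so these two lists are partial.
    "Dialect": ["Cantonese", "Sichuanese", "Shanghainese", "Zhengzhou Dialect",
                "Changsha Dialect", "Tianjin Dialect"],
    "Role-playing": ["Mysterious", "Fierce", "Curious", "Elegant", "Lonely",
                     "Robot", "Peppa Pig"],
}


def find_supported_keywords(instruct_text):
    """Return (category, keyword) pairs found as whole words, ignoring case."""
    matches = []
    for category, keywords in SUPPORTED_KEYWORDS.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, instruct_text, flags=re.IGNORECASE):
                matches.append((category, keyword))
    return matches
```

For the example value "Use a Sichuan dialect with a curious tone", this literal matching finds only the Curious keyword, because "Sichuan" is not the same string as the listed tag "Sichuanese". Whether the model treats the two as equivalent is not stated here.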

prompt_speech string Optional

The URL of a reference audio file. The model will mimic the voice characteristics (such as tone, timbre, and speaking style) of the provided sample for speech synthesis.

The audio must be accessible via a public URL and should be in standard formats such as .mp3, .wav, etc.

Note: This field only takes effect when the selected model supports Voice Cloning. It has no effect on Voice Create models.

If not provided, the model will use a default voice.
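A voice cloning request body might be assembled as in the sketch below. The helper name build_cloning_payload is hypothetical and no model ID is filled in, since this page does not list any. The check only confirms that prompt_speech is an absolute http(s) URL; it cannot confirm that the file is publicly reachable or in a supported audio format.

```python
from urllib.parse import urlparse


def build_cloning_payload(model, text, prompt_speech_url):
    """Return a request body that includes a reference audio URL."""
    parsed = urlparse(prompt_speech_url)
    # Reject local paths and relative URLs; the service must fetch the sample itself.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("prompt_speech must be an absolute http(s) URL")
    return {
        "model": model,  # must be a model that supports Voice Cloning
        "text": text,
        "prompt_speech": prompt_speech_url,
    }
```

Here prompt_speech_url corresponds to the $AUDIO_SAMPLE_URL placeholder.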

Returns

When using curl, if you do not specify the --output option, the binary response will be printed to your terminal, potentially causing display issues.

To avoid this, pass --output $OUTPUT_PATH to save the response to a file.