Speech Recognition API Reference
Deepgram's Speech Recognition API gives you streamlined access to automatic transcription from Deepgram's off-the-shelf and trained speech recognition models. Transcription is fast, understands nearly every audio format available, and can be customized with various query parameters and general-purpose or custom-trained AI models.
Endpoint
https://brain.deepgram.com/v2
Authentication
All requests to the API should include a basic Authorization header that references the Base64-encoded username (or email address you used to sign up) and password of your Deepgram account. For example, for user gandalf with password mellon, the Base64-encoded value of gandalf:mellon is Z2FuZGFsZjptZWxsb24=, so Gandalf's requests to the Deepgram API should all include the following header: Authorization: Basic Z2FuZGFsZjptZWxsb24=.
Security Scheme Type | HTTP |
---|---|
HTTP Authorization Scheme | basic |
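For example, a minimal authenticated request in Python (a sketch assuming the requests library, a hypothetical local conference.wav file, and the standard /listen path referenced later in this document):

```python
import requests

# Pre-recorded transcription endpoint (base URL documented above; the
# /listen path is referenced later on this page).
API_URL = "https://brain.deepgram.com/v2/listen"

with open("conference.wav", "rb") as audio:  # hypothetical audio file
    response = requests.post(
        API_URL,
        auth=("gandalf", "mellon"),  # requests Base64-encodes these for Basic auth
        headers={"Content-Type": "audio/wav"},
        params={"model": "meeting", "punctuate": "true"},
        data=audio,
    )

print(response.json())
```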
Transcription
High-speed transcription of either pre-recorded or streaming audio.
Deepgram supports over 100 different audio formats and encodings. For example, some of the most common audio formats and encodings we support include MP3, MP4, MP2, AAC, WAV, FLAC, PCM, M4A, Ogg, Opus, and WebM. However, because audio format is largely unconstrained, we recommend testing a small set of audio to ensure compatibility when first working with a new audio source.
Transcribe Pre-recorded Audio
Transcribes the specified audio file.
Query Parameters
model: string
AI model used to process submitted audio. Off-the-shelf Deepgram models include:
- general: (Default) Optimized for everyday audio processing; if you aren't sure which model to select, start here.
- meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
- phonecall: Optimized for low-bandwidth audio phone calls.
- voicemail: Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
- finance: Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
- conversationalai: Optimized to allow artificial intelligence technologies, such as chatbots, to interact with people in a human-like way.
- video: Optimized for audio sourced from videos.
- <custom_id>: You may also use a custom model associated with your account by including its custom_id.
To learn more, see Features: Model.
version: string
Version of the model to use.
Default: latest
Possible values: latest OR <version_id>
To learn more, see Features: Version.
language: string
BCP-47 language tag that hints at the primary spoken language. Language support is optimized for the following language/model combinations:
Dutch
- nl (beta). MODELS: general
English
- en-US: United States (Default). MODELS: general, meeting, phonecall, voicemail, finance, conversationalai, video
- en-AU: Australia (beta). MODELS: general
- en-GB: United Kingdom. MODELS: general, phonecall
- en-IN: India. MODELS: general, phonecall
- en-NZ: New Zealand. MODELS: general
French
- fr. MODELS: general
- fr-CA: Canada. MODELS: general
German
- de. MODELS: general
Hindi
- hi. MODELS: general
Italian
- it (beta). MODELS: general
Japanese
- ja (beta). MODELS: general
Korean
- ko (beta). MODELS: general
Portuguese
- pt. MODELS: general
- pt-BR: Brazil. MODELS: general
Russian
- ru. MODELS: general
Spanish
- es. MODELS: general
- es-419: Latin America (beta). MODELS: general
Swedish
- sv (beta). MODELS: general
Turkish
- tr. MODELS: general
To learn more, see Features: Language.
punctuate: boolean
Indicates whether to add punctuation and capitalization to the transcript. To learn more, see Features: Punctuation.
profanity_filter: boolean
Indicates whether to remove profanity from the transcript. To learn more, see Features: Profanity Filter.
redact: string
Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include:
- pci: Redacts sensitive credit card information, including credit card number, expiration date, and CVV.
- numbers (or true): Aggressively redacts strings of numerals.
- ssn (beta): Redacts social security numbers.
You can send multiple instances in the query string (for example, redact=pci&redact=numbers). When you send multiple values, redaction occurs in the order you specify: in this example, sensitive credit card information would be redacted first, then strings of numerals.
To learn more, see Features: Redaction.
diarize: boolean
Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. To learn more, see Features: Diarization.
ner: boolean
Indicates whether to recognize alphanumeric strings. When set to true, whitespace will be removed between characters identified as part of an alphanumeric string. To learn more, see Features: Named-Entity Recognition (NER).
multichannel: boolean
Indicates whether to transcribe each audio channel independently. When set to true, you will receive one transcript for each channel, which means you can apply a different model to each channel using the model parameter (for example, set model to general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1).
To learn more, see Features: Multichannel.
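For instance, a two-channel call recording could apply a different model per channel (a sketch assuming the /listen path and a hypothetical local stereo file):

```python
import requests

# multichannel=true returns one transcript per channel; the colon-delimited
# model value assigns a model to each channel in order.
params = {"multichannel": "true", "model": "general:phonecall"}

with open("stereo_call.wav", "rb") as audio:  # hypothetical 2-channel file
    data = requests.post(
        "https://brain.deepgram.com/v2/listen",
        auth=("gandalf", "mellon"),
        headers={"Content-Type": "audio/wav"},
        params=params,
        data=audio,
    ).json()

# results.channels holds one ChannelResult per channel (see Response Schema).
for i, channel in enumerate(data["results"]["channels"]):
    print(f"channel {i}: {channel['alternatives'][0]['transcript']}")
```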
alternatives: integer
Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears.
Default: 1
numerals: boolean
Indicates whether to convert numbers from written format (for example, one) to numerical format (for example, 1). Deepgram can format numbers up to 999,999. Converted numbers do not include punctuation; for example, 999,999 would be transcribed as 999999.
To learn more, see Features: Numerals.
search: string
Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in audio rather than text patterns in transcripts because we have found that acoustic pattern matching performs better.
- Can include up to 25 search terms per request.
- Can send multiple instances in the query string (for example, search=speech&search=Friday).
To learn more, see Features: Search.
callback: string
Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id. When it has finished analyzing the audio, it will send a POST request to the provided URL with an appropriate HTTP status code.
- You may embed basic authentication credentials in the callback URL.
- Only ports 80, 443, 8080, and 8443 can be used for callbacks.
To learn more, see Features: Callback.
keywords: string
Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation.
- Can include up to 200 keywords per request.
- Can send multiple instances in the query string (for example, keywords=medicine&keywords=prescription).
- Can request multi-word keywords in a percent-encoded query string (for example, keywords=miracle%20medicine). When Deepgram listens for your supplied keywords, it separates them into individual words, then boosts or suppresses them individually.
- Can append a positive or negative intensifier to boost or suppress the recognition of particular words. Positive and negative values can be decimals.
- Follow best practices for keyword boosting.
- Support for out-of-vocabulary (OOV) keyword boosting when processing streaming audio is currently in beta; to fall back to previous keyword behavior, append the query parameter keyword_boost=legacy to your API request.
To learn more, see Features: Keywords.
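As a sketch of the query string these bullets describe (note: this page does not spell out the intensifier syntax, so the colon-delimited keywords=term:2 form below is an assumption):

```python
import requests

params = [
    ("keywords", "medicine"),          # plain keyword
    ("keywords", "miracle medicine"),  # multi-word; requests URL-encodes the space
    ("keywords", "prescription:2"),    # term:intensifier form is an assumption
]

with open("clinic_call.wav", "rb") as audio:  # hypothetical file
    data = requests.post(
        "https://brain.deepgram.com/v2/listen",
        auth=("gandalf", "mellon"),
        headers={"Content-Type": "audio/wav"},
        params=params,  # a list of tuples yields repeated keywords= instances
        data=audio,
    ).json()

print(data["results"]["channels"][0]["alternatives"][0]["transcript"])
```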
utterances: boolean
Indicates whether Deepgram will segment speech into meaningful semantic units, which allows the model to interact more naturally and effectively with speakers' spontaneous speech patterns. For example, when humans speak to each other conversationally, they often pause mid-sentence to reformulate their thoughts, or stop and restart a badly-worded sentence. When utterances is set to true, these utterances are identified and returned in the transcript results.
By default, when utterances is enabled, a new utterance starts after 0.8 s of silence. You can customize the length of silence used to determine where to split utterances by submitting the utt_split parameter.
To learn more, see Features: Utterances.
utt_split: number
Length of time in seconds of silence between words that Deepgram will use when determining where to split utterances. Used when utterances is enabled.
Default: 0.8
To learn more, see Features: Utterance Split.
Request Body Schema
Request body when submitting pre-recorded audio. Accepts either:
- Raw binary audio data. In this case, include a Content-Type header set to the audio MIME type.
- A JSON object with a single field from which the audio can be retrieved. In this case, include a Content-Type header set to application/json.
url: string
URL of audio file to transcribe.
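For example, the JSON form lets Deepgram fetch remote audio itself (a sketch assuming the /listen path and a hypothetical file URL):

```python
import requests

response = requests.post(
    "https://brain.deepgram.com/v2/listen",
    auth=("gandalf", "mellon"),
    # json= sets the Content-Type: application/json header automatically.
    json={"url": "https://example.com/recordings/call.mp3"},  # hypothetical URL
    # Passing a list produces redact=pci&redact=numbers in the query string.
    params={"model": "phonecall", "redact": ["pci", "numbers"]},
)

print(response.json())
```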
Responses
Status | Description |
---|---|
200 Success | Audio submitted for transcription. |
Response Schema
metadata: object
JSON-formatted ListenMetadata object.
request_id: string
Unique identifier for the submitted audio and derived data returned.
transaction_key: string
Blob of text that helps Deepgram engineers debug any problems you encounter. If you need help getting an API call to work correctly, send this key to us so that we can use it as a starting point when investigating any issues.
sha256: string
SHA-256 hash of the submitted audio data.
created: string
ISO-8601 timestamp that indicates when the audio was submitted.
duration: number
Duration in seconds of the submitted audio.
channels: integer
Number of channels detected in the submitted audio.
results: object
JSON-formatted ListenResults object.
channels: array
Array of JSON-formatted ChannelResult objects.
search: array
Array of JSON-formatted SearchResults objects.
query: string
Term for which Deepgram is searching.
hits: array
Array of JSON-formatted Hit objects.
confidence: number
Value between 0 and 1 that indicates the model's relative confidence in this hit.
start: number
Offset in seconds from the start of the audio to where the hit occurs.
end: number
Offset in seconds from the start of the audio to where the hit ends.
snippet: string
Transcript that corresponds to the time between start and end.
alternatives: array
Array of JSON-formatted ResultAlternative objects. This array will have length n, where n matches the value of the alternatives parameter passed in the request.
transcript: string
Single-string transcript containing what the model hears in this channel of audio.
confidence: number
Value between 0 and 1 indicating the model's relative confidence in this transcript.
words: array
Array of JSON-formatted Word objects.
word: string
Distinct word heard by the model.
start: number
Offset in seconds from the start of the audio to where the spoken word starts.
end: number
Offset in seconds from the start of the audio to where the spoken word ends.
confidence: number
Value between 0 and 1 indicating the model's relative confidence in this word.
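Putting the schema together, a sketch of pulling the transcript and word timings out of a response (field names as documented above; the file name is a placeholder from the earlier examples):

```python
import requests

with open("conference.wav", "rb") as audio:  # hypothetical file
    data = requests.post(
        "https://brain.deepgram.com/v2/listen",
        auth=("gandalf", "mellon"),
        headers={"Content-Type": "audio/wav"},
        data=audio,
    ).json()

print("request_id:", data["metadata"]["request_id"])

# One ChannelResult per audio channel; each carries its alternatives.
for channel in data["results"]["channels"]:
    best = channel["alternatives"][0]
    print(best["transcript"], f"(confidence {best['confidence']:.2f})")
    for word in best["words"]:
        print(f"  {word['word']}: {word['start']:.2f}s to {word['end']:.2f}s")
```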
Transcribe Streaming Audio
Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, which makes integration with customer pipelines simple due to the wide array of client libraries available.
To use this endpoint, connect to wss://cab2b5852c84ae12.deepgram.com/v2. TLS encryption will protect your connection and data.
All data is sent to the streaming endpoint as binary-type WebSocket messages containing payloads that are the raw audio data. Because the protocol is full-duplex, you can stream in real-time and still receive transcription responses while uploading data.
When you are finished, send an empty (length zero) binary message to the server. The server will interpret it as a shutdown command, which means it will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.
To learn more about working with real-time streaming data and results, see Streaming Audio in Real-Time.
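The whole flow, as a hedged sketch in Python (assuming the third-party websockets library, raw 16 kHz linear16 audio in a hypothetical local file, and Basic auth credentials passed in the URI, which the websockets library sends as an Authorization header):

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

# Streaming endpoint documented above. encoding and sample_rate are required
# because this example sends raw, headerless PCM (see the parameters below).
URI = (
    "wss://gandalf:mellon@cab2b5852c84ae12.deepgram.com/v2"
    "?encoding=linear16&sample_rate=16000&interim_results=true"
)

async def transcribe(path: str) -> None:
    async with websockets.connect(URI) as ws:

        async def send_audio() -> None:
            with open(path, "rb") as audio:  # hypothetical raw PCM file
                while chunk := audio.read(8000):
                    await ws.send(chunk)       # binary message: raw audio data
                    await asyncio.sleep(0.25)  # pace roughly in real time
            await ws.send(b"")  # zero-length binary message: shutdown command

        async def read_results() -> None:
            async for message in ws:  # full duplex: results arrive while sending
                result = json.loads(message)
                for alt in result.get("channel", {}).get("alternatives", []):
                    print(result.get("is_final"), alt["transcript"])

        await asyncio.gather(send_audio(), read_results())

asyncio.run(transcribe("audio.raw"))
```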
Query Parameters
model: string
AI model used to process submitted audio. Off-the-shelf Deepgram models include:
- general: (Default) Optimized for everyday audio processing; if you aren't sure which model to select, start here.
- meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
- phonecall: Optimized for low-bandwidth audio phone calls.
- voicemail: Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
- finance: Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
- conversationalai: Optimized to allow artificial intelligence technologies, such as chatbots, to interact with people in a human-like way.
- video: Optimized for audio sourced from videos.
You may also use a custom model associated with your account by including its custom_id.
To learn more, see Features: Model.
version: string
Version of the model to use.
Default: latest
Possible values: latest OR <version_id>
To learn more, see Features: Version.
language: string
BCP-47 language tag that hints at the primary spoken language. Language support is optimized for the following language/model combinations:
Dutch
- nl (beta). MODELS: general
English
- en-US: United States (Default). MODELS: general, meeting, phonecall, voicemail, finance, conversationalai, video
- en-AU: Australia (beta). MODELS: general
- en-GB: United Kingdom. MODELS: general, phonecall
- en-IN: India. MODELS: general, phonecall
- en-NZ: New Zealand. MODELS: general
French
- fr. MODELS: general
- fr-CA: Canada. MODELS: general
German
- de. MODELS: general
Hindi
- hi. MODELS: general
Italian
- it (beta). MODELS: general
Japanese
- ja (beta). MODELS: general
Korean
- ko (beta). MODELS: general
Portuguese
- pt. MODELS: general
- pt-BR: Brazil. MODELS: general
Russian
- ru. MODELS: general
Spanish
- es. MODELS: general
- es-419: Latin America (beta). MODELS: general
Swedish
- sv (beta). MODELS: general
Turkish
- tr. MODELS: general
To learn more, see Features: Language.
punctuate: boolean
Indicates whether to add punctuation and capitalization to the transcript. To learn more, see Features: Punctuation.
profanity_filter: boolean
Indicates whether to remove profanity from the transcript. To learn more, see Features: Profanity Filter.
redact: string
Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include:
- pci: Redacts sensitive credit card information, including credit card number, expiration date, and CVV.
- numbers (or true): Aggressively redacts strings of numerals.
- ssn (beta): Redacts social security numbers.
You can send multiple instances in the query string (for example, redact=pci&redact=numbers). When you send multiple values, redaction occurs in the order you specify: in this example, sensitive credit card information would be redacted first, then strings of numerals.
To learn more, see Features: Redaction.
diarize: boolean
Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. To learn more, see Features: Diarization.
ner: boolean
Indicates whether to recognize alphanumeric strings. When set to true, whitespace will be removed between characters identified as part of an alphanumeric string. To learn more, see Features: Named-Entity Recognition (NER).
multichannel: boolean
Indicates whether to transcribe each audio channel independently. When set to true, you will receive one transcript for each channel, which means you can apply a different model to each channel using the model parameter (for example, set model to general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1).
To learn more, see Features: Multichannel.
alternatives: integer
Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears.
Default: 1
numerals: boolean
Indicates whether to convert numbers from written format (for example, one) to numerical format (for example, 1). Deepgram can format numbers up to 999,999. Converted numbers do not include punctuation; for example, 999,999 would be transcribed as 999999.
To learn more, see Features: Numerals.
search: string
Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in audio rather than text patterns in transcripts because we have found that acoustic pattern matching performs better.
- Can include up to 25 search terms per request.
- Can send multiple instances in the query string (for example, search=speech&search=Friday).
To learn more, see Features: Search.
callback: string
Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id. When it has finished analyzing the audio, it will send a POST request to the provided URL with an appropriate HTTP status code.
- You may embed basic authentication credentials in the callback URL.
- Only ports 80, 443, 8080, and 8443 can be used for callbacks.
For streaming audio, callback can be used to redirect streaming responses to a different server:
- If the callback URL begins with http:// or https://, then POST requests are sent to the callback server for each streaming response.
- If the callback URL begins with ws:// or wss://, then a WebSocket connection is established with the callback server and WebSocket text messages are sent containing the streaming responses.
- If a WebSocket callback connection is disconnected at any point, the entire real-time transcription stream is killed; this maintains the strong guarantee of a one-to-one relationship between incoming real-time connections and outgoing WebSocket callback connections.
To learn more, see Features: Callback.
keywords: string
Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation.
- Can include up to 200 keywords per request.
- Can send multiple instances in the query string (for example, keywords=medicine&keywords=prescription).
- Can request multi-word keywords in a percent-encoded query string (for example, keywords=miracle%20medicine). When Deepgram listens for your supplied keywords, it separates them into individual words, then boosts or suppresses them individually.
- Can append a positive or negative intensifier to boost or suppress the recognition of particular words. Positive and negative values can be decimals.
- Follow best practices for keyword boosting.
- Support for out-of-vocabulary (OOV) keyword boosting when processing streaming audio is currently in beta; to fall back to previous keyword behavior, append the query parameter keyword_boost=legacy to your API request.
To learn more, see Features: Keywords.
interim_results: boolean
Indicates whether the streaming endpoint should send you updates to its transcription as more audio becomes available. By default, the streaming endpoint returns regular updates, which means transcription results will likely change for a period of time. You can avoid receiving these updates by setting this flag to false.
Setting the flag to false increases latency (usually by several seconds) because the server will need to stabilize the transcription before returning the final results for each piece of incoming audio. If you want the lowest-latency streaming available, then set interim_results to true and handle the corrected transcripts as they are returned.
To learn more, see Features: Interim Results.
endpointing: boolean
Indicates whether Deepgram will detect whether a speaker has finished speaking (or paused for a significant period of time, indicating the completion of an idea). When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes the result for the processed time range and returns the transcript with a speech_final parameter set to true.
For example, if you are working with a 15-second audio clip, but someone is speaking for only the first 3 seconds, endpointing allows you to get a finalized result after the first 3 seconds.
By default, endpointing is enabled and finalizes a transcript after 10 ms of silence.
Default: true
To learn more, see Features: Endpointing.
encoding: string
Expected encoding of the submitted streaming audio. Options include:
- linear16: 16-bit, little-endian, signed PCM WAV data
- flac: FLAC-encoded data
- mulaw: mu-law encoded WAV data
- amr-nb: adaptive multi-rate narrowband codec (sample rate must be 8000)
- amr-wb: adaptive multi-rate wideband codec (sample rate must be 16000)
- opus: Ogg Opus
- speex: Ogg Speex
Only required when raw, headerless audio packets are sent to the streaming service. For pre-recorded audio or audio submitted to the standard /listen endpoint, we support over 40 popular codecs and do not require this parameter.
To learn more, see Features: Encoding.
channels: integer
Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for encoding.
Default: 1
To learn more, see Features: Channels.
sample_rate: integer
Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding.
To learn more, see Features: Sample Rate.
Responses
Status | Description |
---|---|
201 Success | Audio submitted for transcription. |
Response Schema
channel_index: array
Information about the active channel in the form [channel_index, total_number_of_channels].
duration: number
Duration in seconds.
start: number
Offset in seconds.
is_final: boolean
Indicates that Deepgram has identified a point at which its transcript has reached maximum accuracy and is sending a definitive transcript of all audio up to that point. To learn more, see Understand Interim Transcripts.
speech_final: boolean
Indicates that Deepgram has detected an endpoint and immediately finalized its results for the processed time range. To learn more, see Understand Endpointing.
channel: object
alternatives: array
Array of JSON-formatted ResultAlternative objects. This array will have length n, where n matches the value of the alternatives parameter passed in the request.
transcript: string
Single-string transcript containing what the model hears in this channel of audio.
confidence: number
Value between 0 and 1 indicating the model's relative confidence in this transcript.
words: array
Array of JSON-formatted Word objects.
word: string
Distinct word heard by the model.
start: number
Offset in seconds from the start of the audio to where the spoken word starts.
end: number
Offset in seconds from the start of the audio to where the spoken word ends.
confidence: number
Value between 0 and 1 indicating the model's relative confidence in this word.
metadata: object
request_id: string
Unique identifier for the submitted audio and derived data returned.
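Building on the streaming example above, a sketch of consuming these messages: keep only finalized segments (is_final) and treat speech_final as an utterance boundary.

```python
import json

finalized: list[str] = []

def handle_message(raw: str) -> None:
    # Accumulate finalized transcript segments from streaming responses.
    result = json.loads(raw)
    alternatives = result.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return  # e.g., the summary metadata object sent at shutdown
    transcript = alternatives[0]["transcript"]
    if result.get("is_final"):      # definitive transcript for this time range
        finalized.append(transcript)
    if result.get("speech_final"):  # endpoint detected: flush the utterance
        print(" ".join(finalized))
        finalized.clear()
```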
API Keys
Create and manage API keys.
Get All Keys
Returns the list of keys associated with your account.
Responses
Status | Description |
---|---|
200 Success | Keys found. |
Response Schema
keys: array
Array of API Key objects.
key: string
Identifier of API Key.
label: string
Comments associated with API Key.
Create Key
Don't want to reuse your username and password in your requests? Don't want to share credentials within your team? Want to have separate credentials for your staging and production systems? No problem: generate all the API keys you need and use them just as you would your username and password.
This is the only opportunity to retrieve your API secret, so be sure to record it someplace safe.
Request Body Schema
label: string
User-friendly name of the API Key.
Example: My API Key
Responses
Status | Description |
---|---|
201 Success | Key created. |
Response Schema
key: string
Your new API key. This should replace your username in authentication requests.
secret: string
Your new secret. This should replace your password in authentication requests.
label: string
The user-friendly name of the API key that you submitted in the body of the request.
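A sketch of the round trip (note: this page documents the schemas but not the route, so the /keys path under the documented base URL is an assumption):

```python
import requests

BASE = "https://brain.deepgram.com/v2"

# Create a key using username/password auth; the /keys path is an assumption.
created = requests.post(
    f"{BASE}/keys",
    auth=("gandalf", "mellon"),
    json={"label": "My API Key"},
).json()

# The secret is only returned once, so record it someplace safe.
key, secret = created["key"], created["secret"]

# From now on, the key/secret pair can replace username/password:
keys = requests.get(f"{BASE}/keys", auth=(key, secret)).json()
print([k["label"] for k in keys["keys"]])
```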
Delete Key
Deletes the specified key.
Request Body Schema
Identifier of the project that contains the key that you want to delete.
The API key you wish to delete.
Example: x020gx00g0s0
Responses
Status | Description |
---|---|
204 Success | The API key was deleted. |
401 Error | Unauthorized. |
404 Error | The specified resource was not found. |