Speech Recognition API Reference

Deepgram's Speech Recognition API gives you streamlined access to automatic transcription from Deepgram's off-the-shelf and trained speech recognition models. Transcription is fast, understands nearly every common audio format, and is highly customizable: you can shape your transcript using various query parameters and apply general-purpose or custom-trained AI models.

Endpoint

https://brain.deepgram.com/v2

Authentication

All requests to the API should include an Authorization header that uses the Basic scheme and contains the Base64-encoded username (or email address you used to sign up) and password of your Deepgram account.

For example, for user gandalf with password mellon, the base64-encoded value of gandalf:mellon is Z2FuZGFsZjptZWxsb24=. So Gandalf's requests to the Deepgram API should all include the following header: Authorization: Basic Z2FuZGFsZjptZWxsb24=.

Security Scheme Type HTTP
HTTP Authorization Scheme basic
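
For example, the following minimal Python sketch builds the header for the gandalf/mellon credentials shown above using only the standard library; substitute your own username and password.

    import base64

    username = "gandalf"  # or the email address you signed up with
    password = "mellon"

    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}
    # {'Authorization': 'Basic Z2FuZGFsZjptZWxsb24='}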

Transcription

High-speed transcription of either pre-recorded or streaming audio. Deepgram understands nearly every common audio format, and you can customize your transcript using various query parameters and apply general-purpose or custom-trained AI models.

Transcribe Pre-recorded Audio

Transcribes the specified audio file.

Query Parameters

model:
string

AI model used to process submitted audio. Off-the-shelf Deepgram models include:

  • general: (Default)
    Optimized for everyday audio processing; if you aren't sure which model to select, start here.
  • meeting:
    Optimized for conference room settings, which include multiple speakers with a single microphone.
  • phonecall:
    Optimized for low-bandwidth audio phone calls.
  • voicemail:
    Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
  • finance:
    Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
  • conversationalai:
    Optimized to allow artificial intelligence technologies, such as chatbots, to interact with people in a human-like way.
  • video:
    Optimized for audio sourced from videos.
  • <custom_id>:
    You may also use a custom model associated with your account by including its custom_id.

To learn more, see Features: Model.

version:
string

Version of the model to use.

Default: latest

Possible values: latest OR <version_id>

To learn more, see Features: Version.

language:
string

BCP-47 language tag that hints at the primary spoken language. Language support is optimized for the following language/model combinations:

English

  • en-US: (Default)
    MODELS: general, meeting, phonecall, voicemail, finance, conversationalai
  • en-GB:
    MODELS: general, phonecall
  • en-IN:
    MODELS: general, phonecall
  • en-NZ:
    MODELS: general

French

  • fr:
    MODELS: general
  • fr-CA:
    MODELS: general

German

  • de:
    MODELS: general

Hindi

  • hi:
    MODELS: general

Korean

  • ko:
    MODELS: general

Portuguese

  • pt:
    MODELS: general
  • pt-BR:
    MODELS: general

Russian

  • ru:
    MODELS: general

Spanish

  • es:
    MODELS: general

Turkish

  • tr:
    MODELS: general

To learn more, see Features: Language.

punctuate:
boolean

Indicates whether to add punctuation and capitalization to the transcript. To learn more, see Features: Punctuation.

profanity_filter:
boolean

Indicates whether to remove profanity from the transcript. To learn more, see Features: Profanity Filter.

redact:
any

Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include:

  • pci:
    Redacts sensitive credit card information, including credit card number, expiration date, and CVV.
  • numbers: (or true)
    Aggressively redacts strings of numerals.
  • ssn: (beta)
    Redacts social security numbers.

Can send multiple instances in query string (for example, redact=pci&redact=numbers). When sending multiple values, redaction occurs in the order you specify. For instance, in this example, sensitive credit card information would be redacted first, then strings of numbers.

To learn more, see Features: Redaction.

diarize:
boolean

Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. To learn more, see Features: Diarization.

ner:
boolean

Indicates whether to recognize alphanumeric strings. When set to true, whitespace will be removed between characters identified as part of an alphanumeric string. To learn more, see Features: Named-Entity Recognition (NER).

multichannel:
boolean

Indicates whether to transcribe each audio channel independently. When set to true, you will receive one transcript for each channel, which means you can apply a different model to each channel using the model parameter (for example, set model to general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1).

To learn more, see Features: Multichannel.

alternatives:
integer

Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears.

Default: 1

numerals:
boolean

Indicates whether to convert numbers from written format (for example, one) to numerical format (for example, 1).

Deepgram can format numbers up to 999,999.

Converted numbers do not include punctuation. For example, 999,999 would be transcribed as 999999.

To learn more, see Features: Numerals.

search:
any

Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in audio rather than text patterns in transcripts because we have noticed that acoustic pattern matching is more performant.

  • Can include up to 25 search terms per request.
  • Can send multiple instances in query string (for example, search=speech&search=Friday).

To learn more, see Features: Search.
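
For example, a small Python sketch (standard library only) showing how repeated search terms end up in the query string:

    from urllib.parse import urlencode

    # Repeated keys become repeated query parameters
    query = urlencode([("search", "speech"), ("search", "Friday"), ("model", "general")])
    # 'search=speech&search=Friday&model=general'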

callback:
string

Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id. When it has finished analyzing the audio, it will send a POST request to the provided URL with an appropriate HTTP status code.

  • You may embed basic authentication credentials in the callback URL.
  • Only ports 80, 443, 8080, and 8443 can be used for callbacks.

To learn more, see Features: Callback.
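
As an illustration, a minimal sketch of asynchronous processing using Python's requests package; the callback URL and audio file below are placeholders.

    import requests

    with open("conversation.wav", "rb") as audio:  # placeholder file
        r = requests.post(
            "https://brain.deepgram.com/v2/listen",
            params={"callback": "https://example.com/transcripts"},  # placeholder receiver
            headers={"Content-Type": "audio/wav"},
            data=audio,
            auth=("gandalf", "mellon"),
        )

    # Deepgram responds immediately with the request_id; the finished transcript
    # is POSTed to the callback URL once processing completes.
    print(r.json())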

keywords:
any

Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation.

  • Can include up to 200 keywords per request.
  • Can send multiple instances in query string (for example, keywords=medicine&keywords=prescription).
  • Can request multi-word keywords in a percent-encoded query string (for example, keywords=miracle%20medicine). When Deepgram listens for your supplied keywords, it separates them into individual words, then boosts or suppresses them individually.
  • Can append a positive or negative intensifier to either boost or suppress the recognition of particular words. Positive and negative values can be decimals.
  • Follow best practices for keyword boosting.
  • Support for out-of-vocabulary (OOV) keyword boosting when processing streaming audio is currently in beta; to fall back to previous keyword behavior, append the query parameter keyword_boost=legacy to your API request.

To learn more, see Features: Keywords.

utterances:
boolean

Indicates whether Deepgram will segment speech into meaningful semantic units, which allows the model to interact more naturally and effectively with speakers' spontaneous speech patterns. For example, when humans speak to each other conversationally, they often pause mid-sentence to reformulate their thoughts, or stop and restart a badly-worded sentence. When utterances is set to true, these utterances are identified and returned in the transcript results.

By default, when utterances is enabled, it starts a new utterance after 0.8 s of silence. You can customize the length of time used to determine where to split utterances by submitting the utt_split parameter.

To learn more, see Features: Utterances.

utt_split:
number

Length of time in seconds of silence between words that Deepgram will use when determining where to split utterances. Used when utterances is enabled.

Default: 0.8

To learn more, see Features: Utterance Split.

Request Body Schema

Request body when submitting pre-recorded audio. Accepts either:

  • raw binary audio data. In this case, include a Content-Type header set to the audio MIME type.
  • JSON object with a single field from which the audio can be retrieved. In this case, include a Content-Type header set to application/json.

url: string

URL of audio file to transcribe.
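
As a concrete illustration, here is a minimal sketch of both request shapes using Python's requests package; the credentials, file path, hosted URL, and query parameters shown are placeholders.

    import requests

    AUTH = ("gandalf", "mellon")  # username/password or API key/secret
    ENDPOINT = "https://brain.deepgram.com/v2/listen"
    params = {"model": "general", "punctuate": "true"}

    # Option 1: raw binary audio with the audio MIME type as the Content-Type
    with open("conversation.wav", "rb") as audio:  # placeholder file
        r = requests.post(
            ENDPOINT,
            params=params,
            headers={"Content-Type": "audio/wav"},
            data=audio,
            auth=AUTH,
        )

    # Option 2: JSON body pointing at audio hosted elsewhere
    # (requests sets Content-Type: application/json automatically)
    r = requests.post(
        ENDPOINT,
        params=params,
        json={"url": "https://example.com/conversation.wav"},  # placeholder URL
        auth=AUTH,
    )
    print(r.status_code)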

Responses

Status Description
200 Success: Audio submitted for transcription.

Response Schema
  • metadata: object

    JSON-formatted ListenMetadata object.

    • request_id: string

      Unique identifier for the submitted audio and derived data returned.

    • transaction_key: string

      Blob of text that helps Deepgram engineers debug any problems you encounter. If you need help getting an API call to work correctly, send this key to us so that we can use it as a starting point when investigating any issues.

    • sha256: string

      SHA-256 hash of the submitted audio data.

    • created: string

      ISO-8601 timestamp that indicates when the audio was submitted.

    • duration: number

      Duration in seconds of the submitted audio.

    • channels: integer

      Number of channels detected in the submitted audio.

  • results: object

    JSON-formatted ListenResults object.

    • channels: array

      Array of JSON-formatted ChannelResult objects.

      • search: array

        Array of JSON-formatted SearchResults.

        • query: string

          Term for which Deepgram is searching.

        • hits: array

          Array of JSON-formatted Hit objects.

          • confidence: number

            Value between 0 and 1 that indicates the model's relative confidence in this hit.

          • start: number

            Offset in seconds from the start of the audio to where the hit occurs.

          • end: number

            Offset in seconds from the start of the audio to where the hit ends.

          • snippet: string

            Transcript that corresponds to the time between start and end.

      • alternatives: array

        Array of JSON-formatted ResultAlternative objects. This array will have length n, where n matches the value of the alternatives query parameter passed with the request.

        • transcript: string

          Single-string transcript containing what the model hears in this channel of audio.

        • confidence: number

          Value between 0 and 1 indicating the model's relative confidence in this transcript.

        • words: array

          Array of JSON-formatted Word objects.

          • word: string

            Distinct word heard by the model.

          • start: number

            Offset in seconds from the start of the audio to where the spoken word starts.

          • end: number

            Offset in seconds from the start of the audio to where the spoken word ends.

          • confidence: number

            Value between 0 and 1 indicating the model's relative confidence in this word.
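
Assuming r is the response object from the request sketch shown earlier (under Request Body Schema), one way to pull the transcript out of a successful response, following the schema above:

    response = r.json()

    # One ChannelResult per audio channel; each carries its own alternatives
    for channel in response["results"]["channels"]:
        best = channel["alternatives"][0]  # first (or only) alternative
        print(best["confidence"], best["transcript"])
        for word in best["words"]:
            print(word["word"], word["start"], word["end"], word["confidence"])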

Endpoint
POST /listen

Transcribe Streaming Audio

Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, which makes integration with customer pipelines simple due to the wide array of client libraries available.

To use this endpoint, connect to wss://brain.deepgram.com/v2. TLS encryption will protect your connection and data.

All data is sent to the streaming endpoint as binary-type WebSocket messages containing payloads that are the raw audio data. Because the protocol is full-duplex, you can stream in real-time and still receive transcription responses while uploading data.

When you are finished, send an empty (length zero) binary message to the server. The server will interpret it as a shutdown command, which means it will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.

To learn more about working with real-time streaming data and results, see Streaming Audio in Real-Time.
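
As an illustration, the following sketch streams a local file using the third-party websockets package for Python; the file path, chunk size, pacing, and query parameters are placeholders, and the host follows the base endpoint documented above.

    import asyncio
    import base64
    import json

    import websockets  # third-party package; any WebSocket client will do

    token = base64.b64encode(b"gandalf:mellon").decode("ascii")  # your credentials
    URI = "wss://brain.deepgram.com/v2/listen/stream?punctuate=true"

    async def transcribe(path):
        # websockets >= 14 renames extra_headers to additional_headers
        async with websockets.connect(
            URI, extra_headers={"Authorization": f"Basic {token}"}
        ) as ws:

            async def sender():
                with open(path, "rb") as audio:  # placeholder file
                    while chunk := audio.read(8000):
                        await ws.send(chunk)       # binary audio message
                        await asyncio.sleep(0.25)  # rough real-time pacing
                await ws.send(b"")                 # empty message: shutdown command

            async def receiver():
                async for message in ws:           # ends when the server closes
                    print(json.loads(message))     # streaming transcription response

            await asyncio.gather(sender(), receiver())

    asyncio.run(transcribe("conversation.wav"))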

Query Parameters

model:
string

AI model used to process submitted audio. Off-the-shelf Deepgram models include:

  • general: (Default)
    Optimized for everyday audio processing; if you aren't sure which model to select, start here.
  • meeting:
    Optimized for conference room settings, which include multiple speakers with a single microphone.
  • phonecall:
    Optimized for low-bandwidth audio phone calls.
  • voicemail:
    Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
  • finance:
    Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
  • conversationalai:
    Optimized to allow artificial intelligence technologies, such as chatbots, to interact with people in a human-like way.
  • video:
    Optimized for audio sourced from videos.

You may also use a custom model associated with your account by including its custom_id.

To learn more, see Features: Model.

version:
string

Version of the model to use.

Default: latest

Possible values: latest OR <version_id>

To learn more, see Features: Version.

language:
string

BCP-47 language tag that hints at the primary spoken language. Language support is optimized for the following language/model combinations:

English

  • en-US: (Default)
    MODELS: general, meeting, phonecall, voicemail, finance, conversationalai
  • en-GB:
    MODELS: general, phonecall
  • en-IN:
    MODELS: general, phonecall
  • en-NZ:
    MODELS: general

French

  • fr:
    MODELS: general
  • fr-CA:
    MODELS: general

German

  • de:
    MODELS: general

Hindi

  • hi:
    MODELS: general

Korean

  • ko:
    MODELS: general

Portuguese

  • pt:
    MODELS: general
  • pt-BR:
    MODELS: general

Russian

  • ru:
    MODELS: general

Spanish

  • es:
    MODELS: general

Turkish

  • tr:
    MODELS: general

To learn more, see Features: Language.

punctuate:
boolean

Indicates whether to add punctuation and capitalization to the transcript. To learn more, see Features: Punctuation.

profanity_filter:
boolean

Indicates whether to remove profanity from the transcript. To learn more, see Features: Profanity Filter.

redact:
any

Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include:

  • pci:
    Redacts sensitive credit card information, including credit card number, expiration date, and CVV.
  • numbers: (or true)
    Aggressively redacts strings of numerals.
  • ssn: (beta)
    Redacts social security numbers.

Can send multiple instances in query string (for example, redact=pci&redact=numbers). When sending multiple values, redaction occurs in the order you specify. (For instance, in this example, sensitive credit card information would be redacted first, then strings of numbers.)

To learn more, see Features: Redaction.

diarize:
boolean

Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. To learn more, see Features: Diarization.

ner:
boolean

Indicates whether to recognize alphanumeric strings. When set to true, whitespace will be removed between characters identified as part of an alphanumeric string. To learn more, see Features: Named-Entity Recognition (NER).

multichannel:
boolean

Indicates whether to transcribe each audio channel independently. When set to true, you will receive one transcript for each channel, which means you can apply a different model to each channel using the model parameter (for example, set model to general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1).

To learn more, see Features: Multichannel.

alternatives:
integer

Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears.

Default: 1

numerals:
boolean

Indicates whether to convert numbers from written format (for example, one) to numerical format (for example, 1).

Deepgram can format numbers up to 999,999.

Converted numbers do not include punctuation. For example, 999,999 would be transcribed as 999999.

To learn more, see Features: Numerals.

search:
any

Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in audio rather than text patterns in transcripts because we have noticed that acoustic pattern matching is more performant.

  • Can include up to 25 search terms per request.
  • Can send multiple instances in query string (for example, search=speech&search=Friday).

To learn more, see Features: Search.

callback:
string

Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id. When it has finished analyzing the audio, it will send a POST request to the provided URL with an appropriate HTTP status code.

  • You may embed basic authentication credentials in the callback URL.
  • Only ports 80, 443, 8080, and 8443 can be used for callbacks.

For streaming audio, callback can be used to redirect streaming responses to a different server:

  • If the callback URL begins with http:// or https://, then POST requests are sent to the callback server for each streaming response.
  • If the callback URL begins with ws:// or wss://, then a WebSocket connection is established with the callback server and WebSocket text messages are sent containing the streaming responses.
  • If a WebSocket callback connection is disconnected at any point, the entire real-time transcription stream is killed; this maintains the strong guarantee of a one-to-one relationship between incoming real-time connections and outgoing WebSocket callback connections.

To learn more, see Features: Callback.

keywords:
any

Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation.

  • Can include up to 200 keywords per request.
  • Can send multiple instances in query string (for example, keywords=medicine&keywords=prescription).
  • Can request multi-word keywords in a percent-encoded query string (for example, keywords=miracle%20medicine). When Deepgram listens for your supplied keywords, it separates them into individual words, then boosts or suppresses them individually.
  • Can append a positive or negative intensifier to either boost or suppress the recognition of particular words. Positive and negative values can be decimals.
  • Follow best practices for keyword boosting.
  • Support for out-of-vocabulary (OOV) keyword boosting when processing streaming audio is currently in beta; to fall back to previous keyword behavior, append the query parameter keyword_boost=legacy to your API request.

To learn more, see Features: Keywords.

interim_results:
boolean

Indicates whether the streaming endpoint should send you updates to its transcription as more audio becomes available. By default, the streaming endpoint returns regular updates, which means transcription results will likely change for a period of time. You can avoid receiving these updates by setting this flag to false.

Setting the flag to false increases latency (usually by several seconds) because the server will need to stabilize the transcription before returning the final results for each piece of incoming audio. If you want the lowest-latency streaming available, then set interim_results to true and handle the corrected transcripts as they are returned.

To learn more, see Features: Interim Results.

endpointing:
boolean

Indicates whether Deepgram will detect whether a speaker has finished speaking (or paused for a significant period of time, indicating the completion of an idea). When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes the result for the processed time range and returns the transcript with a speech_final parameter set to true.

For example, if you are working with a 15-second audio clip, but someone is speaking for only the first 3 seconds, endpointing allows you to get a finalized result after the first 3 seconds.

By default, endpointing is enabled and finalizes a transcript after 10 ms of silence. You can customize the length of time used to detect whether a speaker has finished speaking by submitting the vad_turnoff parameter.

Default: true

To learn more, see Features: Endpointing.

vad_turnoff:
integer

Length of time in milliseconds of silence that voice activity detection (VAD) will use to detect that a speaker has finished speaking. Used when endpointing is enabled. Defaults to 10 ms. Deepgram customers may configure a value between 10 ms and 500 ms; on-premise customers may remove this restriction.

Default: 10

To learn more, see Features: Voice Activity Detection (VAD).

encoding:
string

Expected encoding of the submitted streaming audio.

Options include:

  • linear16: 16-bit, little endian, signed PCM WAV data
  • flac: FLAC-encoded data
  • mulaw: mu-law encoded WAV data
  • amr-nb: adaptive multi-rate narrowband codec (sample rate must be 8000)
  • amr-wb: adaptive multi-rate wideband codec (sample rate must be 16000)
  • opus: Ogg Opus
  • speex: Ogg Speex

Only required when raw, headerless audio packets are sent to the streaming service. For pre-recorded audio or audio submitted to the standard /listen endpoint, we support over 40 popular codecs and do not require this parameter.

To learn more, see Features: Encoding.

channels:
integer

Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for encoding.

Default: 1

To learn more, see Features: Channels.

sample_rate:
integer

Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding.

To learn more, see Features: Sample Rate.
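
For example, a minimal Python sketch (standard library only) of assembling a streaming URL for raw 16-bit PCM audio; the values shown are illustrative.

    from urllib.parse import urlencode

    # Raw 16-bit PCM sampled at 16 kHz on a single channel
    params = {"encoding": "linear16", "sample_rate": 16000, "channels": 1}
    uri = "wss://brain.deepgram.com/v2/listen/stream?" + urlencode(params)
    # wss://brain.deepgram.com/v2/listen/stream?encoding=linear16&sample_rate=16000&channels=1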

Responses

Status Description
201 Success: Audio submitted for transcription.

Response Schema
  • channel_index: array

    Information about the active channel in the form [channel_index, total_number_of_channels].

  • duration: number

    Duration in seconds.

  • start: number

    Offset in seconds.

  • is_final: boolean

    Indicates that Deepgram has identified a point at which its transcript has reached maximum accuracy and is sending a definitive transcript of all audio up to that point. To learn more, see Understand Interim Transcripts.

  • speech_final: boolean

    Indicates that Deepgram has detected an endpoint and immediately finalized its results for the processed time range. To learn more, see Understand Endpointing.

  • channel: object

    • alternatives: array

      Array of JSON-formatted ResultAlternative objects. This array will have length n, where n matches the value of the alternatives query parameter passed with the request.

      • transcript: string

        Single-string transcript containing what the model hears in this channel of audio.

      • confidence: number

        Value between 0 and 1 indicating the model's relative confidence in this transcript.

      • words: array

        Array of JSON-formatted Word objects.

        • word: string

          Distinct word heard by the model.

        • start: number

          Offset in seconds from the start of the audio to where the spoken word starts.

        • end: number

          Offset in seconds from the start of the audio to where the spoken word ends.

        • confidence: number

          Value between 0 and 1 indicating the model's relative confidence in this word.

  • metadata: object

    • request_id: string

      Unique identifier for the submitted audio and derived data returned.
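
As an illustration, a minimal Python sketch of consuming these messages, keying off is_final and speech_final as described above; the surrounding receive loop is assumed.

    def handle_message(message: dict, finalized: list) -> None:
        if "channel" not in message:  # final summary metadata object
            return

        # Each streaming response carries one channel with at least one alternative.
        alternative = message["channel"]["alternatives"][0]
        transcript = alternative["transcript"]

        if message["is_final"]:
            # Definitive transcript for this time range; safe to keep.
            finalized.append(transcript)
            if message["speech_final"]:
                # Endpoint detected: the speaker has finished an utterance.
                print("utterance:", " ".join(finalized))
                finalized.clear()
        else:
            # Interim result; likely to change as more audio arrives.
            print("interim:", transcript)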

Endpoint
WSS /listen/stream

API Keys

Generate API keys.

Get All Keys

Returns the list of keys associated with your account.

Responses

Status Description
200 Success: Keys found.

Response Schema
  • keys: array

    Array of API Key objects.

    • key: string

      Identifier of API Key.

    • label: string

      Comments associated with API Key.

Endpoint
GET /keys

Create Key

Don't want to reuse your username and password in your requests? Don't want to share credentials within your team? Want to have separate credentials for your staging and production systems? No problem: generate all the API keys you need and use them just as you would your username and password.

This is the only opportunity to retrieve your API secret, so be sure to record it someplace safe.

Request Body Schema

label:
string

User-friendly name of the API Key.

Example: My API Key

Responses

Status Description
201 Success: Key created.

Response Schema
  • key: string

    Your new API key. This should replace your username in authentication requests.

  • secret: string

    Your new secret. This should replace your password in authentication requests.

  • label: string

    The user-friendly name of the API key that you submitted in the body of the request.

Endpoint
POST /keys
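
For example, a minimal sketch using Python's requests package that creates a key and then authenticates with it; the label is arbitrary, and remember that the secret is returned only once.

    import requests

    r = requests.post(
        "https://brain.deepgram.com/v2/keys",
        json={"label": "My API Key"},
        auth=("gandalf", "mellon"),  # existing account credentials
    )
    new_key = r.json()  # contains key, secret, and label

    # From now on, the key replaces the username and the secret replaces the password.
    requests.get(
        "https://brain.deepgram.com/v2/keys",
        auth=(new_key["key"], new_key["secret"]),
    )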

Delete Key

Deletes the specified key.

Request Body Schema

project_id: string

Identifier of the project that contains the key that you want to delete.

key: string

The API key you wish to delete.

Example: x020gx00g0s0

Responses

Status Description
204 Success: The API key was deleted.
401 Error: Unauthorized.
404 Error: The specified resource was not found.

Endpoint
DELETE /keys
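
For example, a minimal sketch using Python's requests package; the project_id value is a placeholder, and the key is the example identifier shown above.

    import requests

    requests.delete(
        "https://brain.deepgram.com/v2/keys",
        json={"project_id": "<project_id>", "key": "x020gx00g0s0"},  # placeholders
        auth=("gandalf", "mellon"),
    )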