We've put together a glossary to help familiarize you with specialized Deepgram terminology. If you have a suggestion for a term to include, submit it through our feedback form.
Acronym
Word constructed from the letters (generally the first letters) of the phrase it stands for and pronounced as a word in its own right. For example, ELISA, AIDS, GABA.
Adaptive Multi-Rate (AMR)
Lossy audio coding algorithm specifically optimized for speech coding, which transmits both speech parameters and waveform signals. AMR dynamically adjusts speech bit rate and channel coding to adapt to varying network conditions, maximizing the likelihood that audio signals are received. It also reduces bandwidth during periods of silence using technologies like discontinuous transmission (DTX), voice activity detection (VAD), and comfort noise generation (CNG). AMR is most frequently used in digital telephony (such as VoIP, Wi-Fi, and satellite telephony), multimedia audio and videoconferencing, digital radio broadcasting, and audiobooks. It provides good-quality speech at low cost and is available in both a narrowband version (speech bandwidth of 300-3400 Hz) and a wideband version (speech bandwidth of 50-7000 Hz).
Dataset
Collection of resources that provide a large, representative sample of data that can be transcribed, labeled, and then used to train a model. Datasets can contain multiple resources, and you can add resources to a dataset, then retrain your model to get better results.
Free Lossless Audio Codec (FLAC)
Lossless audio coding algorithm specifically designed for the efficient packing of audio data, which allows it to be streamed and decoded quickly, independent of compression level. FLAC can typically reduce audio data to between 50% and 70% of its original size, though the final size depends on the density and amplitude of the audio being compressed. Because it is lossless, FLAC is suitable as an archive format: an exact duplicate of the original data can be recovered at any time. FLAC is an open format with royalty-free licensing and supports metadata tagging, album cover art, and fast seeking.
Initialism
Word constructed from the letters (generally the first letters) of the phrase it stands for and pronounced as individual letters. For example, DNA, RT-PCR.
Labeling
Process through which data is annotated for use with machine learning models. There are various ways to label data, all of which help models specialize their predictions.
Languages marked as (Labs) are continuously being updated, but do not yet have support for custom training. We’re looking for data partners to help us accelerate the performance of these languages. If you would like to partner with us in developing these languages, please let us know, and we would be happy to discuss next steps.
Lossless Compression
Class of compression algorithms that allow original data to be perfectly reconstructed from compressed data. Lossless compression is used in cases where deviations from the original data would be unfavorable. In audio, these cases most often involve archiving or production purposes where storage space is not a limiting factor. It is also often used as a component within lossy data compression technologies (for example, in lossless mid/side joint stereo preprocessing by MP3 encoders and other lossy audio encoders).
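The defining property of lossless compression, that the original bytes come back exactly, can be checked with a short round trip. This is a minimal sketch using the general-purpose zlib compressor from the Python standard library rather than an audio codec; the `roundtrip` helper and the sample data are illustrative assumptions.

```python
import zlib


def roundtrip(data: bytes) -> tuple[int, bytes]:
    """Compress data, decompress it again, and return (compressed size, restored bytes)."""
    compressed = zlib.compress(data, 9)  # 9 = strongest compression level
    restored = zlib.decompress(compressed)
    return len(compressed), restored


# Hypothetical, highly repetitive input standing in for real data.
samples = bytes(range(256)) * 64
compressed_size, restored = roundtrip(samples)

# Lossless: every byte is recovered, even though fewer bytes were stored.
assert restored == samples
assert compressed_size < len(samples)
```

How much smaller the compressed data is depends entirely on the input; only the exact reconstruction is guaranteed.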
Lossy Compression
Class of compression algorithms that irreversibly degrade data quality to reduce size for storage, handling, and transmission. Well-designed lossy compression technology can significantly reduce file sizes before the end user notices any degradation. Lossy compression is most commonly used to compress multimedia data (audio, video, and images), especially in applications such as streaming media and internet telephony.
Model
Artifact created by the training process that algorithms can use to recognize certain types of patterns.
Mu-law
Lossy audio coding algorithm that reduces the dynamic range of an audio signal by quantizing sample amplitudes on a logarithmic scale rather than a linear one, so quiet sounds are stored with finer resolution than loud ones. This improves the signal-to-noise ratio without increasing the amount of data. Mu-law stores each sample in 8 bits, which compresses 16-bit linear audio data to 50% of its original size. Mu-law is primarily used as a telecommunication standard in North America and Japan.
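Mu-law companding is usually described with a logarithmic curve, F(x) = sgn(x) · ln(1 + μ|x|) / ln(1 + μ), with μ = 255. The following is a minimal sketch of that continuous formula followed by a plain 8-bit quantization step. It is not the bit-exact telephony encoding, which uses a segmented approximation of the curve, and all function names here are illustrative.

```python
import math

MU = 255  # companding parameter commonly used with 8-bit mu-law


def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto a logarithmic scale, still in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def encode_8bit(x: float) -> int:
    """Compress a sample, then quantize it to one of 256 levels (illustrative layout)."""
    return round((mu_law_compress(x) + 1) / 2 * 255)


def decode_8bit(code: int) -> float:
    """Undo the quantization layout, then expand back to a linear sample."""
    return mu_law_expand(code / 255 * 2 - 1)


# Quiet samples keep a much smaller absolute error than loud ones.
quiet_error = abs(decode_8bit(encode_8bit(0.01)) - 0.01)
loud_error = abs(decode_8bit(encode_8bit(0.9)) - 0.9)
```

The round trip through `encode_8bit` and `decode_8bit` does not return the exact input, which is what makes the scheme lossy; the logarithmic curve simply spends more of the 256 levels on low amplitudes.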
Named Entity Recognition (NER)
Subtask of information extraction that locates and classifies named entities mentioned in unstructured text into predefined categories, such as names of people, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. State-of-the-art NER systems for English achieve near-human performance.
Ogg Opus
Opus packets encapsulated within Ogg containers, per the original Opus specification.
Opus
Lossy audio coding algorithm designed to efficiently code speech and general audio in a single format, while remaining low-latency enough for real-time interactive communication and low-complexity enough for low-end embedded processors. Opus replaces both Vorbis and Speex for new applications, and several blind listening tests have ranked it higher in quality than any other standard audio format, including MP3, AAC, and HE-AAC, at any given bitrate until transparency is reached. Opus packets may be wrapped in a network packet that supplies the packet length. Optionally, a self-delimited packet format that adds one or two additional bytes per packet to encode the packet length may be used.
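The one-or-two-byte length field mentioned above can be sketched as follows. This assumes the length coding described in the Opus specification (RFC 6716): values below 252 fit in a single byte, and larger values up to 1275 use a second byte. The function names are illustrative, and the sketch covers only the length field itself, not where it is placed inside a packet.

```python
MAX_LENGTH = 1275  # largest length this coding can express


def encode_length(n: int) -> bytes:
    """Encode a length in one byte (0..251) or two bytes (252..1275)."""
    if not 0 <= n <= MAX_LENGTH:
        raise ValueError(f"length must be in 0..{MAX_LENGTH}, got {n}")
    if n < 252:
        return bytes([n])
    first = 252 + (n % 4)  # 252..255 signals that a second byte follows
    second = (n - first) // 4
    return bytes([first, second])


def decode_length(buf: bytes) -> tuple[int, int]:
    """Return (length, number of bytes consumed from buf)."""
    first = buf[0]
    if first < 252:
        return first, 1
    return 4 * buf[1] + first, 2
```

For example, `encode_length(100)` yields a single byte, while `encode_length(1275)` yields two bytes, both set to 255.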
Real-Time Messaging Protocol (RTMP)
Protocol used to stream audio, video, and data over the internet. While RTMP was originally developed by Macromedia to stream between a Flash player and server, it continues to be used as an open-source protocol and is accepted by many streaming providers, such as Zoom.
Resource
Audio file and its associated data, such as a transcription. Resources can be collected into datasets, and a single resource can belong to multiple datasets.
Speex
Lossy audio coding algorithm specifically tuned for the reproduction of human speech and targeted at voice over IP (VoIP) and file-based compression. Design goals included optimizing for high-quality speech and low bit rate, so Speex uses multiple bit rates and supports ultra-wideband (32 kHz sampling rate), wideband (16 kHz sampling rate), and narrowband (telephone quality, 8 kHz sampling rate). Speex is now considered obsolete; its successor is the more modern Opus codec, which surpasses its performance in most areas except at the lowest sample rates. However, Speex does claim to be free from patent restrictions.
Training
Process through which a machine learning algorithm is fed many examples of data and human input to help it identify and replicate a decision an expert would make when provided with that same information.
Transcription
Process through which speech in an audio file is converted into written text.
Voice Activity Detection (VAD)
Detection of the presence or absence of human speech, mainly used in speech coding and recognition. VAD can both facilitate speech processing and be used to deactivate processes during non-speech sections of audio, thereby saving computational effort and network bandwidth. Multiple VAD algorithms have been developed that offer different features and different trade-offs between latency, sensitivity, accuracy, and computational cost. Some VAD algorithms also provide further analysis, such as identifying whether speech is voiced, unvoiced, or sustained.
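One of the simplest VAD approaches flags a frame as speech when its energy exceeds a fixed threshold. The sketch below illustrates that idea only; it is not how any particular product implements VAD, and the frame size, threshold, and function names are assumptions chosen for the example. Samples are assumed to be floats in the range [-1, 1].

```python
import math


def frame_energy_db(frame: list[float]) -> float:
    """Root-mean-square level of a frame in decibels relative to full scale."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-10))  # floor avoids log10(0) on pure silence


def detect_speech(samples: list[float], frame_size: int = 160,
                  threshold_db: float = -35.0) -> list[bool]:
    """Return one flag per complete frame: True where energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags


# Two frames of silence followed by two frames of a 200 Hz tone sampled at 8 kHz.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
flags = detect_speech(silence + tone)
```

A fixed threshold like this is cheap and low-latency but treats any loud sound as speech, which is the kind of accuracy trade-off described above.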
Word Error Rate (WER)
Common metric used to evaluate the effectiveness of automatic speech recognition (ASR) systems and compare the accuracy of the transcripts they produce. The more technical, industry-specific, “accented”, and noisy your speech data is, the more likely it is that both ASR systems and humans will yield a high WER.
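WER is conventionally computed as (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions needed to turn the reference transcript into the ASR output, and N is the number of words in the reference. The following is a minimal sketch using word-level edit distance; it assumes whitespace tokenization and does no text normalization (casing, punctuation), which real evaluations usually apply first. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "the quick brown fox" with the output "the quack brown" gives one substitution and one deletion over four reference words, a WER of 0.5. Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.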