Audio and Video (New: 1.10)
Modern deep learning technologies offer much better performance on multimedia data than previous approaches, so there are lots of opportunities for cool new products and features. Prodigy lets you create training data for a variety of common tasks, such as transcription, classification and speaker diarization. You can also use Prodigy as a library of simple building blocks to construct a custom solution, even if you have to cross-reference audio, video, text and metadata.
Manual audio annotation
The audio.manual recipe lets you load in audio or video files and add labelled regions to them. Under the hood, Prodigy will save the start and end timestamps, as well as the label for each region. You can click and drag to add a region, resize existing regions by dragging the start and end, and remove regions by clicking their × button. Annotated regions can also overlap, if needed.
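Concretely, each region ends up on the annotation task as an entry in "audio_spans", recording its start and end timestamps in seconds plus its label. A rough sketch of an annotated example (the file reference and timestamps are made up, and extra keys may be present depending on your version):

# Sketch of an annotated task with two labelled, overlapping regions
example = {
    "audio": "https://example.com/interview.mp3",  # URL or base64-encoded data
    "audio_spans": [
        {"start": 2.4, "end": 7.1, "label": "SPEAKER_1"},
        {"start": 6.8, "end": 11.0, "label": "SPEAKER_2"},  # overlaps the first region
    ],
}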
The following command starts the Prodigy server, loads in audio files from a directory ./recordings and allows annotating regions on them for two labels, SPEAKER_1 and SPEAKER_2:
Recipe command
prodigy audio.manual speaker_data ./recordings --label SPEAKER_1,SPEAKER_2
By default, the audio loader expects to load files from a directory. The files will be encoded as base64 so they can be served to the web app, and the encoded data will be removed again before the annotations are placed in the database, so your datasets don't fill up with large binary blobs.
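If your recordings are already hosted somewhere the annotator's browser can reach, you can skip the base64 step entirely and stream in tasks that point to URLs instead. A minimal sketch, assuming a JSONL source file (the file name and URLs are made up):

recordings.jsonl
{"audio": "https://example.com/recordings/call_001.mp3"}
{"audio": "https://example.com/recordings/call_002.mp3"}

Recipe command
prodigy audio.manual speaker_data ./recordings.jsonl --loader jsonl --label SPEAKER_1,SPEAKER_2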
Manual video annotation
The audio and audio_manual interfaces also support video files out-of-the-box – all you need to do is load in data with a key "video" containing the URL or base64-encoded data. The easiest way is to use audio.manual with --loader video. The video is now displayed above the waveform and you can annotate regions referring to timestamps of the video. This is especially helpful when annotating who is speaking, as the video can hold a lot of clues.
Recipe command
prodigy audio.manual speaker_data ./recordings --loader video --label SPEAKER_1,SPEAKER_2
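If your videos are hosted online, the same idea should work as for audio: stream in a JSONL file of tasks whose "video" key holds the URL (the URL and file name below are made up) and pass --loader jsonl:

videos.jsonl
{"video": "https://example.com/videos/interview.mp4"}

Recipe command
prodigy audio.manual speaker_data ./videos.jsonl --loader jsonl --label SPEAKER_1,SPEAKER_2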
Audio or video transcription
Prodigy’s blocks interface lets you combine multiple different interfaces into one – for example, audio and text_input. The built-in audio.transcribe workflow uses this combination to provide a straightforward audio-transcription interface. The free-form text typed in by the user will be saved to the annotation task as the key "transcript". The following command starts the server with a directory of recordings and saves the annotations to a dataset:
Recipe command
prodigy audio.transcribe speaker_transcripts ./recordings
To make it easier to toggle play and pause as you transcribe and to prevent clashes with the text input field (like with the default enter), this recipe lets you customize the keyboard shortcuts. To toggle play/pause, you can press command/option/alt/ctrl+enter or provide your own overrides via --playpause-key, for instance --playpause-key command+w.
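If you need more control than audio.transcribe exposes, for example extra fields or different settings, you can recreate a similar setup in a custom recipe. The following is only a sketch of that idea, assuming the built-in Audio loader and the documented blocks and text_input options; the recipe name and field label are made up:

import prodigy
from prodigy.components.loaders import Audio

@prodigy.recipe("transcribe-audio")  # hypothetical recipe name
def transcribe_audio(dataset, source):
    # Combine the audio player with a free-form text field. The text the
    # annotator types in is saved on the task under the "field_id" key.
    blocks = [
        {"view_id": "audio"},
        {
            "view_id": "text_input",
            "field_id": "transcript",
            "field_rows": 4,
            "field_label": "Transcript",
            "field_autofocus": True,
        },
    ]
    return {
        "dataset": dataset,        # dataset the annotations are saved to
        "stream": Audio(source),   # stream of audio tasks from a directory
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

Command-line usage
prodigy transcribe-audio speaker_transcripts ./recordings -F recipe.py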
Audio or video classification
Custom recipes also let you build your very own workflows for audio or video annotations. For instance, you might want to load in audio recordings and sort them into categories, e.g. to classify the type of noise and whether it’s produced by a car, a plane or something else.
The custom recipe for this workflow is pretty straightforward: using the Audio loader, you can load your files from a directory. You can then add a list of "options" to each incoming example. The "text" value is displayed to the user and the "id" is used under the hood. When you select options, their "id" values will be added to the task as "accept", e.g. "accept": ["PLANE"]. For more details on the available UI settings, check out the interface docs.
recipe.py

import prodigy
from prodigy.components.loaders import Audio

@prodigy.recipe("classify-audio")
def classify_audio(dataset, source):
    def get_stream():
        # Load the directory of audio files and add options to each task
        stream = Audio(source)
        for eg in stream:
            eg["options"] = [
                {"id": "CAR", "text": "🚗 Car"},
                {"id": "PLANE", "text": "✈️ Plane"},
                {"id": "OTHER", "text": "Other / Unclear"}
            ]
            yield eg

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple"
            "choice_auto_accept": True,
            "audio_loop": True,
            "show_audio_minimap": False
        }
    }
Command-line usage
prodigy classify-audio noise_data ./recordings -F recipe.py
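Accepted answers end up in your dataset with the selected option IDs stored under "accept", alongside Prodigy's usual "answer" key. One stored example might look roughly like this (the file reference is illustrative):

# Sketch of an annotated example as saved to the dataset
{
    "audio": "https://example.com/recordings/noise_017.wav",  # URL or base64-encoded data
    "options": [
        {"id": "CAR", "text": "🚗 Car"},
        {"id": "PLANE", "text": "✈️ Plane"},
        {"id": "OTHER", "text": "Other / Unclear"}
    ],
    "accept": ["PLANE"],  # IDs of the selected options
    "answer": "accept"
}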
Annotating with a model in the loop
Custom recipes let you integrate machine learning models using any framework of your choice. You can use a pretrained model to help you label segments and bootstrap your manual annotations, and even implement an update callback to update and improve the model as you annotate. The following examples are implemented using the pyannote.audio library, which is powered by PyTorch and provides neural building blocks for speaker diarization, including speech activity detection, speaker change detection and speaker embedding.
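The exact recipe depends on your model and framework, but the overall shape stays the same: pre-populate incoming tasks with the model's predicted regions and return an update callback that receives batches of answers. The sketch below only illustrates that structure; load_model, predict_regions and update_model are hypothetical placeholders, not part of Prodigy or pyannote.audio:

import prodigy
from prodigy.components.loaders import Audio

@prodigy.recipe("audio-with-model")  # hypothetical recipe name
def audio_with_model(dataset, source):
    model = load_model()  # placeholder: load your pretrained model here

    def get_stream():
        for eg in Audio(source):
            # Pre-populate the regions with the model's predictions so the
            # annotator only has to correct them. predict_regions is assumed
            # to return a list of {"start": ..., "end": ..., "label": ...} dicts.
            eg["audio_spans"] = predict_regions(model, eg)
            yield eg

    def update(answers):
        # Called with batches of answered tasks, each carrying an "answer" key.
        # Use the accepted examples to improve the model as you go.
        update_model(model, [eg for eg in answers if eg["answer"] == "accept"])

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "update": update,
        "view_id": "audio_manual",
        "config": {"labels": ["SPEECH"]},
    }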
Speech activity detection
Speech activity detection (SAD) is the task of detecting the presence or absence of human speech. pyannote.audio provides pretrained models and pipelines that you can use to bootstrap your speech/non-speech annotations with Prodigy. The pyannote.sad.manual recipe will stream in .wav files in chunks and tag the detected speech regions as SPEECH. You can then adjust the regions manually if needed.
Recipe command
prodigy pyannote.sad.manual speech_activity ./data/wav --chunk 5
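If you'd rather wire the bootstrapping up yourself in a custom recipe, the pyannote.audio side of it can look roughly like the sketch below. This assumes a recent pyannote.audio release with a pretrained voice activity detection pipeline; the pipeline name is an assumption and, depending on the version, loading it may require a Hugging Face access token:

from pyannote.audio import Pipeline

# Assumed pipeline name; may require authentication depending on the version
pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection")
output = pipeline("recordings/interview.wav")  # returns a pyannote Annotation

# Convert the detected speech regions into Prodigy-style audio spans
audio_spans = [
    {"start": segment.start, "end": segment.end, "label": "SPEECH"}
    for segment in output.get_timeline().support()
]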
Speaker segmentation and speaker change detection
Speaker change detection lets you detect when a different person starts speaking and allows you to extract speaker segments from your audio input. pyannote.audio provides pretrained models and pipelines that you can use to bootstrap your speaker change and speaker segmentation annotations with Prodigy. The pyannote.scd.binary recipe will stream in .wav files and mark the occurring speaker changes. You can then accept the annotation if the change is correctly detected, or reject it if the suggestion is wrong.
Recipe command
prodigy pyannote.scd.binary speaker_change ./data/wav