Audio and Video (New: 1.10)
Modern deep learning technologies offer much better performance on multimedia data than previous approaches, so there are lots of opportunities for cool new products and features. Prodigy lets you create training data for a variety of common tasks, such as transcription, classification and speaker diarization. You can also use Prodigy as a library of simple building blocks to construct a custom solution, even if you have to cross-reference audio, video, text and metadata.
Manual audio annotation
The audio.manual recipe lets you load in audio or video files and add labelled regions to them. Under the hood, Prodigy will save the start and end timestamps, as well as the label for each region. You can click and drag to add a region, resize existing regions by dragging the start and end, and remove regions by clicking their × button. Annotated regions can also overlap, if needed.
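Concretely, each region ends up on the annotation task as an entry in "audio_spans", recording its start and end timestamps in seconds plus its label. A rough sketch of an annotated example (the file reference and timestamps are made up, and extra keys may be present depending on your version):

# Sketch of an annotated task with two labelled, overlapping regions
example = {
    "audio": "https://example.com/interview.mp3",  # URL or base64-encoded data
    "audio_spans": [
        {"start": 2.4, "end": 7.1, "label": "SPEAKER_1"},
        {"start": 6.8, "end": 11.0, "label": "SPEAKER_2"},  # overlaps the first region
    ],
}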
The following command starts the Prodigy server, loads in audio files from a directory ./recordings and allows annotating regions on them for two labels, SPEAKER_1 and SPEAKER_2:
Recipe command
prodigy audio.manual speaker_data ./recordings --label SPEAKER_1,SPEAKER_2
By default, the audio loader expects to load files from a directory. The files will be encoded as base64 so they can be served to the web app, and the encoded data will be removed again before the annotations are placed in the database, so your datasets don't fill up with large binary blobs.
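If your recordings are already hosted somewhere the annotator's browser can reach, you can skip the base64 step entirely and stream in tasks that point to URLs instead. A minimal sketch, assuming a JSONL source file (the file name and URLs are made up):

recordings.jsonl
{"audio": "https://example.com/recordings/call_001.mp3"}
{"audio": "https://example.com/recordings/call_002.mp3"}

Recipe command
prodigy audio.manual speaker_data ./recordings.jsonl --loader jsonl --label SPEAKER_1,SPEAKER_2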
Manual video annotation
The audio and audio_manual interfaces also support video files out-of-the-box – all you need to do is load in data with a key "video" containing the URL or base64-encoded data. The easiest way is to use audio.manual with --loader video. The video is now displayed above the waveform and you can annotate regions referring to timestamps of the video. This is especially helpful when annotating who is speaking, as the video can hold a lot of clues.
Recipe command
prodigy audio.manual speaker_data ./recordings --loader video --label SPEAKER_1,SPEAKER_2
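If your videos are hosted online, the same idea should work as for audio: stream in a JSONL file of tasks whose "video" key holds the URL (the URL and file name below are made up) and pass --loader jsonl:

videos.jsonl
{"video": "https://example.com/videos/interview.mp4"}

Recipe command
prodigy audio.manual speaker_data ./videos.jsonl --loader jsonl --label SPEAKER_1,SPEAKER_2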
Audio or video transcription
Prodigy’s blocks interface lets you combine multiple different interfaces into one – for example, audio and text_input. The built-in audio.transcribe workflow uses this combination to provide a straightforward audio-transcription interface. The free-form text typed in by the user will be saved to the annotation task as the key "transcript". The following command starts the server with a directory of recordings and saves the annotations to a dataset:
Recipe command
prodigy audio.transcribe speaker_transcripts ./recordings
To make it easier to toggle play and pause as you transcribe and to prevent clashes with the text input field (like with the default enter), this recipe lets you customize the keyboard shortcuts. To toggle play/pause, you can press command/option/alt/ctrl+enter or provide your own overrides via --playpause-key, for instance --playpause-key command+w.
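If you need more control than audio.transcribe exposes, for example extra fields or different settings, you can recreate a similar setup in a custom recipe. The following is only a sketch of that idea, assuming the built-in Audio loader and the documented blocks and text_input options; the recipe name and field label are made up:

import prodigy
from prodigy.components.loaders import Audio

@prodigy.recipe("transcribe-audio")  # hypothetical recipe name
def transcribe_audio(dataset, source):
    # Combine the audio player with a free-form text field. The text the
    # annotator types in is saved on the task under the "field_id" key.
    blocks = [
        {"view_id": "audio"},
        {
            "view_id": "text_input",
            "field_id": "transcript",
            "field_rows": 4,
            "field_label": "Transcript",
            "field_autofocus": True,
        },
    ]
    return {
        "dataset": dataset,        # dataset the annotations are saved to
        "stream": Audio(source),   # stream of audio tasks from a directory
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

Command-line usage
prodigy transcribe-audio speaker_transcripts ./recordings -F recipe.py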
Audio or video classification
Custom recipes also let you build your very own workflows for audio or video annotations. For instance, you might want to load in audio recordings and sort them into categories, e.g. to classify the type of noise and whether it’s produced by a car, a plane or something else.
The custom recipe for this workflow is pretty straightforward: using the Audio loader, you can load your files from a directory. You can then add a list of "options" to each incoming example. The "text" value is displayed to the user and the "id" is used under the hood. When you select options, their "id" values will be added to the task as "accept", e.g. "accept": ["PLANE"]. For more details on the available UI settings, check out the interface docs.
recipe.py

import prodigy
from prodigy.components.loaders import Audio

@prodigy.recipe("classify-audio")
def classify_audio(dataset, source):
    def get_stream():
        # Load the directory of audio files and add options to each task
        stream = Audio(source)
        for eg in stream:
            eg["options"] = [
                {"id": "CAR", "text": "🚗 Car"},
                {"id": "PLANE", "text": "✈️ Plane"},
                {"id": "OTHER", "text": "Other / Unclear"}
            ]
            yield eg

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple"
            "choice_auto_accept": True,
            "audio_loop": True,
            "show_audio_minimap": False
        }
    }
Command-line usage
prodigy classify-audio noise_data ./recordings -F recipe.py
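Accepted answers end up in your dataset with the selected option IDs stored under "accept", alongside Prodigy's usual "answer" key. One stored example might look roughly like this (the file reference is illustrative):

# Sketch of an annotated example as saved to the dataset
{
    "audio": "https://example.com/recordings/noise_017.wav",  # URL or base64-encoded data
    "options": [
        {"id": "CAR", "text": "🚗 Car"},
        {"id": "PLANE", "text": "✈️ Plane"},
        {"id": "OTHER", "text": "Other / Unclear"}
    ],
    "accept": ["PLANE"],  # IDs of the selected options
    "answer": "accept"
}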
Annotating with a model in the loop
Custom recipes let you integrate machine learning models using any framework of your choice. You can use a pretrained model to help you label segments and bootstrap your manual annotations, and even implement an update callback to update and improve the model as you annotate. The following examples are implemented using the pyannote.audio library, which is powered by PyTorch and provides neural building blocks for speaker diarization, including speech activity detection, speaker change detection and speaker embedding.
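The exact recipe depends on your model and framework, but the overall shape stays the same: pre-populate incoming tasks with the model's predicted regions and return an update callback that receives batches of answers. The sketch below only illustrates that structure; load_model, predict_regions and update_model are hypothetical placeholders, not part of Prodigy or pyannote.audio:

import prodigy
from prodigy.components.loaders import Audio

@prodigy.recipe("audio-with-model")  # hypothetical recipe name
def audio_with_model(dataset, source):
    model = load_model()  # placeholder: load your pretrained model here

    def get_stream():
        for eg in Audio(source):
            # Pre-populate the regions with the model's predictions so the
            # annotator only has to correct them. predict_regions is assumed
            # to return a list of {"start": ..., "end": ..., "label": ...} dicts.
            eg["audio_spans"] = predict_regions(model, eg)
            yield eg

    def update(answers):
        # Called with batches of answered tasks, each carrying an "answer" key.
        # Use the accepted examples to improve the model as you go.
        update_model(model, [eg for eg in answers if eg["answer"] == "accept"])

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "update": update,
        "view_id": "audio_manual",
        "config": {"labels": ["SPEECH"]},
    }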
Speech activity detection
Speech activity detection (SAD) is the task of detecting the presence or absence of human speech. pyannote.audio provides pretrained models and pipelines that you can use to bootstrap your speech/non-speech annotations with Prodigy. The pyannote.sad.manual recipe will stream in .wav files in chunks and tag the detected speech regions as SPEECH. You can then adjust the regions manually if needed.
Recipe command
prodigy pyannote.sad.manual speech_activity ./data/wav --chunk 5
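If you'd rather wire the bootstrapping up yourself in a custom recipe, the pyannote.audio side of it can look roughly like the sketch below. This assumes a recent pyannote.audio release with a pretrained voice activity detection pipeline; the pipeline name is an assumption and, depending on the version, loading it may require a Hugging Face access token:

from pyannote.audio import Pipeline

# Assumed pipeline name; may require authentication depending on the version
pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection")
output = pipeline("recordings/interview.wav")  # returns a pyannote Annotation

# Convert the detected speech regions into Prodigy-style audio spans
audio_spans = [
    {"start": segment.start, "end": segment.end, "label": "SPEECH"}
    for segment in output.get_timeline().support()
]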
Speaker segmentation and speaker change detection
Speaker change detection lets you detect when a different person starts speaking and allows you to extract speaker segments from your audio input. pyannote.audio provides pretrained models and pipelines that you can use to bootstrap your speaker change and speaker segmentation annotations with Prodigy. The pyannote.scd.binary recipe will stream in .wav files and mark the occurring speaker changes. You can then accept the annotation if the change is correctly detected, or reject it if the suggestion is wrong.
Recipe command
prodigy pyannote.scd.binary speaker_change ./data/wav