Loaders and Input Data
Loaders are helper classes to turn a source file into an iterable stream. Calling a loader returns a generator that yields annotation tasks in Prodigy’s JSON format. Prodigy supports streaming in data from a variety of different formats, via the available loader components. To load data from other formats or sources, like a database or an API, you can write your own loader function that returns an iterable stream, and include it in your custom recipe.
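For example, a minimal custom loader is just a generator function that yields task dictionaries in Prodigy’s JSON format. This is an illustrative sketch (the in-memory list source and the `"source"` meta key are made up for the example):

```python
def custom_loader(source):
    # Yield one annotation task per record, in Prodigy's JSON format.
    for record in source:
        yield {"text": record, "meta": {"source": "custom"}}

# Any iterable works as a source – a file, an API response, a database cursor.
stream = custom_loader(["This is a sentence.", "This is another sentence."])
```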
Input data formats
Text sources
data.jsonl
{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}
data.json
[
  { "text": "This is a sentence." },
  { "text": "This is another sentence.", "meta": { "score": 0.1 } }
]
data.csv
Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1
Column headers can be lowercase or title case. Columns for label and meta are optional. The value of the meta column will be added as a "meta" key within the "meta" dictionary, e.g. {"text": "...", "meta": {"meta": 0}}.
data.txt
This is a sentence.
This is another sentence.
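As a sketch of how the CSV columns map to task properties (illustrative only, not Prodigy’s own implementation), the stdlib csv module can be used to turn each row into a task dict, with the Meta column nested as described above:

```python
import csv
import io

def csv_to_tasks(csv_text):
    # Illustrative sketch of the CSV-to-task mapping, not Prodigy's own code.
    for row in csv.DictReader(io.StringIO(csv_text)):
        task = {"text": row["Text"]}
        if "Label" in row:
            task["label"] = row["Label"]
        if "Meta" in row:
            # The Meta column value is nested as {"meta": {"meta": ...}}
            task["meta"] = {"meta": row["Meta"]}
        yield task

data = "Text,Label,Meta\nThis is a sentence.,POSITIVE,0\n"
tasks = list(csv_to_tasks(data))
# tasks[0] == {"text": "This is a sentence.", "label": "POSITIVE", "meta": {"meta": "0"}}
```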
Comparison files
Each entry in a comparison file needs to include an output key containing the annotation example data – for example, the text, a text and an entity span or an image. Optionally, you can also include an input key for the baseline annotation. The id is used to combine the examples from each file. If an ID is only present in one file, the example is skipped.
model_a.jsonl
{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}}
model_b.jsonl
{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}}
Match patterns
Match patterns can be used in recipes like ner.manual, textcat.teach or match to filter out specific entities you’re interested in – for example, to collect training data for a new entity type. You can also use the terms.to-patterns recipe to convert a dataset of seed terms to a JSONL pattern file.
Each entry should contain a "label" and a "pattern" key. A pattern can be an exact string, or a rule-based token pattern (used by spaCy’s Matcher class), consisting of a list of dictionaries, each describing one individual token and its attributes. When using token patterns, keep in mind that their interpretation depends on the model’s tokenizer.
patterns.jsonl
{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}
Here are some examples of match patterns and the respective matched strings. For more details, see the spaCy documentation on rule-based matching.
Pattern | Matches |
---|---|
[{"lower": "apple"}] | “apple”, “APPLE”, “Apple”, “ApPlLe” etc. |
[{"text": "apple"}] | “apple” |
[{"lower": "squash", "pos": "NOUN"}] | “squash”, “Squash” etc. (nouns only, i.e. not “to squash”) |
"Lamb's lettuce" | “Lamb’s lettuce” |
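Reading and validating a pattern file needs only the stdlib – a sketch (spaCy’s Matcher would then consume the token patterns; string patterns are matched exactly):

```python
import json

def load_patterns(lines):
    # Each entry needs a "label" and a "pattern" key; the pattern is either
    # an exact string or a list of token-attribute dicts.
    for line in lines:
        entry = json.loads(line)
        assert "label" in entry and "pattern" in entry
        assert isinstance(entry["pattern"], (str, list))
        yield entry

lines = [
    '{"label": "FRUIT", "pattern": [{"lower": "apple"}]}',
    '{"label": "VEGETABLE", "pattern": "Lamb\'s lettuce"}',
]
patterns = list(load_patterns(lines))
```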
Images
Images can be loaded from a URL or base64 data URI via any of the data formats that support keyed inputs, or from a directory of files. See the details on file loaders for how to load images from a directory, and the image and image_manual docs for details on the expected JSON format.
Audio and video New: 1.10
Audio and video data can be loaded from a URL or base64 data URI via any of the data formats that support keyed inputs, or from a directory of files. See the details on file loaders for how to load media files from a directory, and the audio and audio_manual docs for details on the expected JSON format.
File loaders
Out of the box, Prodigy currently supports loading data from single files of JSONL, JSON, CSV or plain text. You can specify the loader via the --loader argument on the command line. If no loader is set, Prodigy will use the file extension to pick the respective loader. Loaders are available via prodigy.components.loaders.
ID | Component | Description |
---|---|---|
jsonl | JSONL | Stream in newline-delimited JSON from a file. Prodigy’s preferred format, as it’s flexible and doesn’t require parsing the entire file. |
json | JSON | Stream in JSON from a file. Requires loading and parsing the entire file. |
csv | CSV | Stream in a CSV file using the csv module. The keys will be read off the headers in the first line. Supports an optional delimiter keyword argument. |
txt | TXT | Stream in plain text from a file containing one example per line. Will yield tasks containing only a text property. |
images | Images | Stream in images from a directory. All images will be encoded as base64 data URIs and included as the image key to be rendered with the image or image_manual interface. |
image-server | ImageServer | New: 1.9.4 Stream in images from a directory. Image files will be served via the Prodigy server and their data won’t be included with the task. |
audio | Audio | New: 1.10 Stream in audio files from a directory. All files will be encoded as base64 data URIs and included as the audio key to be rendered with the audio or audio_manual interface. |
audio-server | AudioServer | New: 1.10 Stream in audio files from a directory. Audio files will be served via the Prodigy server and their data won’t be included with the task. |
video | Video | New: 1.10 Stream in video files from a directory. All files will be encoded as base64 data URIs and included as the video key to be rendered with the audio or audio_manual interface. |
video-server | VideoServer | New: 1.10 Stream in video files from a directory. Video files will be served via the Prodigy server and their data won’t be included with the task. |
Example
from prodigy.components.loaders import JSONL, JSON, CSV, TXT, Images, ImageServer
jsonl_stream = JSONL("path/to/file.jsonl")
json_stream = JSON("path/to/file.json")
csv_stream = CSV("path/to/file.csv", delimiter=",")
txt_stream = TXT("path/to/file.txt")
img_stream = Images("path/to/images")
img_stream2 = ImageServer("path/to/images")
Example
prodigy ner.manual your_dataset en_core_web_sm /tmp/your_data.dump --loader txt --label PERSON,ORG
Media loader APIs
The Images, ImageServer, Audio and AudioServer loaders all follow the same API and accept a path to a directory of files and optional file extensions.
Argument | Type | Description |
---|---|---|
f | str | Path to directory of files. |
file_ext | tuple | New: 1.10 File extensions to load. All other files in the directory will be ignored. See the default media file extensions below. |
YIELDS | dict | The annotation tasks with the loaded data. |
As of v1.10, Prodigy also exposes more generic Base64 and Server loaders that can be used to implement loading for other file types.
Argument | Type | Description |
---|---|---|
f | str | Path to directory of files. |
input_key | str | The key of the task dict to assign the string or URL to, e.g. "image" or "audio" . |
file_ext | tuple | File extensions to consider. If None (default), all files in the directory will be loaded. |
YIELDS | dict | The annotation tasks with the loaded data. |
Default media file extensions
Loader | Extensions |
---|---|
Images | (".jpg", ".jpeg", ".png", ".gif", ".svg") |
Audio | (".mp3", ".m4a", ".wav") |
Video | (".mpeg", ".mpg", ".mp4") |
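The base64 directory loaders can be sketched with the stdlib alone – an illustrative version, not Prodigy’s own code:

```python
import base64
import mimetypes
from pathlib import Path

def load_media_dir(path, input_key, file_ext=None):
    # Illustrative sketch of a base64 directory loader: encode each matching
    # file as a data URI and assign it to the given task key.
    for file_path in sorted(Path(path).iterdir()):
        if file_ext and file_path.suffix not in file_ext:
            continue  # skip files with other extensions
        mime = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(file_path.read_bytes()).decode("utf-8")
        yield {input_key: f"data:{mime};base64,{encoded}",
               "meta": {"file": file_path.name}}
```

For example, `load_media_dir("path/to/images", "image", file_ext=(".jpg", ".png"))` would yield tasks with an "image" key, much like the Images loader described above.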
Fetching images from local paths and URLs
You can also use the fetch_media preprocessor to replace all local paths and URLs in your stream with base64 data URIs. The skip keyword argument lets you specify whether to skip invalid files that can’t be converted (for example, because the path doesn’t exist or the URL can’t be fetched). If set to False, Prodigy will raise a ValueError if it encounters invalid files.
from prodigy.components.preprocess import fetch_media
stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_media(stream, "image", skip=True)
Corpus loaders
Prodigy also supports converting data from popular datasets and corpora.
ID | Component | Description |
---|---|---|
reddit | Reddit | Stream in examples from a file of the Reddit corpus. Will extract, clean and validate the comments. |
Example
from prodigy.components.loaders import Reddit
reddit_stream = Reddit("path/to/reddit.bz2")
Loading from existing datasets New: 1.10
The dataset: syntax lets you specify an existing dataset as the input source. Prodigy will then load the annotations from the dataset and stream them in again. Annotation interfaces respect pre-defined annotations and will pre-select them in the UI. This is useful if you want to re-annotate a dataset to correct it, or if you want to add new information with a different interface. The following command will stream in annotations from the dataset ner_data and save the resulting re-annotated data in a new dataset, ner_data_new:
Example
prodigy ner.manual ner_data_new blank:en dataset:ner_data --label PERSON,ORG
Optionally, you can also add another : plus the value of the answer to load if you only want to load examples with specific answers like "accept" or "ignore". For example, you may want to re-annotate difficult questions you previously skipped by hitting ignore. Similarly, if you’re using rel.manual to assign relations to pre-annotated spans, you typically only want to load in accepted answers.
Example
prodigy rel.manual ner_data blank:en dataset:ner_data:accept --label SUBJECT,OBJECT
Loading from standard input
If the source argument on the command line is set to -, Prodigy will read from sys.stdin. This lets you pipe data forward. If you’re loading data in a different format, make sure to set the --loader argument on the command line so Prodigy knows how to interpret the incoming data.
cat ./your_data.jsonl | prodigy ner.manual your_dataset en_core_web_sm - --loader jsonl
Loading text files from a directory or custom format
A custom loader should be a function that loads your data and yields dictionaries in Prodigy’s JSON format. If you’re writing a custom recipe, you can implement your loading in your recipe function:
recipe.py (pseudocode)
@prodigy.recipe("custom-recipe-with-loader")
def custom_recipe_with_loader(dataset, source):
    stream = load_your_source_here(source)  # implement your custom loading
    return {"dataset": dataset, "stream": stream, "view_id": "text"}
Using custom loaders with built-in recipes
If you want to use a built-in recipe like ner.manual but load in data from a custom source, there’s usually no need to copy-paste the recipe script only to replace the loader. Instead, you can write a loader script that outputs the data, and then pipe that output forward. If the source argument on the command line is set to -, Prodigy will read from sys.stdin:
python load_data.py | prodigy ner.manual your_dataset en_core_web_sm -
All your custom loader script needs to do is load the data somehow, create annotation tasks in Prodigy’s format (e.g. a dictionary with a "text" key) and print the dumped JSON.
load_data.py (pseudocode)
from pathlib import Path
import json

data_path = Path("/path/to/directory")
for file_path in data_path.iterdir():  # iterate over directory
    with file_path.open("r", encoding="utf8") as lines:  # open file
        for line in lines:
            task = {"text": line.strip()}  # create one task for each line of text
            print(json.dumps(task))  # dump and print the JSON
This approach works for any file format and data type – for example, you could also load in data from a database or via an API. For extra convenience, you can also wrap your loader in a custom recipe and have Prodigy take care of adding the command-line interface. If a custom recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will just execute the code.
load_data.py (pseudocode)
@prodigy.recipe("load-data")  # add argument annotations and shortcuts if needed
def load_data(dir_path):
    ...  # the loader code here
You can then use your custom loader like this:
prodigy load-data /path/to/directory -F load_data.py | prodigy ner.manual your_dataset en_core_web_sm -
Hashing and deduplication
When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are uint32 values, so they can be stored as JSON with each task. Based on those hashes, Prodigy is able to determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input. For more details on how the hashes are generated and how to set custom hashes, see the set_hashes docs.
Hash | Type | Description |
---|---|---|
_input_hash | uint32 | Hash representing the input that annotations are collected on, e.g. the "text" , "image" or "html" . Examples with the same text will receive the same input hash. |
_task_hash | uint32 | Hash representing the “question” about the input, i.e. the "label" , "spans" or "options" . Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes. |
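The idea can be illustrated with a stdlib sketch – this is not Prodigy’s actual hashing (use set_hashes for that), just a demonstration of hashing the input keys alone versus the input plus the “question” keys, truncated to a uint32 value:

```python
import hashlib
import json

def make_hash(task, keys):
    # Hash the selected task keys, truncated to a uint32 value.
    data = json.dumps({k: task.get(k) for k in sorted(keys)}, sort_keys=True)
    return int(hashlib.md5(data.encode("utf-8")).hexdigest(), 16) % (2 ** 32)

task_a = {"text": "apple", "label": "FRUIT"}
task_b = {"text": "apple", "label": "COMPANY"}
# The input hash covers only the input; the task hash also covers the question.
input_hashes = [make_hash(t, ["text"]) for t in (task_a, task_b)]
task_hashes = [make_hash(t, ["text", "label"]) for t in (task_a, task_b)]
```

Both tasks share the same text, so they get the same input hash, but their different label suggestions produce different task hashes.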
As of v1.9, recipes can return an optional "exclude_by" setting in their "config" to specify whether to exclude by "input" or "task" (default). Filtering and excluding by input hash is especially useful for manual and semi-manual workflows like ner.manual and ner.correct. If you’ve already annotated an example and it comes in again with suggestions from a model or pattern, Prodigy will correctly determine that it’s a different “question”. However, unlike in the binary workflows, you typically don’t want to see the example again, because you already created a gold-standard annotation for it.
Live APIs
API loaders are similar to file format loaders, but stream in content via a web API – for example, news headlines or teasers for a topic or from a specific publication, or images for a search term or related tags. Individual APIs differ in the type of content they provide and their respective rate limit restrictions. All APIs supported by Prodigy come with a free license option and should provide sufficient rate limits for use on a single machine.
API loaders are available via prodigy.components.loaders, and using them requires an entry for the loader ID in the "api_keys" section of your prodigy.json. The value of the source argument on the command line is used as the API query.
ID | Loader | Description |
---|---|---|
nyt | NewYorkTimes | The New York Times API. |
guardian | Guardian | The Guardian API. |
zeit | Zeit | Die Zeit API (German). |
newsapi | NewsAPI | News API |
twitter | Twitter | Twitter API. Requires the API key to be a dict with consumer_key , consumer_secret , access_token and access_token_secret . |
tumblr | Tumblr | Tumblr API. Returns images. |
github | GitHub | GitHub API. Doesn’t require an API key. |
unsplash | Unsplash | Unsplash API. Returns images. |