Loaders and Input Data
Loaders are helper classes to turn a source file into an iterable stream. Calling a loader returns a generator that yields annotation tasks in Prodigy’s JSON format. Prodigy supports streaming in data from a variety of different formats, via the available loader components. To load data from other formats or sources, like a database or an API, you can write your own loader function that returns an iterable stream, and include it in your custom recipe.
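For example, a minimal custom loader is just a generator function that yields task dictionaries in Prodigy’s JSON format. This is an illustrative sketch (the in-memory list source and the `"source"` meta key are made up for the example):

```python
def custom_loader(source):
    # Yield one annotation task per record, in Prodigy's JSON format.
    for record in source:
        yield {"text": record, "meta": {"source": "custom"}}

# Any iterable works as a source – a file, an API response, a database cursor.
stream = custom_loader(["This is a sentence.", "This is another sentence."])
```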
Input data formats
Text sources
data.jsonl
{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}
data.json
[
  { "text": "This is a sentence." },
  { "text": "This is another sentence.", "meta": { "score": 0.1 } }
]
data.csv
Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1
Column headers can be lowercase or title case. Columns for label and meta are optional. The value of the meta column will be added as a "meta" key within the "meta" dictionary, e.g. {"text": "...", "meta": {"meta": 0}}.
data.txt
This is a sentence.
This is another sentence.
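As a sketch of how the CSV columns map to task properties (illustrative only, not Prodigy’s own implementation), the stdlib csv module can be used to turn each row into a task dict, with the Meta column nested as described above:

```python
import csv
import io

def csv_to_tasks(csv_text):
    # Illustrative sketch of the CSV-to-task mapping, not Prodigy's own code.
    for row in csv.DictReader(io.StringIO(csv_text)):
        task = {"text": row["Text"]}
        if "Label" in row:
            task["label"] = row["Label"]
        if "Meta" in row:
            # The Meta column value is nested as {"meta": {"meta": ...}}
            task["meta"] = {"meta": row["Meta"]}
        yield task

data = "Text,Label,Meta\nThis is a sentence.,POSITIVE,0\n"
tasks = list(csv_to_tasks(data))
# tasks[0] == {"text": "This is a sentence.", "label": "POSITIVE", "meta": {"meta": "0"}}
```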
Comparison files
Each entry in a comparison file needs to include an output key containing the annotation example data – for example, the text, a text and an entity span or an image. Optionally, you can also include an input key for the baseline annotation. The id is used to combine the examples from each file. If an ID is only present in one file, the example is skipped.
model_a.jsonl
{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}}
model_b.jsonl
{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}}
Match patterns
Match patterns can be used in recipes like ner.manual, textcat.teach or match to filter out specific entities you’re interested in – for example, to collect training data for a new entity type. You can also use the terms.to-patterns recipe to convert a dataset of seed terms to a JSONL pattern file.
Each entry should contain a "label" and a "pattern" key. A pattern can be an exact string, or a rule-based token pattern (used by spaCy’s Matcher class), consisting of a list of dictionaries, each describing one individual token and its attributes. When using token patterns, keep in mind that their interpretation depends on the model’s tokenizer.
patterns.jsonl
{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}
Here are some examples of match patterns and the respective matched strings. For more details, see the spaCy documentation on rule-based matching.
Pattern | Matches |
---|---|
[{"lower": "apple"}] | “apple”, “APPLE”, “Apple”, “ApPlLe” etc. |
[{"text": "apple"}] | “apple” |
[{"lower": "squash", "pos": "NOUN"}] | “squash”, “Squash” etc. (nouns only, i.e. not “to squash”) |
"Lamb's lettuce" | “Lamb’s lettuce” |
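Reading and validating a pattern file needs only the stdlib – a sketch (spaCy’s Matcher would then consume the token patterns; string patterns are matched exactly):

```python
import json

def load_patterns(lines):
    # Each entry needs a "label" and a "pattern" key; the pattern is either
    # an exact string or a list of token-attribute dicts.
    for line in lines:
        entry = json.loads(line)
        assert "label" in entry and "pattern" in entry
        assert isinstance(entry["pattern"], (str, list))
        yield entry

lines = [
    '{"label": "FRUIT", "pattern": [{"lower": "apple"}]}',
    '{"label": "VEGETABLE", "pattern": "Lamb\'s lettuce"}',
]
patterns = list(load_patterns(lines))
```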
Images
Images can be loaded from a URL or base64 data URI via any of the data formats that support keyed inputs, or from a directory of files. See the details on file loaders for how to load images from a directory, and the image and image_manual docs for details on the expected JSON format.
Audio and video New: 1.10
Audio and video data can be loaded from a URL or base64 data URI via any of the data formats that support keyed inputs, or from a directory of files. See the details on file loaders for how to load media files from a directory, and the audio and audio_manual docs for details on the expected JSON format.
File loaders
Out of the box, Prodigy currently supports loading data from single files of JSONL, JSON, CSV or plain text. You can specify the loader via the --loader argument on the command line. If no loader is set, Prodigy will use the file extension to pick the respective loader. Loaders are available via prodigy.components.loaders.
ID | Component | Description |
---|---|---|
jsonl | JSONL | Stream in newline-delimited JSON from a file. Prodigy’s preferred format, as it’s flexible and doesn’t require parsing the entire file. |
json | JSON | Stream in JSON from a file. Requires loading and parsing the entire file. |
csv | CSV | Stream in a CSV file using the csv module. The keys will be read off the headers in the first line. Supports an optional delimiter keyword argument. |
txt | TXT | Stream in plain text from a file containing one example per line. Will yield tasks containing only a text property. |
images | Images | Stream in images from a directory. All images will be encoded as base64 data URIs and included as the image key to be rendered with the image or image_manual interface. |
image-server | ImageServer | New: 1.9.4 Stream in images from a directory. Image files will be served via the Prodigy server and their data won’t be included with the task. |
audio | Audio | New: 1.10 Stream in audio files from a directory. All files will be encoded as base64 data URIs and included as the audio key to be rendered with the audio or audio_manual interface. |
audio-server | AudioServer | New: 1.10 Stream in audio files from a directory. Audio files will be served via the Prodigy server and their data won’t be included with the task. |
video | Video | New: 1.10 Stream in video files from a directory. All files will be encoded as base64 data URIs and included as the video key to be rendered with the audio or audio_manual interface. |
video-server | VideoServer | New: 1.10 Stream in video files from a directory. Video files will be served via the Prodigy server and their data won’t be included with the task. |
Example
from prodigy.components.loaders import JSONL, JSON, CSV, TXT, Images, ImageServer
jsonl_stream = JSONL("path/to/file.jsonl")
json_stream = JSON("path/to/file.json")
csv_stream = CSV("path/to/file.csv", delimiter=",")
txt_stream = TXT("path/to/file.txt")
img_stream = Images("path/to/images")
img_stream2 = ImageServer("path/to/images")
Example
prodigy ner.manual your_dataset en_core_web_sm /tmp/your_data.dump --loader txt --label PERSON,ORG
Media loader APIs
The Images, ImageServer, Audio and AudioServer loaders all follow the same API and accept a path to a directory of files and optional file extensions.
Argument | Type | Description |
---|---|---|
f | str | Path to directory of files. |
file_ext | tuple | New: 1.10 File extensions to load. All other files in the directory will be ignored. See the default media file extensions below. |
YIELDS | dict | The annotation tasks with the loaded data. |
As of v1.10, Prodigy also exposes more generic Base64 and Server loaders that can be used to implement loading for other file types.
Argument | Type | Description |
---|---|---|
f | str | Path to directory of files. |
input_key | str | The key of the task dict to assign the string or URL to, e.g. "image" or "audio" . |
file_ext | tuple | File extensions to consider. If None (default), all files in the directory will be loaded. |
YIELDS | dict | The annotation tasks with the loaded data. |
Default media file extensions
Loader | Extensions |
---|---|
Images | (".jpg", ".jpeg", ".png", ".gif", ".svg") |
Audio | (".mp3", ".m4a", ".wav") |
Video | (".mpeg", ".mpg", ".mp4") |
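The base64 directory loaders can be sketched with the stdlib alone – an illustrative version, not Prodigy’s own code:

```python
import base64
import mimetypes
from pathlib import Path

def load_media_dir(path, input_key, file_ext=None):
    # Illustrative sketch of a base64 directory loader: encode each matching
    # file as a data URI and assign it to the given task key.
    for file_path in sorted(Path(path).iterdir()):
        if file_ext and file_path.suffix not in file_ext:
            continue  # skip files with other extensions
        mime = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(file_path.read_bytes()).decode("utf-8")
        yield {input_key: f"data:{mime};base64,{encoded}",
               "meta": {"file": file_path.name}}
```

For example, `load_media_dir("path/to/images", "image", file_ext=(".jpg", ".png"))` would yield tasks with an "image" key, much like the Images loader described above.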
Fetching images from local paths and URLs
You can also use the fetch_media preprocessor to replace all local paths and URLs in your stream with base64 data URIs. The skip keyword argument lets you specify whether to skip invalid files that can’t be converted (for example, because the path doesn’t exist or the URL can’t be fetched). If set to False, Prodigy will raise a ValueError if it encounters invalid files.
from prodigy.components.preprocess import fetch_media
stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_media(stream, "image", skip=True)
Corpus loaders
Prodigy also supports converting data from popular datasets and corpora.
ID | Component | Description |
---|---|---|
reddit | Reddit | Stream in examples from a file of the Reddit corpus. Will extract, clean and validate the comments. |
Example
from prodigy.components.loaders import Reddit
reddit_stream = Reddit("path/to/reddit.bz2")
Loading from existing datasets New: 1.10
The dataset: syntax lets you specify an existing dataset as the input source. Prodigy will then load the annotations from the dataset and stream them in again. Annotation interfaces respect pre-defined annotations and will pre-select them in the UI. This is useful if you want to re-annotate a dataset to correct it, or if you want to add new information with a different interface. The following command will stream in annotations from the dataset ner_data and save the resulting re-annotated data in a new dataset, ner_data_new:
Example
prodigy ner.manual ner_data_new blank:en dataset:ner_data --label PERSON,ORG
Optionally, you can also add another : plus the value of the answer to load if you only want to load examples with specific answers like "accept" or "ignore". For example, you may want to re-annotate difficult questions you previously skipped by hitting ignore. Similarly, if you’re using rel.manual to assign relations to pre-annotated spans, you typically only want to load in accepted answers.
Example
prodigy rel.manual ner_data blank:en dataset:ner_data:accept --label SUBJECT,OBJECT
Loading from standard input
If the source argument on the command line is set to -, Prodigy will read from sys.stdin. This lets you pipe data forward. If you’re loading data in a different format, make sure to set the --loader argument on the command line so Prodigy knows how to interpret the incoming data.
cat ./your_data.jsonl | prodigy ner.manual your_dataset en_core_web_sm - --loader jsonl
Loading text files from a directory or custom format
A custom loader should be a function that loads your data and yields dictionaries in Prodigy’s JSON format. If you’re writing a custom recipe, you can implement your loading in your recipe function:
recipe.py (pseudocode)
@prodigy.recipe("custom-recipe-with-loader")
def custom_recipe_with_loader(dataset, source):
    stream = load_your_source_here(source)  # implement your custom loading
    return {"dataset": dataset, "stream": stream, "view_id": "text"}
Using custom loaders with built-in recipes
If you want to use a built-in recipe like ner.manual but load in data from a custom source, there’s usually no need to copy-paste the recipe script only to replace the loader. Instead, you can write a loader script that outputs the data, and then pipe that output forward. If the source argument on the command line is set to -, Prodigy will read from sys.stdin:
python load_data.py | prodigy ner.manual your_dataset en_core_web_sm -
All your custom loader script needs to do is load the data somehow, create annotation tasks in Prodigy’s format (e.g. a dictionary with a "text" key) and print the dumped JSON.
load_data.py (pseudocode)
from pathlib import Path
import json

data_path = Path("/path/to/directory")
for file_path in data_path.iterdir():  # iterate over directory
    with file_path.open("r", encoding="utf8") as lines:  # open file
        for line in lines:
            task = {"text": line.strip()}  # create one task for each line of text
            print(json.dumps(task))  # dump and print the JSON
This approach works for any file format and data type – for example, you could also load in data from a database or via an API. For extra convenience, you can also wrap your loader in a custom recipe and have Prodigy take care of adding the command-line interface. If a custom recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will just execute the code.
load_data.py (pseudocode)
@prodigy.recipe("load-data")  # add argument annotations and shortcuts if needed
def load_data(dir_path):
    ...  # the loader code here
You can then use your custom loader like this:
prodigy load-data /path/to/directory -F load_data.py | prodigy ner.manual your_dataset en_core_web_sm -
Hashing and deduplication
When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are uint32 values, so they can be stored as JSON with each task. Based on those hashes, Prodigy is able to determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input. For more details on how the hashes are generated and how to set custom hashes, see the set_hashes docs.
Hash | Type | Description |
---|---|---|
_input_hash | uint32 | Hash representing the input that annotations are collected on, e.g. the "text" , "image" or "html" . Examples with the same text will receive the same input hash. |
_task_hash | uint32 | Hash representing the “question” about the input, i.e. the "label" , "spans" or "options" . Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes. |
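The idea can be illustrated with a stdlib sketch – this is not Prodigy’s actual hashing (use set_hashes for that), just a demonstration of hashing the input keys alone versus the input plus the “question” keys, truncated to a uint32 value:

```python
import hashlib
import json

def make_hash(task, keys):
    # Hash the selected task keys, truncated to a uint32 value.
    data = json.dumps({k: task.get(k) for k in sorted(keys)}, sort_keys=True)
    return int(hashlib.md5(data.encode("utf-8")).hexdigest(), 16) % (2 ** 32)

task_a = {"text": "apple", "label": "FRUIT"}
task_b = {"text": "apple", "label": "COMPANY"}
# The input hash covers only the input; the task hash also covers the question.
input_hashes = [make_hash(t, ["text"]) for t in (task_a, task_b)]
task_hashes = [make_hash(t, ["text", "label"]) for t in (task_a, task_b)]
```

Both tasks share the same text, so they get the same input hash, but their different label suggestions produce different task hashes.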
As of v1.9, recipes can return an optional "exclude_by" setting in their "config" to specify whether to exclude by "input" or "task" (default). Filtering and excluding by input hash is especially useful for manual and semi-manual workflows like ner.manual and ner.correct. If you’ve already annotated an example and it comes in again with suggestions from a model or pattern, Prodigy will correctly determine that it’s a different “question”. However, unlike in the binary workflows, you typically don’t want to see the example again, because you already created a gold-standard annotation for it.
Live APIs
API loaders are similar to file format loaders, but stream in content via a web API – for example, news headlines or teasers for a topic or from a specific publication, or images for a search term or related tags. Individual APIs differ in the type of content they provide and their respective rate limit restrictions. All APIs supported by Prodigy come with a free license option and should provide sufficient rate limits for use on a single machine.
API loaders are available via prodigy.components.loaders, and using them requires an entry for the loader ID in the "api_keys" section of your prodigy.json. The value of the source argument on the command line is used as the API query.
ID | Loader | Description |
---|---|---|
nyt | NewYorkTimes | The New York Times API. |
guardian | Guardian | The Guardian API. |
zeit | Zeit | Die Zeit API (German). |
newsapi | NewsAPI | News API |
twitter | Twitter | Twitter API. Requires the API key to be a dict with consumer_key , consumer_secret , access_token and access_token_secret . |
tumblr | Tumblr | Tumblr API. Returns images. |
github | GitHub | GitHub API. Doesn’t require an API key. |
unsplash | Unsplash | Unsplash API. Returns images. |