Usage

Dependencies and Relations (new in v1.10)

The relations interface can be used for a wide range of classic natural language processing tasks, such as syntactic and semantic dependency parsing, coreference resolution, or discourse analysis. Relations can be directed or undirected, labelled or unlabelled, and anchored either by single words or phrases. Phrases can be recognised either in a preprocessing step, or jointly during the relations annotation.

Quickstart

The dep.correct recipe lets you stream in the model’s predictions and correct them if needed. You can either annotate all available dependency labels, or focus on a subset of them that you care most about for your specific application. spaCy can be updated with complete parses, as well as incomplete annotations.

Once you’ve created a dataset, you can use the train recipe to update the existing model with the annotations. You can also use the data-to-spacy command to convert your annotations to JSON-formatted training data to use with spacy train.

Prodigy represents dependency annotations in a simple JSON format with a "text", a "relations" property describing the head and child indices and label of each dependency relation, and a list of "tokens". So you could extract the suggestions from your model in this format, and then use the mark recipe with --view-id relations to label the data exactly as it comes in.
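As a rough illustration of this format, the sketch below builds one task by hand and checks that every relation points at an existing token. The helper name check_relations is hypothetical and not part of Prodigy, and the example sentence is made up.

```python
# Hypothetical helper, not part of Prodigy: sanity-check a relations task.
task = {
    "text": "She runs fast",
    "tokens": [
        {"text": "She", "start": 0, "end": 3, "id": 0},
        {"text": "runs", "start": 4, "end": 8, "id": 1},
        {"text": "fast", "start": 9, "end": 13, "id": 2},
    ],
    "relations": [
        {"child": 0, "head": 1, "label": "nsubj"},
        {"child": 1, "head": 1, "label": "ROOT"},
        {"child": 2, "head": 1, "label": "advmod"},
    ],
}


def check_relations(task):
    """Return a list of problems found in the task's relations."""
    token_ids = {token["id"] for token in task["tokens"]}
    problems = []
    for rel in task["relations"]:
        # Both ends of a relation must refer to a token id that exists.
        for key in ("head", "child"):
            if rel[key] not in token_ids:
                problems.append(f"{key} index {rel[key]} has no token")
        if not rel.get("label"):
            problems.append("relation without a label")
    return problems


print(check_relations(task))
```

Running a check like this before starting the server can catch off-by-one errors in model output that would otherwise only surface in the annotation UI.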

You can also write a custom recipe with a custom stream (a regular Python generator!) to plug in your model. If you can load and run your model in Python, you can use it with Prodigy. See the section on custom models for an example. If you want to use active learning with a custom model, you can make your recipe return an update callback that’s called whenever a new batch of answers is sent back from the web app. However, there are a few more considerations here, like how sensitive your model is to small updates.
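A minimal sketch of such an update callback, assuming the callback receives a list of annotated task dictionaries that each carry an "answer" key. The DummyModel class and its update method are placeholders invented for illustration; substitute your own model's training call.

```python
class DummyModel:
    """Placeholder standing in for your own model (hypothetical)."""

    def __init__(self):
        self.seen = 0

    def update(self, examples):
        # A real model would run a training step here.
        self.seen += len(examples)


def make_update(model):
    """Build a callback that passes accepted answers on to the model."""

    def update(answers):
        # Only accepted examples are treated as training data in this sketch.
        accepted = [eg for eg in answers if eg.get("answer") == "accept"]
        if accepted:
            model.update(accepted)
        return len(accepted)

    return update


model = DummyModel()
update = make_update(model)
batch = [
    {"text": "a", "relations": [], "answer": "accept"},
    {"text": "b", "relations": [], "answer": "reject"},
]
update(batch)
print(model.seen)
```

In a recipe, the function built by make_update would be returned under the "update" key of the recipe's components. Whether rejected or ignored answers should also influence the model depends on your setup and is left out here.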

The rel.manual recipe lets you switch between two annotation modes: one for labeling/correcting entity spans, and one for defining relations between these spans and/or other tokens. You can also load in data that has been pre-annotated with entity spans in Prodigy’s format and annotate relations between them, or use a pretrained model to suggest entities for you. For details, see the section on annotating named entity relations.

The coref.manual recipe incorporates default settings for coreference annotation, such as using the relation label COREF, disabling all tokens that are not nouns, proper nouns or pronouns, and pre-highlighting named entities. You can customize the labels it uses to match your language and model. For details, see the section on annotating coreference relations.

Prodigy’s rel.manual recipe allows building very powerful custom workflows for semi-automated dependency and relation annotation, mixing manual span labelling and dependency attachment. You can also provide match patterns to pre-select and merge spans, or to disable tokens that you know are not going to be part of a relation you’re looking for.

The example in this usage guide focuses on annotating biomedical events from literature, a complex task with multiple annotation objectives using the BioNLP 2011 GENIA Shared Task annotation scheme. A similar strategy can be applied to a variety of custom domain use cases.


Choosing the right recipe and workflow

So you have a problem that requires data annotated with relationships between words and expressions and you want to get it done as efficiently as possible. But how do you pick the right workflow for your use case?

  1. Fully manual: This is the classic approach. You’re shown all tokens in the text and you annotate labelled relations between them by clicking on them. The biggest challenge here is to prevent the process from getting too messy and tedious (and as a result, slower and more error-prone). Unless your goal is to create a new dependency treebank from scratch, you typically want to use at least some automation to merge phrases, disable irrelevant tokens or pre-label some of the data for you. In Prodigy, you can use the rel.manual recipe for manual relation annotation, or the more task-specific coref.manual with pre-defined configurations for coreference annotation.

  2. Manual with suggestions from model: If you already have a model that predicts something, you can use it to pre-label the data for you, and only correct its mistakes. This is especially useful for dependency parsing, where the data creation from scratch would otherwise be very tedious. Prodigy’s dep.correct workflow lets you stream in syntactic dependencies predicted by the model and correct them manually to create gold-standard data. You can also use a model with rel.manual to add named entities and noun phrases.


Dependency Parsing

If you already have a pretrained spaCy model with a parser and you want to improve it on your own data, you can use the built-in dep.correct recipe. You don’t have to annotate all labels at the same time: it can also be useful to focus on a smaller subset of labels that are most relevant for your application. The following command will start the web server, stream in headlines from news_headlines.jsonl and provide the label options ROOT (root of the sentence), csubj (clausal subject), nsubj (nominal subject), dobj (direct object) and pobj (object of a preposition). The available labels will vary depending on the label scheme the model was trained with.

Download news_headlines.jsonl

Recipe command

prodigy dep.correct deps_news en_core_web_sm ./news_headlines.jsonl --label ROOT,csubj,nsubj,dobj,pobj --update

In the annotation UI, you can now review the dependency parse and click on incorrectly predicted arcs to remove them, or add new dependencies by selecting the head token and then the child token to attach it to. The ROOT label is the only one that should be attached to itself. You can achieve this in the UI by double-clicking or double-tapping the token.
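The self-attachment convention for ROOT can be checked on exported annotations. The sketch below assumes relations are stored as dictionaries with "head", "child" and "label" keys, as described in this guide; the function names are hypothetical.

```python
def find_roots(relations):
    """Return token indices whose ROOT relation points back at themselves."""
    return [
        rel["child"]
        for rel in relations
        if rel["label"] == "ROOT" and rel["head"] == rel["child"]
    ]


def misattached_roots(relations):
    """Return ROOT relations whose head and child differ."""
    return [
        rel
        for rel in relations
        if rel["label"] == "ROOT" and rel["head"] != rel["child"]
    ]


relations = [
    {"child": 0, "head": 1, "label": "nsubj"},
    {"child": 1, "head": 1, "label": "ROOT"},
    {"child": 2, "head": 3, "label": "amod"},
    {"child": 3, "head": 1, "label": "dobj"},
]
print(find_roots(relations))
print(misattached_roots(relations))
```

For a single-sentence example you would typically expect exactly one entry from find_roots and none from misattached_roots.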

When you’re done with annotating, you can use the train recipe with the component parser to train a dependency parser, or use the data-to-spacy command to export JSON-formatted training data. You can also use db-out to export data in Prodigy’s JSON format and use it in a different process.
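If you process an export in a different tool, a first step is often to see which labels were actually annotated. The sketch below assumes the export contains one JSON object per line and that each example has an "answer" key with the value "accept" for accepted tasks; the in-memory sample stands in for a real exported file.

```python
import io
import json
from collections import Counter


def count_relation_labels(lines):
    """Count relation labels across accepted examples in a JSONL stream."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        # Skip rejected or ignored examples.
        if eg.get("answer") != "accept":
            continue
        for rel in eg.get("relations", []):
            counts[rel["label"]] += 1
    return counts


# Stand-in for an exported file; with real data you would open the file instead.
sample = io.StringIO(
    '{"text": "a b", "relations": [{"head": 1, "child": 0, "label": "nsubj"}], "answer": "accept"}\n'
    '{"text": "c d", "relations": [{"head": 1, "child": 0, "label": "dobj"}], "answer": "reject"}\n'
)
print(count_relation_labels(sample))
```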

Training command

prodigy train parser deps_news en_core_web_lg --output ./model

Named Entity Relations

You might already know Prodigy’s features for annotating training data for named entity recognition. Using a workflow like ner.manual, you can stream in your data and highlight entity spans for a given set of labels. For example, here we’re labelling PERSON and GPE (geopolitical entity):

Recipe command

prodigy ner.manual ner_rels_ent blank:en ./data.jsonl --label PERSON,GPE

While spans can capture a lot of important information – like the concepts that are mentioned in the text – they can’t always capture relationships between them. This requires another layer of data that defines two words or phrases, typically a “head” and a “child”, and a label specifying the type of relationship. The rel.manual recipe allows you to stream in data that’s already pre-annotated with named entities. In this case, we’re setting dataset:ner_rels_ent, which will load the previously annotated data from the dataset ner_rels_ent. Entities annotated in this dataset will be shown as a merged unit, and we can assign relations between them and other tokens.

Recipe command

prodigy rel.manual ner_rels_dep blank:en dataset:ner_rels_ent --label SUBJECT,LOCATION --wrap

Instead of loading in a pre-annotated dataset, you can also use an existing pretrained model to add entities for you. Here we’re using the en_core_web_sm model and the original raw input data, and setting the --add-ents flag to include entities found in the text. For more options and how to add custom preprocessing, see the section on custom relations.

Recipe command

prodigy rel.manual ner_rels_dep en_core_web_sm ./data.jsonl --label SUBJECT,LOCATION --add-ents --wrap

Joint entity and relation annotation

For some use cases, it makes sense to do entity and relation annotation at the same time. That’s especially true if the annotation decision for both spans and relations requires the same thought process, or if it’s difficult to separate both tasks. In that case, you can pass an additional --span-label argument to rel.manual defining the entity labels to assign. The interface now has two modes: the relation annotation mode to connect tokens and spans, and the span annotation mode to manually highlight and edit spans. To add a span, click and drag across the tokens, or hold down shift and click on the start and end token.

Recipe command

prodigy rel.manual ner_rels blank:en ./data.jsonl --label SUBJECT,LOCATION --span-label PERSON,GPE
  1. Choose the span highlighting mode by clicking the button. This will let you manually highlight spans or remove existing spans.

  2. Drag across the token “Obama” or click on it to assign it the label PERSON. Then select the label GPE (geopolitical entity) at the top and do the same for “Hawaii” and “New York”. To select multiple tokens, drag across them and they will turn green, indicating the selection is valid. If you make a mistake, click on the span and then on the button to remove it.

  3. Choose the relations mode by clicking the button. This will let you select tokens or spans and assign relations to them.

  4. Click “Obama” and then “born” to assign it the relation SUBJECT and do the same for “Obama” and “studied”. Then select the label LOCATION at the top and connect “born” and “Hawaii” and then “studied” and “New York”.

If you have a pretrained model that already predicts something, you can also set the --add-ents flag to pre-highlight entities suggested by the model for you. You can then delete incorrect spans or change their label and add missing spans if needed.

Recipe command

prodigy rel.manual ner_rels blank:en ./data.jsonl --label SUBJECT,LOCATION --span-label PERSON,GPE --wrap --add-ents

Coreference Resolution

Coreference resolution is the challenge of linking ambiguous mentions such as “her” or “that woman” back to an antecedent providing more context about the entity in question. You can use the built-in coref.manual recipe to manually create such links. This recipe allows you to focus on nouns, proper nouns and pronouns specifically, by disabling all other tokens. The following command will start the web server, stream in movie summaries from plot_summaries.jsonl and provide the label COREF to annotate coreference relations.

Download plot_summaries.jsonl

Recipe command

prodigy coref.manual coref_movies en_core_web_sm ./plot_summaries.jsonl --label COREF

The recipe will use the model to automatically detect potential candidates for a coreference relationship. You can customize the labels used for the extraction via the recipe arguments to match the model you’re using. You can also set up your very own custom relation annotation workflow by defining custom rules for spans and disabled tokens.

To annotate, click a word or phrase and then the word or phrase you want to connect it to. To remove an existing relationship, you can click its label. In the above example, two coreference relationships are already annotated: “her” → “Lindy” and “she” → “Lindy”. Other mentions of “she” and “her” in the sentence should be carefully annotated as referring back to either “Lindy” or “Azaria”. Each relation you annotate will be saved as an entry under the key "relations".

Single relation (example)
{
   "head": 8,
   "head_span": {"start": 38, "end": 41, "token_start": 8, "token_end": 8, "label": null},
   "child": 0,
   "child_span": {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "PERSON"},
   "label": "COREF"
}

Prodigy will record the direction of the relationship, from the “head” to the “child”. This is relevant for many tasks like syntactic dependency annotation, but less relevant for tasks like coreference resolution, where you mostly care about pairs of coreferent mentions. For this use case, you can treat the "head" and "child" values of the relation as interchangeable and consider them as one coreference pair.
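One way to treat the pairs as undirected is to merge them into clusters of mentions that refer to the same entity. The sketch below uses a small union-find structure over token indices; the function name coref_clusters and the token indices in the example are made up for illustration.

```python
def coref_clusters(relations):
    """Group token indices into clusters, ignoring relation direction."""
    parent = {}

    def find(x):
        # Register unseen tokens as their own cluster root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for rel in relations:
        if rel["label"] != "COREF":
            continue
        root_a, root_b = find(rel["head"]), find(rel["child"])
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for token in parent:
        clusters.setdefault(find(token), set()).add(token)
    return sorted(sorted(cluster) for cluster in clusters.values())


relations = [
    {"head": 8, "child": 0, "label": "COREF"},
    {"head": 12, "child": 0, "label": "COREF"},
    {"head": 20, "child": 15, "label": "COREF"},
]
print(coref_clusters(relations))
```

Swapping "head" and "child" in any of the input relations gives the same clusters, which is the sense in which the two values are interchangeable for this task.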


Custom dependencies and relations

Prodigy’s rel.manual recipe allows building very powerful custom workflows for semi-automated dependency and relation annotation. It’s based on the following philosophy:

  1. Dependencies should refer to consistent units. For example, relations might refer to named entities predicted by a named entity recognizer, or noun phrases extracted using part-of-speech tags and syntactic dependency labels. You shouldn’t have to ask your annotator to label all of this from scratch, if you can automate it – instead, they should only have to correct mistakes. Using a pretrained model and rules to pre-highlight spans to annotate can make data creation faster and more consistent.

  2. Not all tokens are relevant. For instance, for many tasks, punctuation (outside of entities) is never going to be part of a relation you’re annotating. If you’re annotating nominal coreference, you only need to focus on nouns, proper nouns and pronouns. Or maybe you only want to annotate relations between entity spans and ignore all other tokens. Disabling irrelevant tokens automatically lets you and your annotators focus on what matters, speeds up the process and prevents mistakes.

You can customize your workflow using the following recipe settings:

--label              Relation label(s) to annotate manually in relation annotation mode.
--span-label         Span label(s) to annotate manually in span annotation mode.
--patterns           Match patterns defining spans to be added.
--disable-patterns   Match patterns defining tokens to disable.
--add-ents           Add entities predicted by the model as spans.
--add-nps            Add noun phrases based on tagger and parser, if rules are available.

Example: Custom biomedical relation annotation

Annotating biomedical events from literature is a complex task, and serves as a good example for the relation annotation functionality. Here, we follow the annotation scheme from the BioNLP 2011 GENIA Shared Task, which has been the foundation of many bio-event extraction algorithms in the last decade and has become a de facto standard. The annotation process involves the following:

  • Annotate spans of one or more tokens describing genes and gene products (“GGPs”). In the original Shared Task, these were provided as gold annotations.
  • Identify trigger words or spans like “stabilizes” referring to a positive regulation. There are 9 different relation/event types: gene expression, transcription, protein catabolism, phosphorylation, localization, binding, regulation, positive regulation and negative regulation.
  • Connect trigger words to at least the object of the event, also called “theme”, which is usually a GGP. Binding events can have multiple theme annotations.
  • Connect regulation events to the subject of the event, also called “cause”, if available. For the regulation events, both the “theme” and “cause” arguments can be GGPs or other events, thus allowing a nested structure of events.
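The constraints in this list can be expressed as a simple consistency check over annotated relations. The sketch below is an illustration under several assumptions that are not confirmed by the annotation scheme itself: the trigger is stored as the "head" and the argument as the "child" of each relation, only binding events may have more than one Theme, Cause is only valid on regulation events, and event types use the abbreviated span labels Bind, Reg, Reg+ and Reg-. The function name check_events and the token indices are hypothetical.

```python
REGULATION_TYPES = {"Reg", "Reg+", "Reg-"}


def check_events(triggers, relations):
    """Check Theme/Cause attachments.

    triggers: dict mapping a trigger's token index to its event type label
    relations: list of {"head": trigger index, "child": argument index, "label": ...}
    """
    problems = []
    themes = {idx: 0 for idx in triggers}
    for rel in relations:
        head = rel["head"]
        if head not in triggers:
            problems.append(f"relation head {head} is not a trigger")
            continue
        if rel["label"] == "Theme":
            themes[head] += 1
        elif rel["label"] == "Cause" and triggers[head] not in REGULATION_TYPES:
            problems.append(f"Cause on non-regulation trigger {head}")
    for idx, count in themes.items():
        if count == 0:
            problems.append(f"trigger {idx} has no Theme")
        elif count > 1 and triggers[idx] != "Bind":
            problems.append(f"trigger {idx} has {count} Themes but is not a binding event")
    return problems


# Illustrative three-token example: token 1 is a Reg+ trigger.
triggers = {1: "Reg+"}
relations = [
    {"head": 1, "child": 0, "label": "Cause"},
    {"head": 1, "child": 2, "label": "Theme"},
]
print(check_events(triggers, relations))
```

Because the child of a relation may itself be a trigger, the same check also accepts nested events without any special handling.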

For this example, we have prepared a sample of 200 sentences in bio_events.jsonl, taken from the Shared Task. As the Shared Task came with gold-standard annotations of genes and proteins, we have already added those GGP spans to the input text. We also know that we’re only interested in nouns, proper nouns, verbs and adjectives or other spans that have been pre-tagged as GGP. So we can write a disable pattern that disables all tokens that do not have those part-of-speech tags and are also not part of the pre-labelled GGP spans.

patterns_disable_bio_rel.jsonl
{"pattern": [{"POS": {"NOT_IN": ["NOUN", "PROPN", "VERB", "ADJ"]}, "_": {"label": {"NOT_IN": ["GGP"]}}}]}
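The logic of this pattern can be read as a conjunction: a token is disabled only if its part-of-speech tag is outside the allowed set and it is also not labelled GGP. The plain-Python sketch below mirrors that condition without using the matcher; the function name should_disable and the sample tokens with their tags are invented for illustration.

```python
KEEP_POS = {"NOUN", "PROPN", "VERB", "ADJ"}
KEEP_LABELS = {"GGP"}


def should_disable(pos, label):
    """Mirror the pattern: disable only if both NOT_IN conditions hold."""
    return pos not in KEEP_POS and label not in KEEP_LABELS


# (text, part-of-speech tag, pre-assigned span label or None)
tokens = [
    ("Mdmx", "PROPN", "GGP"),
    ("stabilizes", "VERB", None),
    ("the", "DET", None),
    ("-", "PUNCT", "GGP"),  # punctuation inside a GGP span stays enabled
]
for text, pos, label in tokens:
    print(text, should_disable(pos, label))
```

Only "the" is disabled here: the hyphen has a disallowed tag but is protected by its GGP label.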

The following command will start the web server, stream in the biomedical sentences from bio_events.jsonl, apply the disable rules, and provide a list of relevant span labels as well as the standard relations Cause (the subject of an event) and Theme (the object of an event).

Download bio_events.jsonl Download patterns_disable_bio_rel.jsonl

Recipe command

prodigy rel.manual rel_bio en_core_web_sm ./bio_events.jsonl --label Theme,Cause --span-label GGP,Gene_Expr,Transcr,Prot_Cat,Phosph,Loc,Bind,Reg,Reg+,Reg- --disable-patterns patterns_disable_bio_rel.jsonl --wrap
  1. The BioNLP ST’11 annotation scheme has “trigger words” that refer to the span of tokens that expresses a relation. For instance, “stabilizes” refers to a positive regulation. You can annotate it as such by going to the span annotation mode, selecting the label Reg+ and clicking on the token “stabilizes”. Similarly, annotate “enabling” as Reg+ and “activities” as Reg.

  2. Now select the relations mode by clicking the button. To annotate the first relation, select the relation type “Cause”, click on the trigger “stabilizes” to select it, then on the GGP “Mdmx” to define it as the subject (cause) of this event. Similarly, you can annotate “Mdm2” as being the object of this same event by also connecting it to the “stabilizes” trigger with the Theme relation type. This relation annotation style allows you to create nested events, as one event (e.g. “activities of p53”) could be the Theme of another event (“Mdm2 enabling it”).


In an end-to-end setting, you could predict the gene/protein mentions with an NER model trained specifically for this challenge, such as the models trained on scientific documents from the scispaCy project.


Using a custom model

You don’t need to use spaCy to let a model highlight suggestions for you. Under the hood, the concept is pretty straightforward: if you stream in examples with pre-defined "tokens", "relations" and optional "spans", Prodigy will accept and pre-highlight them. This means you can either stream in pre-labelled data, or write a custom recipe that uses your model to add tokens and relations to your data.

Expected format
{
  "text": "I like cute cats",
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0},
    {"text": "like", "start": 2, "end": 6, "id": 1},
    {"text": "cute", "start": 7, "end": 11, "id": 2},
    {"text": "cats", "start": 12, "end": 16, "id": 3}
  ],
  "relations": [
    {"child": 0, "head": 1, "label": "nsubj"},
    {"child": 1, "head": 1, "label": "ROOT"},
    {"child": 2, "head": 3, "label": "amod"},
    {"child": 3, "head": 1, "label": "dobj"}
  ]
}
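If your model does not return character offsets, the "tokens" list can be derived from the text. The sketch below uses naive whitespace splitting, which is an assumption made for illustration only: it reproduces the example above but will not match a real tokenizer on punctuation. The function name whitespace_tokens is hypothetical.

```python
def whitespace_tokens(text):
    """Build a "tokens" list with character offsets from whitespace splitting."""
    tokens = []
    offset = 0
    for i, word in enumerate(text.split()):
        # Search from the end of the previous token so repeated words
        # get their own offsets.
        start = text.index(word, offset)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        offset = end
    return tokens


for token in whitespace_tokens("I like cute cats"):
    print(token)
```

The relation indices refer to the "id" values, so the tokenization used here has to be the same one your model's head indices are based on.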

For example, let’s say your model returns the syntactic dependencies as a list of token-based tags like ["nsubj", "ROOT", "amod", "dobj"] and head indices like [1, 1, 3, 1]. You could then generate your data like this and add a "head", "child" and "label" value for each relation:

Step 1: Write the stream generator (pseudocode)
def add_relations_to_stream(stream):
    custom_model = load_your_custom_model()
    for eg in stream:
        deps, heads = custom_model(eg["text"])
        eg["relations"] = []
        for i, (label, head) in enumerate(zip(deps, heads)):
            eg["relations"].append({"child": i, "head": head, "label": label})
        yield eg
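To try the generator logic without a real parser, you can pass in a stand-in model. In the sketch below, stub_model is a hypothetical placeholder that always returns the tags and head indices from the example above, and the model is passed as an argument rather than loaded inside the function so that it can be swapped out.

```python
def stub_model(text):
    """Hypothetical stand-in for a real parser: returns fixed output."""
    return ["nsubj", "ROOT", "amod", "dobj"], [1, 1, 3, 1]


def add_relations(stream, model):
    """Add a "relations" list to each example, one relation per token."""
    for eg in stream:
        deps, heads = model(eg["text"])
        eg["relations"] = [
            {"child": i, "head": head, "label": label}
            for i, (label, head) in enumerate(zip(deps, heads))
        ]
        yield eg


stream = [{"text": "I like cute cats"}]
for eg in add_relations(stream, stub_model):
    print(eg["relations"])
```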

If you want to extract and add the dependencies at runtime, you can write a custom recipe that loads the raw data, uses your custom model to add "relations" to the stream, pre-tokenizes the text and then renders it all using the relations interface.

Step 2: Putting it all together in a recipe (pseudocode)
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
import spacy

@prodigy.recipe("custom-dep")
def custom_dep_recipe(dataset, source):
    stream = JSONL(source)                          # load the data
    stream = add_relations_to_stream(stream)        # add custom relations
    stream = add_tokens(spacy.blank("en"), stream)  # add "tokens" to stream

    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "labels": ["ROOT", "nsubj", "amod", "dobj"]  # labels to annotate
    }