Database
By default, Prodigy uses SQLite to store annotations in a simple database file in your Prodigy home directory. If you want to use the default database with its default settings, no further configuration is required and you can start using Prodigy straight away. Alternatively, you can choose to use Prodigy with MySQL or PostgreSQL, or write your own custom recipe to plug in any other storage solution.
Prodigy uses the peewee package to manage the database integration. This gives you a lot of flexibility in terms of setup and debugging, and allows you to use more advanced features via the Playhouse extension.
Prodigy uses the database to store all annotations by project and annotation session. Even if you’re only using Prodigy with a model in the loop, you’ll usually want a record of the collected annotations as a backup, or to use them as evaluation data. Prodigy is also very powerful without a model in the loop – for example to bootstrap word lists or collect feedback on the output of two models and generate evaluation data for machine translation or image classification.
You don't need to import your raw data into the database first: the database only exists to store collected annotations. You can read in the raw data straight from a file or a custom source. As your data comes in, Prodigy will assign hashes to the examples, so it'll always be able to tell whether an example has been annotated already or not.
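The idea behind the hashing can be pictured with a simplified sketch. This is an illustration only, not Prodigy's actual hashing implementation: the input hash is derived from the raw input (e.g. the text), while the task hash also takes the annotation-specific properties, like the label, into account.

```python
import hashlib
import json

def get_hash(example, keys):
    """Illustrative only: derive a stable ID from selected keys of an example."""
    data = json.dumps({key: example.get(key) for key in keys}, sort_keys=True)
    return int(hashlib.md5(data.encode("utf8")).hexdigest(), 16) % (10 ** 8)

task1 = {"text": "hello world", "label": "GREETING"}
task2 = {"text": "hello world", "label": "OTHER"}

# Same raw input, so the "input hashes" match ...
assert get_hash(task1, ["text"]) == get_hash(task2, ["text"])
# ... but different labels give different "task hashes"
assert get_hash(task1, ["text", "label"]) != get_hash(task2, ["text", "label"])
```

Two questions about the same text but with different labels are therefore distinct tasks, while still being recognizable as the same underlying input.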
If you have existing annotations, you can convert them to Prodigy's format for the given task and use the db-in command to import them to the database.
Setup
SQLite
SQLite is the default database option and stores all annotations in a flat database file. Unless otherwise specified, the database is created in the Prodigy home directory.
prodigy.json

```json
{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/custom/path"
    }
  }
}
```
| Setting | Description | Default |
| --- | --- | --- |
| `name` | Database file name. To only store the database in memory, use `:memory:`. | `prodigy.db` |
| `path` | Path to directory containing the database file. | `PRODIGY_HOME` |
| … | `sqlite3` connection parameters. | - |
MySQL
The settings for a MySQL database can take any MySQLdb or PyMySQL connection parameters, depending on your database driver.
prodigy.json

```json
{
  "db": "mysql",
  "db_settings": {
    "mysql": {
      "host": "localhost",
      "user": "username",
      "passwd": "xxx",
      "db": "prodigy"
    }
  }
}
```
PostgreSQL
The settings for a PostgreSQL database can take any psycopg2 connection parameters.
prodigy.json

```json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}
```
Permissions and user privileges
When setting up your database integration, you might not want to give your user permission to perform all operations. Prodigy uses the following operations, some of which are optional and only required for certain commands or initial setup:
| Operation | Required | Details |
| --- | --- | --- |
| `SELECT` | ✓ | Retrieving datasets and annotations. |
| `INSERT` | ✓ | Adding datasets and annotations. |
| `UPDATE` | ✓ | Updating datasets and annotations. |
| `DELETE` |  | Deleting datasets. Only used for the `prodigy drop` command, so permission can be omitted if you don't need this feature or prefer to delete records manually. |
| `CREATE` |  | Creation of tables `Dataset`, `Example` and `Link`. Not required if you create them manually. |
To test your database connection, you can also write a simple Python script that connects to Prodigy’s database and performs the most important operations. For more details, check out this thread on the forum.
test_database.py

```python
from prodigy.components.db import connect

examples = [{"text": "hello world", "_task_hash": 123, "_input_hash": 456}]

db = connect()                               # uses settings from prodigy.json
db.add_dataset("test_dataset")               # add dataset
assert "test_dataset" in db                  # check that dataset was added
db.add_examples(examples, ["test_dataset"])  # add examples to dataset
dataset = db.get_dataset("test_dataset")     # retrieve a dataset
assert len(dataset) == 1                     # check that examples were added
```
Tables
Here are the tables Prodigy creates and how they map to the annotations you collect. You typically shouldn’t have to interact with the database and its tables directly.
| Table | Description |
| --- | --- |
| `Dataset` | The dataset / session IDs and meta information. |
| `Example` | The individual annotation examples. Each example is only added once, so if you add the same annotation to multiple datasets, it'll only have one record here. |
| `Link` | Example IDs linked to datasets. This is how Prodigy knows which examples belong to which datasets and sessions. |
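The relationship between the tables can be sketched with plain `sqlite3` (a simplified stand-in for illustration, not Prodigy's actual peewee models): the link table implements a many-to-many relationship, so a single example record can belong to several datasets.

```python
import sqlite3

# Simplified stand-in for Prodigy's Dataset / Example / Link tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE, meta TEXT);
CREATE TABLE example (id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE link (example_id INTEGER REFERENCES example(id),
                   dataset_id INTEGER REFERENCES dataset(id));
""")
conn.execute("INSERT INTO dataset (name, meta) VALUES ('ner_news', '{}')")
conn.execute("INSERT INTO dataset (name, meta) VALUES ('ner_eval', '{}')")
# The annotation is stored once ...
conn.execute("INSERT INTO example (content) VALUES (?)", ('{"text": "hello"}',))
# ... and linked to both datasets
conn.executemany("INSERT INTO link VALUES (1, ?)", [(1,), (2,)])

n_examples = conn.execute("SELECT COUNT(*) FROM example").fetchone()[0]
n_links = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
assert (n_examples, n_links) == (1, 2)  # one record, two dataset links
```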
Importing existing data
To use existing annotations collected with other tools in Prodigy, you can import them via the db-in command. You can import data of all file formats supported by Prodigy. However, JSON or JSONL is usually recommended, as it gives you more flexibility. By default, all examples will be set to `"answer": "accept"`. You can specify a different answer using the `--answer` argument on the command line.
```
prodigy db-in new_dataset /path/to/data.jsonl
Imported 1550 annotations to 'new_dataset'.
```
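If your existing annotations come from another tool, a small conversion script is usually all you need before running db-in. The sketch below assumes a hypothetical export of `(text, label)` pairs for a text classification task; the exact fields depend on the annotation interface you're targeting.

```python
import json

# Hypothetical annotations exported from another tool
old_annotations = [("hello world", "GREETING"), ("goodbye", "FAREWELL")]

with open("data.jsonl", "w", encoding="utf8") as f:
    for text, label in old_annotations:
        # Minimal text classification format; the "answer" field could
        # also be left out and set via the --answer argument instead
        task = {"text": text, "label": label, "answer": "accept"}
        f.write(json.dumps(task) + "\n")
```

The resulting `data.jsonl` file can then be passed straight to db-in.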
Database class
Prodigy's database model is included with the library and is available via `prodigy.components.db`.

Example

```python
from prodigy.components.db import connect

db = connect()
all_dataset_names = db.datasets
examples = db.get_dataset("my_dataset")
```
connect function
Prodigy provides a simple connection helper that takes care of connecting to one of the built-in database options using the database ID and database settings. If no database config is provided, it will be read off the prodigy.json settings, and default to `'sqlite'` with the standard settings.
Argument | Type | Description |
---|---|---|
db_id | str | ID of database, i.e. 'sqlite' , 'postgresql' or 'mysql' . Defaults to 'sqlite' . |
db_settings | dict | Database-specific settings. If not provided, settings will be read off the prodigy.json . |
RETURNS | Database | The database. |
```python
from prodigy.components.db import connect

db = connect("sqlite", {"name": "my_db.db"})
```
Database.__init__ method
Argument | Type | Description |
---|---|---|
db | peewee.Database | peewee database. Will be available as the db attribute of the database. |
display_id | str | ID of database, e.g. 'sqlite' . Will be available as the db_id attribute of the database. For custom databases plugged in by the user, the ID will default to 'custom' . |
display_name | str | Human-readable name of the database, e.g. 'SQLite' . Will be available as the db_name attribute. For custom database modules, the display name defaults to the function name, class name or repr(db) . |
RETURNS | Database | The database. |
To plug in a custom database, you can initialize the Database class with a custom instance of peewee.Database or its extension package Playhouse, for example:

```python
import prodigy
from prodigy.components.db import Database
from playhouse.postgres_ext import PostgresqlExtDatabase

psql_db = PostgresqlExtDatabase("my_database", user="postgres")
db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")

@prodigy.recipe("recipe-with-custom-db")
def recipe_with_custom_db():
    return {"db": db}
```
Database.__len__ method
Argument | Type | Description |
---|---|---|
RETURNS | int | The number of datasets in the database. |
Database.__contains__ method
Argument | Type | Description |
---|---|---|
name | str | Dataset ID. |
RETURNS | bool | Whether the dataset exists in the database. |
Database.datasets property
Argument | Type | Description |
---|---|---|
RETURNS | list | A list of all dataset IDs in the database (excluding sessions). |
Database.sessions property
Argument | Type | Description |
---|---|---|
RETURNS | list | A list of all session IDs in the database (excluding user-created datasets). |
Database.reconnect method
Reconnect to the database. Called on API requests to avoid timeout issues, especially with MySQL. If the database connection is still open, it will be closed before reconnecting.
Database.close method
Close the database connection (if not already closed). Called after API requests to avoid timeout issues, especially with MySQL.
Database.get_dataset method
Argument | Type | Description |
---|---|---|
name | str | The dataset ID. |
RETURNS | list | A list of all examples in the dataset. |
Database.get_meta method
Argument | Type | Description |
---|---|---|
name | str | The dataset ID. |
RETURNS | dict | The meta data associated with the dataset. |
Database.get_examples method
Argument | Type | Description |
---|---|---|
ids | list | A list of task IDs. |
by | str | Select by field, defaults to task_hash . |
RETURNS | list | The examples with the given IDs. |
Database.get_input_hashes method
Argument | Type | Description |
---|---|---|
*names | list | Dataset IDs. |
RETURNS | set | A set of input hashes for examples of the given datasets. |
Database.get_task_hashes method
Argument | Type | Description |
---|---|---|
*names | list | Dataset IDs. |
RETURNS | set | A set of task hashes for examples of the given datasets. |
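These hash sets are typically used to skip examples that have already been annotated. A sketch of that pattern, assuming a stream of plain task dicts and a hash set fetched up front (not Prodigy's internal filtering code):

```python
def filter_seen(stream, task_hashes):
    """Yield only examples whose task hash isn't in the given set."""
    for eg in stream:
        if eg["_task_hash"] not in task_hashes:
            yield eg

stream = [{"text": "a", "_task_hash": 1}, {"text": "b", "_task_hash": 2}]
seen = {1}  # e.g. the result of db.get_task_hashes("my_dataset")
assert [eg["text"] for eg in filter_seen(stream, seen)] == ["b"]
```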
Database.add_dataset method
Argument | Type | Description |
---|---|---|
name | str | The dataset ID. |
meta | dict | The dataset meta. |
session | bool | Whether dataset is a session. |
RETURNS | list | The dataset. |
Database.add_examples method
Argument | Type | Description |
---|---|---|
examples | list | A list of annotated examples. |
datasets | list | A list of datasets to add the examples to. |
Database.link method
Argument | Type | Description |
---|---|---|
name | str | The dataset ID. |
example_ids | list | IDs of examples to link to the dataset. |
Database.unlink method
Argument | Type | Description |
---|---|---|
example_ids | list | IDs of examples to unlink. |
Database.drop_dataset method
Argument | Type | Description |
---|---|---|
name | str | The dataset ID. |
batch_size | int | Delete examples in batches of the given size. |
RETURNS | bool | Whether the dataset was successfully deleted. |
Implementing a custom database
Custom recipes let you return an optional "db" component. If it's not set, it will default to the database specified in your prodigy.json or to "sqlite". The database plugged in via a custom recipe can also be False (to not use any DB) or a custom class that follows Prodigy's Database API.
```python
@prodigy.recipe('custom-recipe')
def custom_recipe():
    return {'db': YourCustomDB}  # etc.
```
Essentially, all your custom class needs to do is expose methods to add and retrieve datasets and annotated examples. For instance:
```python
class YourCustomDB:
    def __init__(self, *args, **kwargs):
        # initialize your custom database
        ...

    def get_dataset(self, name, default=None):
        # get examples for a given dataset name
        ...

    # other methods and properties
```
How your database handler resolves those queries is entirely up to you. If your database class implements the methods reconnect and close, Prodigy will call those on each request to the REST API, allowing you to explicitly manage the connection and prevent timeouts between requests.
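A minimal in-memory stand-in covering the methods mentioned above might look like this. It's a sketch for illustration only; a real implementation would also cover the remaining Database methods documented above and actually persist its data.

```python
class InMemoryDB:
    """Toy in-memory database following a subset of Prodigy's Database API."""

    def __init__(self, *args, **kwargs):
        self._data = {}  # dataset name -> list of examples

    def __contains__(self, name):
        return name in self._data

    def add_dataset(self, name, meta=None, session=False):
        self._data.setdefault(name, [])

    def add_examples(self, examples, datasets):
        for name in datasets:
            self._data.setdefault(name, []).extend(examples)

    def get_dataset(self, name, default=None):
        return self._data.get(name, default)

    def reconnect(self):
        # called before each API request – nothing to do in memory
        pass

    def close(self):
        # called after each API request – nothing to do in memory
        pass

db = InMemoryDB()
db.add_examples([{"text": "hello"}], ["test"])
assert "test" in db and len(db.get_dataset("test")) == 1
```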