This document describes the artifacts and user commands involved in the life cycle of a DeepDive application.
A DeepDive application involves three broad groups of artifacts.
First of all, DeepDive application code defines the data schema, the data transformations and dependencies between them, and how the transformed data maps to the statistical model. DeepDive distinguishes source code from compiled executable code.
The source code refers to the rules written by the user in DDlog and DeepDive configuration syntax as well as the UDF programs written in Python or the user's language of choice.
DDlog code is expected to be kept in `app.ddlog`, and all configuration blocks must be present in `deepdive.conf`. UDF code is expected to be kept under `udf/`.
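For instance, a minimal application directory might look as follows; only `app.ddlog`, `deepdive.conf`, and the `udf/` directory are prescribed here, and the UDF file name is a hypothetical example:

```bash
$ find . -maxdepth 2 -not -path '*/.*'
./app.ddlog                 # DDlog rules: schema, transformations, model mapping
./deepdive.conf             # configuration blocks
./udf
./udf/extract_mentions.py   # a UDF written in Python (hypothetical name)
```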
In order to run the application, DeepDive compiles the user's source code into code that is readily executable, such as shell scripts and Makefiles.
The compiled code is kept under `run/process/`.
DeepDive applications transform input data to construct a machine learning model.
A DeepDive application has a collection of input data that falls into the following two classes:
- Raw unstructured/semi-structured data to extract structured data from, such as text corpora, tables, diagrams, and images;
- Structured data used by the code to drive the extraction, such as dictionaries, ontologies, and existing (incomplete) knowledge bases.
In either case, the serialized form of the input data, or the executable that generates it, is expected to be kept under `input/`.
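For example, both kinds of input can sit side by side under `input/` (the file names below are hypothetical):

```bash
input/articles.tsv.bz2   # raw text corpus, serialized for loading
input/dictionary.tsv     # a dictionary that drives the extraction
input/kb.sh              # an executable emitting tuples from an existing KB
```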
DeepDive assumes that all processed data, as well as the input data, is accessible through a relational database. Whether the data is stored in an actual RDBMS is not important; what matters is that all data DeepDive touches has a clear relational schema. The current working database is where all input data is read from and all processed data is stored. All data transformations and model generation driven by the code mutate the current working database.
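For instance, the current working database can be queried directly through DeepDive's SQL passthrough; the relation name `articles` below is an assumption:

```bash
$ deepdive sql "SELECT COUNT(*) FROM articles"
```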
The last modified timestamp of every relation in the current working database is kept under `run/data/`.
DeepDive provides a way to record the state of the current working database as a snapshot and to switch back to any recorded point in time. Snapshots rely heavily on PostgreSQL's "schema" support, so other database drivers may not support snapshots efficiently and can fall back to full backups and restores.
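The following sketch only illustrates the underlying schema-renaming mechanism on PostgreSQL, not DeepDive's actual snapshot commands; the snapshot name is hypothetical:

```bash
# Rename the live schema away to freeze its contents, then start fresh.
$ deepdive sql "ALTER SCHEMA public RENAME TO snapshot_v1"
$ deepdive sql "CREATE SCHEMA public"
```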
A DeepDive application ultimately constructs a machine learning model.
Similar to the current working database, a DeepDive application has a current working model to which all relevant operations are applied, e.g., grounding, learning/training, and inference/prediction/testing.
Similar to database snapshots, the current working model can be saved whenever it needs to be kept for later use.
The `deepdive compile` command compiles the source code of a DeepDive application into executable code.
Most compile-time error checks against `deepdive.conf` and `app.ddlog` are done at this step. The compiled code is kept under `run/process/`.
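A typical invocation from the application's top-level directory:

```bash
# Recompile after editing app.ddlog or deepdive.conf;
# compile-time errors are reported at this point.
$ deepdive compile
```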
Each extractor defined either directly in `deepdive.conf` or implicitly in `app.ddlog` is compiled into an individually executable, standalone program that can be run under its own working directory. The dependencies between these programs are encoded as GNU Make rules and targets in `run/CURRENT/Makefile`, which also keeps track of their last executed timestamps.
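For example, a single compiled program and everything it depends on can be executed through DeepDive's build interface; the target name `sentences` is hypothetical:

```bash
# Runs the extractor producing the "sentences" relation, first bringing
# its out-of-date dependencies up to date.
$ deepdive do sentences
```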
Generates skeleton source code under `udf/` for functions (UDFs) declared in DDlog.
Checks the source code of a given function declared in DDlog against synthetic data.