Data production environment to handle multiple production cycles. It provides a file system structure and a set of Python scripts. Within each production cycle, data can be automatically generated using Snakemake and legend-dataflow.
Installation of Snakemake:
- source setup.sh to set some environment variables
- run prodenv-tools
Creation of a new production cycle:
- source setup.sh to set some environment variables
- run dataprod-init-cycle to initialize a new production cycle
- customize the config.json file in the production cycle directory
- check out a specific version of pygama, pyfcutils, legend-dataflow, legend-metadata
- run dataprod-install-sw to install the software in venv
- run snakemake to populate the multi-tier data structure
Workflow for existing production cycles:
- source setup.sh to set some environment variables
- customize pygama, pyfcutils, legend-dataflow-hades, legend-metadata
- run dataprod-install-sw to reinstall the software
- remove all files in gen/ and genpar/ that need to be reprocessed
- run snakemake to update the multi-tier data structure
$ source setup.sh
Source the setup.sh file located at the top level of the production environment. Sourcing the file will:
- set data production environment variables (the names of all variables start with PRODENV)
- add ./bin/ and ./tools/bin/ to PATH, making scripts and tools available from the command line
The content of the setup.sh file can also be copied to the user's shell configuration file.
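As a quick sanity check after sourcing setup.sh, the two effects listed above can be inspected from Python. This is a minimal sketch: the function name `prodenv_report` is hypothetical, and it only assumes what the text states (variable names start with PRODENV, and `bin` and `tools/bin` under the top-level directory are added to PATH).

```python
import os


def prodenv_report(environ, root):
    """Return (PRODENV variables, expected PATH entries that are missing)."""
    # Collect every variable whose name starts with PRODENV.
    prodenv_vars = {k: v for k, v in environ.items() if k.startswith("PRODENV")}
    # Directories that setup.sh is described as adding to PATH.
    expected = [os.path.join(root, "bin"), os.path.join(root, "tools", "bin")]
    on_path = {
        os.path.normpath(p)
        for p in environ.get("PATH", "").split(os.pathsep)
        if p
    }
    missing = [d for d in expected if os.path.normpath(d) not in on_path]
    return prodenv_vars, missing


if __name__ == "__main__":
    # Assumes the script is run from the top level of the production environment.
    variables, missing = prodenv_report(os.environ, os.getcwd())
    print("PRODENV variables:", sorted(variables))
    print("missing from PATH:", missing)
```

An empty "missing" list and a non-empty set of PRODENV variables suggest that setup.sh was sourced in the current shell.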
$ dataprod-init-cycle -h
usage: dataprod-init-cycle [-h] [-c] rpath
Initialize a new production cycle
positional arguments:
  rpath       relative path of directory in which the production cycle will be created
options:
  -h, --help  show this help message and exit
  -c          clone pygama and pylegendmeta
The only mandatory argument of the script is rpath, i.e. the path to the
production cycle directory. The script generates a file-system structure under
./rpath/ and, by default, it clones:
- legend-dataflow under ./rpath/dataflow
- legend-metadata under ./rpath/inputs
- pygama under ./prod-usr/prod_tag/src/python/pygama
- pyfcutils under ./prod-usr/prod_tag/src/python/pyfcutils
By default, all packages are downloaded from the legend-exp organization and
set to the main branch.
When the option -c is specified, pygama and pyfcutils are downloaded. The
path to the custom software directory is stored in config.json. The custom
directory will contain a pygama and pyfcutils folder.
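For readers who want to script around dataprod-init-cycle, the command-line interface shown in the help output can be mirrored with argparse. This is only a sketch reconstructed from the usage text, not the script's actual source; the attribute name `clone` and the example path `my-cycle` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented dataprod-init-cycle usage."""
    parser = argparse.ArgumentParser(
        prog="dataprod-init-cycle",
        description="Initialize a new production cycle",
    )
    # Optional flag, off unless given on the command line.
    parser.add_argument(
        "-c",
        dest="clone",
        action="store_true",
        help="clone pygama and pylegendmeta",
    )
    # The only mandatory argument.
    parser.add_argument(
        "rpath",
        help="relative path of directory in which the production cycle will be created",
    )
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["-c", "my-cycle"])
print(args.rpath, args.clone)
```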
The structure of the production cycle is:
.
├── config.json
├── dataflow
├── generated
│   ├── log
│   ├── par
│   ├── plt
│   ├── tier
│   └── tmp
├── inputs
└── software
- config.json contains paths to all main directories of the data production
- dataflow contains the Snakemake configuration files. This repository can be edited to modify the data flow
- generated and its subdirectories are created automatically during the data production
- software contains the software used for data production. Users can edit these repositories.
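The layout above can be reproduced in a scratch directory, for example to test tooling without running a real production. This is a minimal sketch and not what dataprod-init-cycle does internally: the function name `make_cycle_skeleton` is hypothetical, the `generated` subdirectories are created up front only to mirror the tree, and `config.json` is written as an empty placeholder because its schema is not described here.

```python
import json
import tempfile
from pathlib import Path

# Subdirectories of generated/ as shown in the tree above.
GENERATED_SUBDIRS = ("log", "par", "plt", "tier", "tmp")


def make_cycle_skeleton(root):
    """Create the directory layout of a production cycle under root."""
    root = Path(root)
    for name in ("dataflow", "inputs", "software"):
        (root / name).mkdir(parents=True, exist_ok=True)
    for sub in GENERATED_SUBDIRS:
        (root / "generated" / sub).mkdir(parents=True, exist_ok=True)
    # Placeholder only: the real config.json holds paths to the main directories.
    config = root / "config.json"
    if not config.exists():
        config.write_text(json.dumps({}, indent=2) + "\n")
    # Return all created entries as sorted POSIX-style relative paths.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for entry in make_cycle_skeleton(tmp):
            print(entry)
```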
$ dataprod-install-sw -h
usage: dataprod-install-sw [-h] [-r] config_file
Install user software in data production environment
positional arguments:
  config_file  production cycle configuration file
optional arguments:
  -h, --help   show this help message and exit
  -r           remove software directory before installing software
This script loads the container and pip-installs pygama and pyfcutils. The
option -r can be used to fully remove the installation directory before the
software is re-installed.
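The effect of the -r option described above can be illustrated with a few lines of Python. This is a sketch of the idea only (remove the directory, then start from an empty one), under the assumption that the installation directory is an ordinary folder; `prepare_install_dir` and `remove_first` are hypothetical names, not part of dataprod-install-sw.

```python
import shutil
from pathlib import Path


def prepare_install_dir(install_dir, remove_first=False):
    """Return an existing install directory, optionally wiped beforehand."""
    install_dir = Path(install_dir)
    if remove_first and install_dir.exists():
        # Corresponds to -r: drop everything from a previous installation.
        shutil.rmtree(install_dir)
    # Create the directory (and parents) if it is not there yet.
    install_dir.mkdir(parents=True, exist_ok=True)
    return install_dir
```

Without `remove_first`, files from an earlier installation are left in place, which is why a full reinstall may be preferable after changing software versions.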
$ dataprod-load-sw -h
usage: dataprod-load-sw [-h] config_file
Load data production environment
positional arguments:
  config_file  production cycle configuration file
optional arguments:
  -h, --help   show this help message and exit
It loads the container and all the software installed. Type exit to quit.
Data can be automatically produced through commands such as:
$ snakemake \
    --snakefile path-to-dataflow-dir/Snakefile \
    -j 20 \
    --configfile=path-to-cycle/config.json \
    all-B00000B-co_HS5_top_dlt-tier2.gen
Documentation on how to run snakemake is available at legend-dataflow.
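When the snakemake call is launched from a Python script rather than typed by hand, the same command line can be assembled as an argument list. This is a minimal sketch that only reuses the options shown in the example above; the helper name `snakemake_command` is hypothetical and the paths are the same placeholders as in the example.

```python
import shlex


def snakemake_command(dataflow_dir, config_file, target, jobs=20):
    """Build the snakemake argument list used in the example above."""
    return [
        "snakemake",
        "--snakefile", f"{dataflow_dir}/Snakefile",
        "-j", str(jobs),
        f"--configfile={config_file}",
        target,
    ]


cmd = snakemake_command(
    "path-to-dataflow-dir",
    "path-to-cycle/config.json",
    "all-B00000B-co_HS5_top_dlt-tier2.gen",
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the production environment has been loaded.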
Contact matteo.agostini@ucl.ac.uk for support and to report bugs.