Used to generate graph data based on a Gen3 data dictionary.
It is sometimes necessary to create simulated data when obtaining real data is impractical. Simulated data can be used for building models or running services over datasets that contain protected information or are unavailable for legal reasons. The functions in this simulation suite allow a user to:
- Simulate and validate data
- Organize simulated data by nodes in a data model and export to JSON for easy upload.
Data simulator contains various commands to help simulate, test, and validate data dictionaries. These commands are generally accessed via `data-simulator`. However, if you are not managing your own virtual environment externally, you may need to prepend `poetry run` to your commands, as described in the poetry documentation. Additionally, make sure you use data-simulator with the most recent release of our services in order to ensure expected behavior. In the examples below, we use `bhcdictionary`, which, at the time of writing, is on release 3.1.1.
This command validates a data dictionary:
data-simulator validate --url https://s3.amazonaws.com/dictionary-artifacts/bhcdictionary/<release_version>/schema.json
Required arguments:
- url: s3 dictionary link
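Validation requires the dictionary at the given URL to be reachable and parseable. As a minimal illustration of the kind of check involved — not the tool's actual validation logic — the sketch below loads a tiny, hypothetical schema fragment and confirms each node definition is an object type:

```python
import json

# A tiny, hypothetical schema fragment in the shape of a Gen3
# dictionary: node name -> JSON-schema-like definition.
schema_text = '''
{
  "case": {"type": "object", "properties": {"submitter_id": {"type": "string"}}},
  "sample": {"type": "object", "properties": {"case_id": {"type": "string"}}}
}
'''

schema = json.loads(schema_text)  # fails loudly on malformed JSON

# A minimal structural check: every node definition must declare type "object".
problems = [name for name, node in schema.items() if node.get("type") != "object"]
print(problems)  # [] when the fragment is well-formed
```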
Simulate data using a dictionary:
data-simulator simulate --url https://s3.amazonaws.com/dictionary-artifacts/bhcdictionary/<release_version>/schema.json --path ./tests/TestData --program DEV --project test
Required arguments:
- url: s3 dictionary link
- path: path to save files to
- program
- project
Optional arguments:
- max_samples: maximum number of instances for each node. Default is 1
- required_only: only simulate required properties
- random: randomly generate the number of node instances (up to max_samples). If this argument is not used, all nodes have max_samples instances
- node_num_instances_file ./file.json: generate the number of node instances specified in the JSON file. The file should map each node name to the number of instances (integer) to generate, for example: {"submitted_unaligned_reads": 100}. For nodes not specified in the file, max_samples instances are generated
- consent_codes: whether to include generation of random consent codes
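The per-node instance counts file is plain JSON mapping node names to integers. A minimal sketch of writing one programmatically (the node names and counts here are hypothetical examples):

```python
import json

# Hypothetical node names and counts; real node names come from
# your data dictionary's schema.
instance_counts = {
    "submitted_unaligned_reads": 100,
    "case": 25,
}

# Write the file that --node_num_instances_file expects.
with open("file.json", "w") as f:
    json.dump(instance_counts, f)

# Nodes absent from this mapping fall back to max_samples instances.
```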
Generate a submission order given a node name and a dictionary
data-simulator submission_order --url https://s3.amazonaws.com/dictionary-artifacts/bhcdictionary/<release_version>/schema.json --node_name case --path ./data-simulator/sample_test_data
Required arguments:
- url: s3 dictionary link
- path: path to save file to
Optional arguments:
- node_name: node to generate the submission order for. By default, the command selects a random data file node
- skip: skip raising an exception when an error occurs
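A submission order must respect parent links in the graph: a node can only be submitted after every node it references. Conceptually this is a topological sort. A minimal sketch over simplified, hypothetical parent links (not the tool's actual implementation):

```python
# Simplified parent links: node -> list of parent nodes that must be
# submitted first. These example links are hypothetical.
links = {
    "program": [],
    "project": ["program"],
    "case": ["project"],
    "submitted_unaligned_reads": ["case"],
}

def submission_order(target, links):
    """Return an order ending at `target` in which every node
    appears after all of its parents (depth-first topological sort)."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for parent in links[node]:
            visit(parent)
        order.append(node)

    visit(target)
    return order

print(submission_order("submitted_unaligned_reads", links))
# ['program', 'project', 'case', 'submitted_unaligned_reads']
```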
Submit the data via the Sheepdog API:
data-simulator submitting_data --host http://devplanet.planx-pla.net --project DEV/test --dir ./data-simulator/sample_test_data --access_token_file ./token --chunk_size 10
Required arguments:
- dir: path containing data
- host
- project: program name and project code separated by a forward slash
- access_token_file
Optional arguments:
- chunk_size: default is 1
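chunk_size controls how many records are sent per request. A sketch of the chunking idea (not the tool's actual submission code):

```python
def chunks(records, chunk_size):
    """Yield successive slices of at most chunk_size records."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

# 25 hypothetical records at chunk_size 10 -> batches of 10, 10, and 5,
# i.e. three requests instead of twenty-five.
records = [{"id": n} for n in range(25)]
batches = list(chunks(records, 10))
```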
Poetry needs to be installed before installing the data simulator; follow https://python-poetry.org/docs/#installation to install poetry.
To install the data simulator, run the following command:
poetry install -vv
To run the tests, run the following commands:
poetry install -vv
poetry run pytest -vv ./tests