Scripts for submitting data to the ENA.
- Introduction
- Installation
- Usage
- Development using vagrant
- License
- Feedback/Issues
Bio-ENA-DataSubmission provides tools for generating, validating and submitting manifests to the ENA. The following scripts are included:
- generate_sample_manifest Generate a sample manifest for preparation of sample metadata updates at ENA
- validate_sample_manifest Validate the sample manifest: check that all compulsory fields are filled out, that the formatting is correct, and that the taxon ID and species name match
- compare_sample_metadata Compare sample manifest to the existing data on the ENA public records
- update_sample_metadata Convert the sample manifest to an XML file and send it to datahose for submission to the ENA
- generate_analysis_manifest Generate an analysis manifest for preparing genome assembly submissions to the ENA
- submit_analysis_objects_via_cli.pl Submit genome assemblies or annotated assemblies to the ENA
- validate_embl Run the ENA EMBL validator to check for issues with the EMBL file before submission
- extract_erz.sh Extract ERZ numbers out of the webin cli output
Bio-ENA-DataSubmission has the following dependencies:
Details for installing Bio-ENA-DataSubmission are provided below. If you encounter an issue when installing Bio-ENA-DataSubmission, please contact your local system administrator. If you encounter a bug, please log it on the issues page or email us at path-help@sanger.ac.uk.
Clone the repository:
git clone https://github.com/sanger-pathogens/Bio-ENA-DataSubmission.git
Move into the directory and install all dependencies using Dist::Zilla:
cd Bio-ENA-DataSubmission
dzil authordeps --missing | cpanm
dzil listdeps --missing | grep -v 'VRTrack::Lane' | cpanm
Run the tests:
dzil test
If the tests pass, install Bio-ENA-DataSubmission:
dzil install
The tests can be run with dzil from the top-level directory:
dzil test
To enable the end-to-end tests, set the environment variable ENA_SUBMISSIONS_E2E to any value. Then run:
dzil test
- Java needs to be installed to run webin cli
- The environment variable ENA_SUBMISSIONS_WEBIN_CLI should point to the webin cli jar
- The environment variable ENA_SUBMISSIONS_CONFIG should point to the general configuration of ENA submissions
- The environment variable ENA_SUBMISSIONS_DATA should point to the folder containing:
  - SRA.common.xsd
  - embl-client.jar
  - sample.xsd
  - submission.xml
  - submission.xsd
  - valid_countries.txt
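The requirements above can be checked before starting a test run. The following is a minimal sketch, not part of Bio-ENA-DataSubmission: the function name check_ena_env is hypothetical, and it only checks that the variables are set and that the listed files exist, not that their contents are valid.

```shell
# Preflight check for the end-to-end test environment (hypothetical helper).
# Prints one line per problem found and returns non-zero if any were found.
check_ena_env() {
    rc=0

    # Java is needed to run the webin cli jar.
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH"
        rc=1
    fi

    # Each environment variable must be set and non-empty.
    for var in ENA_SUBMISSIONS_WEBIN_CLI ENA_SUBMISSIONS_CONFIG ENA_SUBMISSIONS_DATA; do
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set"
            rc=1
        fi
    done

    # The webin cli variable should point to an existing jar file.
    if [ -n "${ENA_SUBMISSIONS_WEBIN_CLI:-}" ] && [ ! -f "$ENA_SUBMISSIONS_WEBIN_CLI" ]; then
        echo "webin cli jar not found: $ENA_SUBMISSIONS_WEBIN_CLI"
        rc=1
    fi

    # The data folder should contain every file listed above.
    if [ -n "${ENA_SUBMISSIONS_DATA:-}" ]; then
        for f in SRA.common.xsd embl-client.jar sample.xsd submission.xml submission.xsd valid_countries.txt; do
            if [ ! -f "$ENA_SUBMISSIONS_DATA/$f" ]; then
                echo "missing data file: $ENA_SUBMISSIONS_DATA/$f"
                rc=1
            fi
        done
    fi

    return "$rc"
}
```

For example, `check_ena_env && ENA_SUBMISSIONS_E2E=1 dzil test` would start the end-to-end tests only once the check passes.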
The following scripts are included in Bio-ENA-DataSubmission.
Usage: generate_sample_manifest [options]
-t|type lane|study|file|sample
--file_id_type lane|sample define ID types contained in file. default = lane
-i|id lane ID|study ID|file of lane IDs|file of sample accessions|sample ID
--empty generate empty manifest
-o|outfile path for output manifest
-h|help this help message
When supplying a file of sample IDs ("-t file --file_id_type sample"), the IDs should
be ERS numbers (e.g. "ERS123456"), not sample accessions.
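Before passing a file of sample IDs, it may help to confirm that every line looks like an ERS number. This is a rough sketch under two assumptions: the file holds one ID per line, and an ERS number is "ERS" followed by digits, as in the example above. The function name check_ers_ids is hypothetical.

```shell
# Report lines in an ID file that do not look like ERS numbers.
# Assumes one ID per line; the "ERS" + digits pattern is inferred from
# the example "ERS123456" and is not an official ENA definition.
check_ers_ids() {
    # grep -v selects non-matching lines, -n prefixes their line numbers.
    # grep exits 0 when it printed at least one offending line.
    if grep -vnE '^ERS[0-9]+$' "$1"; then
        return 1
    fi
    return 0
}
```

If the check passes, the file can be supplied with `generate_sample_manifest -t file --file_id_type sample -i ids.txt -o manifest.xls`, where ids.txt and manifest.xls are placeholder names.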
Usage: validate_sample_manifest [options]
-f|file input manifest for validation
-r|report output path for validation report
--edit create additional manifest with mistakes fixed (where possible)
-o|outfile output path for edited manifest
-h|help this help message
Usage: compare_sample_metadata [options]
-f|file input manifest for comparison
-o|outfile output path for comparison report
-h|help this help message
Usage: update_sample_metadata [options]
-f|file input manifest for update
-o|outfile output path for validation report
--no_validate skip validation step (for cases where validation has already been done)
-h|help this help message
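The three sample manifest scripts can be chained so that an update is only attempted after validation and comparison succeed. The sketch below shows one plausible order, using the script names from the list at the top of this document. It assumes that each script exits non-zero on failure and that update_sample_metadata does not need -o when --no_validate is given; neither is confirmed by the usage text. The function name update_samples and the report file names are hypothetical.

```shell
# Validate, compare, then update a sample manifest (hypothetical wrapper).
# Stops at the first step that fails, assuming each script signals
# failure through a non-zero exit code.
update_samples() {
    manifest=$1
    report_dir=$2

    validate_sample_manifest -f "$manifest" -r "$report_dir/validation_report.txt" || return 1
    compare_sample_metadata -f "$manifest" -o "$report_dir/comparison_report.txt" || return 1

    # Validation already ran above, so it is skipped here.
    update_sample_metadata -f "$manifest" --no_validate
}
```

A call such as `update_samples manifest.xls reports` would write both reports into the reports directory, which must already exist.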
Usage: generate_analysis_manifest [options]
-t|type lane|study|file|sample
-i|id lane ID|study ID|file of lanes|file of samples|sample ID
-o|outfile path for output manifest
--empty generate empty manifest
-p|pubmed_id pubmed ID associated with analysis
-a|file_type [assembly|annotation] defaults to assembly
-h|help this help message
Usage: submit_analysis_objects_via_cli.pl [options] -f manifest.xls
-f|file Excel spreadsheet manifest file (required)
-o|output_dir Base output directory. A subdirectory within that will be created for the submission (required)
-c|context Submission context ( one of genome, transcriptome, sequence, reads. Default: genome)
--no_validate Do not run validation step
--no_submit Do not run submit step
--test Use the ENA test submission service
-h|help This help message
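The --no_submit and --test options allow a submission to be rehearsed before it goes to the production service. The following sketch runs validation alone and then submits to the ENA test service. The function name trial_submission is hypothetical, and the sketch assumes the script exits non-zero when validation fails.

```shell
# Rehearse a submission in two stages (hypothetical wrapper).
trial_submission() {
    manifest=$1
    output_dir=$2

    # Stage 1: validate only, nothing is submitted.
    submit_analysis_objects_via_cli.pl -f "$manifest" -o "$output_dir" -c genome --no_submit || return 1

    # Stage 2: submit to the test service, skipping the validation
    # that stage 1 has already performed.
    submit_analysis_objects_via_cli.pl -f "$manifest" -o "$output_dir" -c genome --no_validate --test
}
```

For example, `trial_submission assemblies.xls ena_out` uses the placeholder names assemblies.xls and ena_out. Dropping --test from the second stage would send the submission to the production service instead.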
This script is no longer required, as EMBL validation is performed while submitting using submit_analysis_objects_via_cli.pl.
Usage: validate_embl [options] embl_files
-h|help This help message
This script takes no arguments but needs to be run from the webin cli output directory (i.e. the standard ENA output directory for this run).
Follow instructions here
Bio-ENA-DataSubmission is free software, licensed under GPLv3.
Please report any issues to the issues page or email path-help@sanger.ac.uk.