Bio-ENA-DataSubmission

Scripts for submitting data to the ENA.

Introduction

Bio-ENA-DataSubmission provides tools for generating, validating and submitting manifests to the ENA. The following scripts are included:

  • generate_sample_manifest Generate a sample manifest for preparation of sample metadata updates at ENA
  • validate_sample_manifest Validate the sample manifest: check that all compulsory fields are filled out, that the formatting is correct, and that the taxon ID and species name match
  • compare_sample_metadata Compare sample manifest to the existing data on the ENA public records
  • update_sample_metadata Convert the sample manifest to an XML file and send it to datahose to submit to ENA
  • generate_analysis_manifest Generate an analysis manifest for preparing the submission of genome assemblies to the ENA
  • submit_analysis_objects_via_cli.pl Submit genome assemblies or annotated assemblies to the ENA
  • validate_embl Run the ENA EMBL validator to check for issues with the EMBL file before submission
  • extract_erz.sh Extract ERZ numbers out of the webin cli output

Installation

Bio-ENA-DataSubmission has the following dependencies:

Required dependencies

Details for installing Bio-ENA-DataSubmission are provided below. If you encounter an issue when installing Bio-ENA-DataSubmission please contact your local system administrator. If you encounter a bug please log it here or email us at path-help@sanger.ac.uk.

From Source

Clone the repository:

git clone https://github.com/sanger-pathogens/Bio-ENA-DataSubmission.git

Move into the directory and install all dependencies using Dist::Zilla (dzil):

cd Bio-ENA-DataSubmission
dzil authordeps --missing | cpanm
dzil listdeps --missing | grep -v 'VRTrack::Lane' | cpanm

Run the tests:

dzil test

If the tests pass, install Bio-ENA-DataSubmission:

dzil install

Running the tests

The tests can be run with dzil from the top-level directory:

dzil test

Running end-to-end tests (requires database and correct directory structure)

To enable the end-to-end tests, set the environment variable ENA_SUBMISSIONS_E2E to any value, then run:

dzil test

Prerequisites

  • Java needs to be installed to run webin cli
  • environment variable ENA_SUBMISSIONS_WEBIN_CLI should point to the webin cli jar
  • environment variable ENA_SUBMISSIONS_CONFIG should point to the general configuration of ena submissions
  • environment variable ENA_SUBMISSIONS_DATA should point to the folder containing
    • SRA.common.xsd
    • embl-client.jar
    • sample.xsd
    • submission.xml
    • submission.xsd
    • valid_countries.txt
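
The prerequisites above can be checked before running anything. The following is a minimal sketch, not part of Bio-ENA-DataSubmission: the function name check_ena_prereqs and the "MISSING:" message format are assumptions made for illustration. It only tests that each variable points to something that exists; it does not validate the contents of any file.

```shell
# Hypothetical pre-flight helper; not shipped with Bio-ENA-DataSubmission.
# Prints one "MISSING: ..." line per problem and returns non-zero if any were found.
check_ena_prereqs() {
    problems=0

    # webin cli is a Java program, so a java executable must be on PATH.
    if ! command -v java >/dev/null 2>&1; then
        echo "MISSING: java executable"
        problems=$((problems + 1))
    fi

    # ENA_SUBMISSIONS_WEBIN_CLI should point to the webin cli jar.
    if [ ! -f "${ENA_SUBMISSIONS_WEBIN_CLI:-}" ]; then
        echo "MISSING: ENA_SUBMISSIONS_WEBIN_CLI (webin cli jar)"
        problems=$((problems + 1))
    fi

    # ENA_SUBMISSIONS_CONFIG should point to the general configuration.
    if [ ! -e "${ENA_SUBMISSIONS_CONFIG:-}" ]; then
        echo "MISSING: ENA_SUBMISSIONS_CONFIG (general configuration)"
        problems=$((problems + 1))
    fi

    # ENA_SUBMISSIONS_DATA should be a folder containing the files listed above.
    for f in SRA.common.xsd embl-client.jar sample.xsd \
             submission.xml submission.xsd valid_countries.txt; do
        if [ ! -f "${ENA_SUBMISSIONS_DATA:-}/$f" ]; then
            echo "MISSING: ENA_SUBMISSIONS_DATA/$f"
            problems=$((problems + 1))
        fi
    done

    # Succeed only when nothing was reported.
    [ "$problems" -eq 0 ]
}
```

A typical use would be `check_ena_prereqs || exit 1` at the top of a wrapper script, after exporting the three variables.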

Usage

The following scripts are included in Bio-ENA-DataSubmission.

generate_sample_manifest

Usage: generate_sample_manifest [options]

  -t|type          lane|study|file|sample
  --file_id_type   lane|sample  define ID types contained in file. default = lane
  -i|id            lane ID|study ID|file of lane IDs|file of sample accessions|sample ID
  --empty          generate empty manifest
  -o|outfile       path for output manifest
  -h|help          this help message

  When supplying a file of sample IDs ("-t file --file_id_type sample"), the IDs should
  be ERS numbers (e.g. "ERS123456"), not sample accessions.

validate_sample_manifest

Usage: validate_sample_manifest [options]

    -f|file       input manifest for validation
    -r|report     output path for validation report
    --edit        create additional manifest with mistakes fixed (where possible)
    -o|outfile    output path for edited manifest
    -h|help       this help message

compare_sample_metadata

Usage: compare_sample_metadata [options]

    -f|file       input manifest for comparison
    -o|outfile    output path for comparison report
    -h|help       this help message

update_sample_metadata

Usage: update_sample_metadata [options]

    -f|file       input manifest for update
    -o|outfile    output path for validation report
    --no_validate skip validation step (for cases where validation has already been done)
    -h|help       this help message
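
The three sample-metadata scripts above are naturally run in sequence: validate, compare, then update. The sketch below chains them using only the options listed in the help text. The function names run and update_samples and the report file names are assumptions made for illustration, and the sketch defaults to a dry run that prints each command instead of executing it.

```shell
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to run the real scripts.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: path to the sample manifest (supplied by the caller).
# The report file names are placeholders chosen for this sketch.
update_samples() {
    manifest="$1"

    # 1. Check compulsory fields, formatting, and taxon ID / species name agreement.
    run validate_sample_manifest -f "$manifest" -r validation_report || return 1

    # 2. Compare the manifest with the existing public records at the ENA.
    run compare_sample_metadata -f "$manifest" -o comparison_report || return 1

    # 3. Submit the update; validation was already done in step 1.
    run update_sample_metadata --no_validate -f "$manifest" -o update_report
}
```

Calling `update_samples manifest.xls` with the default settings prints the three commands in order, which makes it easy to review them before setting DRY_RUN=0.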

generate_analysis_manifest

Usage: generate_analysis_manifest [options]

    -t|type          lane|study|file|sample
    -i|id            lane ID|study ID|file of lanes|file of samples|sample ID
    -o|outfile       path for output manifest
    --empty          generate empty manifest
    -p|pubmed_id     pubmed ID associated with analysis
    -a|file_type     [assembly|annotation] defaults to assembly
    -h|help          this help message

submit_analysis_objects_via_cli.pl

Usage: submit_analysis_objects_via_cli.pl [options] -f manifest.xls

	-f|file        Excel spreadsheet manifest file (required)
	-o|output_dir  Base output directory. A subdirectory within that will be created for the submission (required)
	-c|context     Submission context ( one of genome, transcriptome, sequence, reads. Default: genome)
	--no_validate  Do not run validation step
	--no_submit    Do not run submit step
	--test         Use the ENA test submission service
	-h|help        This help message    
    
    

validate_embl

This script is no longer required, as EMBL validation is performed during submission with submit_analysis_objects_via_cli.pl.

Usage: validate_embl [options] embl_files
    -h|help        This help message

extract_erz.sh

This script takes no arguments but needs to be run from the webin cli output directory (i.e. the standard ENA output directory for this run).
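
As a rough illustration of how submission and ERZ extraction fit together, the sketch below submits against the ENA test service and then runs extract_erz.sh from inside the submission's output directory. The function names run, submit_to_test_service and collect_erz are hypothetical, and the sketch assumes both scripts are on PATH. The name of the subdirectory created by the submission script is not documented here, so the caller passes it in explicitly. Commands are printed rather than executed unless DRY_RUN=0.

```shell
# Dry-run by default: each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: Excel manifest, $2: base output directory.
# --test sends the submission to the ENA test submission service.
submit_to_test_service() {
    run submit_analysis_objects_via_cli.pl -f "$1" -o "$2" --test
}

# $1: the submission subdirectory holding the webin cli output.
# extract_erz.sh takes no arguments, so it is run from inside that directory;
# the subshell keeps the caller's working directory unchanged.
collect_erz() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ (cd $1 && extract_erz.sh)"
    else
        (cd "$1" && extract_erz.sh)
    fi
}
```

For example, `submit_to_test_service manifest.xls submissions` followed by `collect_erz submissions/<subdirectory>`, where `<subdirectory>` stands for whatever directory the submission step created.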

Development using vagrant

Follow instructions here

License

Bio-ENA-DataSubmission is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues to the issues page or email path-help@sanger.ac.uk.