Updated readme with file structure and clear purpose and examples

eastgenomics · Jan 23, 2024 · 291113a
1 parent 6fd728d
Showing 1 changed file with 111 additions and 12 deletions.
diff --git a/README.md b/README.md
@@ -1,19 +1,73 @@
 # gene_annotation2bed
 
-Custom script for processing a list of ids (HGNC, transcript) or coordinates with associated annotation, into a comprehensive bed file for the corresponding refseq transcripts for each ID entry.
+## Purpose
+Provides BED files for custom gene-level annotation with VEP.
+This custom script processes a list of IDs (HGNC, transcript) or coordinates with associated annotation into a comprehensive BED file covering the corresponding RefSeq transcripts for each ID entry.
 
+![Workflow diagram showing TSV containing IDs and annotation to bed file and how it is used in VEP and visualised in IGV using a VCF](https://raw.githubusercontent.com/eastgenomics/gene_annotation2bed/sprint_2/Workflow.png)
+
+---
 ## What are typical use cases for this script?
 
 - Converting a list of HGNC IDs + associated gene-level annotation information
   into a comprehensive BED file for annotation with Ensembl's VEP.
 - Other use cases include providing different inputs, such as a list of transcripts,
   or using exact coordinates to flag a region such as the TERT promoter.
 
+---
 ## What data are required for this script to run?
 
 - List of ids and annotation information in TSV format.
-- Human Genome Reference (i.e. hs37d5)
-- RefSeq Transcripts file (gff3)
+- Human Genome Reference (e.g. hs37d5).
+- RefSeq Transcripts file (gff3) from 001_reference.
+
+---
+
+## What inputs are required for this app to run?
+
+### Required
+- `-ig`, `--annotation_file` (`str`): Path to the annotation file (TSV).
+- `-o`, `output` (`str`): Suffix for the generated output files.
+- `-build`, `--genome_build` (`str`): Reference genome build, either `hg19` or `hg38`.
+- `-f`, `--flanking` (`int`): Size of the flanking region added to each gene, transcript or set of coordinates provided.
+- `--assembly_summary` (`str`): Path to the assembly summary file, used to gather assembly information.
+- `-gff` (`str`): Path to the GFF file containing all relevant transcripts for the assay, available in 001_reference, e.g. GCF_000001405.25_GRCh37.p13_genomic.gff.
+
+### Useful ones
+
+#### Files
+- `-ref_igv`, `--reference_file_for_igv` (`file`): Path to the Reference genome fasta file for igv_reports, used in generating IGV reports.
+- `-dump`, `--hgnc_dump_path` (`file`): Path to HGNC TSV file with HGNC information. Required if gene symbols are present (`-gs` is specified).
+
+#### Booleans
+- `-gs`, `-symbols_present` (`bool`): Flag to indicate whether gene symbols are present in the annotation file.
+
+#### Misc
+- `-pickle` (`str`): Import the GFF from a pickle file. This is mainly for testing: it speeds up runs because the GFF is not reprocessed each time.
+
+## Example Command
+
+```bash
+python gene_annotation2bed.py -ig annotation.tsv -o output_suffix -build hg38 -f 50 --assembly_summary assembly_summary.txt -ref_igv ref_genome.fasta -symbols_present --hgnc_dump_path hgnc_info.tsv -gff your_file.gff -pickle pickle_file.pkl
+```
+
+---
+
+## Requirements
+
+- pysam
+- pandas
+- igv-reports (v)
+- numpy
+- re
+
+Install using `requirements.txt`: `pip install -r requirements.txt`
+
+---
+
+## How does this app work?
+
+![Workflow diagram showing TSV containing IDs and annotation to bed file and how it is used in VEP and visualised in IGV using a VCF](https://raw.githubusercontent.com/eastgenomics/gene_annotation2bed/sprint_2/Workflow.png)
 
 ## IGV reports output
 
@@ -23,6 +77,7 @@ IGV report:
 The script produces an HTML report of all the BED file entries, displayed in IGV with the RefSeq track
 and BED file aligned with the respective annotation.
 
+<!--
 ## Script Inputs - Defaults & Behaviour
 
 - `Genome` (required): The genome build for the resource
@@ -35,15 +90,59 @@ and bed file aligned with the respecive annotation.
 - Flanking (int): The required flanking either side of the transcripts selected.
 - Assembly summary - corresponding assembly report file for the refseq.gff, this is used
   to determine the corresponding chromosome for each transcript.
+ -->
 
-## Requirements
-
-- pysam
-- pandas
-- igv-reports (v)
-- numpy
-- re
 
-install using `requirements.txt`. `pip install requirements.txt`
+## Structure of code
 
-## Running Script
+├── data
+│   ├── demo
+│   │   ├── after_table.tsv.gz
+│   │   ├── before_table.tsv.gz
+│   │   ├── demo_igv_reports.png
+│   │   └── initial_table.tsv.gz
+│   ├── GCF_000001405.25_GRCh37.p13_assembly_report.txt
+│   ├── GCF_000001405.25_GRCh37.p13_genomic.gff
+│   ├── hg19
+│   │   ├── ncbiRefSeq.txt.gz
+│   │   ├── ncbiRefSeq.txt.gz.tbi
+│   │   ├── refGene.txt.gz
+│   │   └── refGene.txt.gz.tbi
+│   └── hg38
+│       ├── ncbiRefSeq.txt.gz
+│       ├── ncbiRefSeq.txt.gz.tbi
+│       ├── refGene.txt.gz
+│       └── refGene.txt.gz.tbi
+├── gene_annotation2bed.py (MAIN SCRIPT)
+├── LICENSE
+├── output_new_test.vcf
+├── README.md
+├── requirements.txt
+├── scripts
+│   ├── construct_vcf.py
+│   └── igv_report.py
+│
+├── tests
+│   ├── __init__.py
+│   ├── test_construct_vcf.py
+│   ├── test_data
+│   │   ├── coordinates_anno_test.tsv
+│   │   ├── example_bed_hg38.bed
+│   │   ├── expected_output.vcf
+│   │   ├── hgcn_ids_anno_test.tsv
+│   │   ├── hs37d5.fa
+│   │   ├── hs37d5.fa.fai
+│   │   ├── refseq_gff_preprocessed.pkl
+│   │   ├── test_empty_attributes.gff
+│   │   ├── test_empty.gff
+│   │   ├── test_missing_attributes.gff
+│   │   └── transcripts_anno_test.tsv
+│   ├── test_gene_annotation2bed.py
+│   ├── test_gff_parsing.py
+│   └── test_igv_report.py
+│
+├── utils
+│   ├── configure_gff.py
+│   ├── gff2pandas.py
+│   ├── __init__.py
+└── Workflow.png
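
The required inputs this commit documents in the README can be read as a small command-line interface. Below is a minimal sketch of how such flags could be declared with `argparse`. It is not taken from `gene_annotation2bed.py`: the `--output` long form, the `choices` restriction on the genome build, and the help strings are assumptions made for illustration, and the optional flags are left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the required inputs listed in the README."""
    parser = argparse.ArgumentParser(
        description="Sketch of the required gene_annotation2bed inputs."
    )
    parser.add_argument("-ig", "--annotation_file", required=True,
                        help="Path to the annotation file (TSV).")
    # The README lists this as `output`; the `--output` spelling is an assumption.
    parser.add_argument("-o", "--output", required=True,
                        help="Suffix for the generated output files.")
    # Restricting the values with `choices` is an assumption based on the README text.
    parser.add_argument("-build", "--genome_build", required=True,
                        choices=["hg19", "hg38"],
                        help="Reference genome build.")
    parser.add_argument("-f", "--flanking", type=int, required=True,
                        help="Flanking size as an integer.")
    parser.add_argument("--assembly_summary", required=True,
                        help="Path to the assembly summary file.")
    parser.add_argument("-gff", required=True,
                        help="Path to the RefSeq GFF file.")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without real files.
    demo_args = build_parser().parse_args([
        "-ig", "annotation.tsv",
        "-o", "output_suffix",
        "-build", "hg38",
        "-f", "50",
        "--assembly_summary", "assembly_summary.txt",
        "-gff", "your_file.gff",
    ])
    print(demo_args)
```

Declaring `type=int` for the flanking size means a non-integer value is rejected at parse time rather than failing later while the BED file is being built.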