A framework and family of scripts to evaluate molecular evolution (and misannotation) of gene ortholog groups, between higher taxa.

NCBI-Hackathons/GeneFamTaxScan

Scripts and a framework for evaluating annotation errors in user-selected gene families within a chosen taxonomic group. RefSeq Genomes and RefSeq Proteins serve as a "Standard Mean Reference" for identifying outlying annotation parameters in orthologous non-RefSeq genes of interest.

GeneFamTaxScan

Steps:

1. Retrieve a table (.csv) of Assembly stats for a specified higher taxon (AssemblyStatsFromTaxa.sh)

bash AssemblyStatsFromTaxa.sh <NCBI tax id>

Example: 9443 (Primates)

Output Example: (AssemblyStats.csv)

2. Assembly Stats analysis (AssemblyStatsCompare.R)

Rscript AssemblyStatsCompare.R

Produces a viewable .pdf called Rplots.pdf

Output Example: (AssemblyStatsGraphs.md)

3. Retrieve a table (.csv) of Protein stats for a specified gene ortholog group (ProtStatsFromGeneID.sh)

bash ProtStatsFromGeneID.sh <NCBI Gene Ortholog Id> <NCBI tax id>

Example: gene 29102 (Drosha), taxon 9989 (Rodents)

Output Example: (ProtStats.csv)

4. Protein Stats analysis (ProtStatsCompare.R, reads output from ProtStatsFromGeneID.sh)

Rscript ProtStatsCompare.R <txid 1> <txid 2> <NCBI Gene Ortholog Id>

** Takes the output for two different taxa (assuming the same ortholog group) and compares them.

Output Example (Rodents/Primates): graphs (ProtStatsResults.md) and a list of protein sequences outside the standard deviation ranges (Prot_Abnormals.csv).

5. Retrieve Gene .fastas for a given Homologene uid; the gene sequence is pulled from the Assembly using the chr_start and chr_stop positions (GeneFastaFromHomologene.sh)

bash GeneFastaFromHomologene.sh <Family name> <NCBI Homologene uid>

Example: bash GeneFastaFromHomologene.sh Drosha 8293

** Note: Gene Orthologs covers only vertebrates. Homologene has limited coverage of invertebrate model organisms.
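The coordinate-based extraction in step 5 can be illustrated with a small sketch. This is not the code in GeneFastaFromHomologene.sh: the function names are hypothetical, and the sketch assumes 1-based, inclusive coordinates with the strand supplied separately. Check the convention of whatever supplies chr_start and chr_stop, and shift the values first if they are 0-based.

```python
# Hypothetical helpers for illustration; not part of the repository scripts.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_gene(chrom_seq, chr_start, chr_stop, minus_strand=False):
    """Slice a gene out of a chromosome or scaffold sequence.

    Assumes 1-based, inclusive coordinates (an assumption of this sketch).
    """
    # Accept the coordinates in either order.
    if chr_start > chr_stop:
        chr_start, chr_stop = chr_stop, chr_start
    if chr_start < 1 or chr_stop > len(chrom_seq):
        raise ValueError("coordinates fall outside the sequence")
    gene = chrom_seq[chr_start - 1:chr_stop]
    if minus_strand:
        # Reverse-complement genes annotated on the minus strand.
        gene = gene.translate(_COMPLEMENT)[::-1]
    return gene


def to_fasta(header, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

In practice the chromosome sequence would come from the Assembly record and the header from the gene and taxon identifiers.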

6. Retrieve Protein .fastas of given GeneIDs with associated RefSeq genomes. (ProtFastaFromGene.sh)

bash ProtFastaFromGene.sh <NCBI Gene uid>

7. Retrieve RefSeq Assembly .gz files for taxa of interest. (AssemblyRefseqFastasByTax.sh)

bash AssemblyRefseqFastasByTax.sh <NCBI taxid>

8. Make BLAST databases from Gene .fastas, RefSeq Protein .fastas, RefSeq Assembly .gz.
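No script is listed for step 8. A minimal sketch of how the three databases might be built with the BLAST+ `makeblastdb` program follows. The file and database names are placeholders rather than the names the scripts above write, and the .gz assemblies are assumed to have been decompressed first.

```python
import shlex


def makeblastdb_cmd(fasta_path, dbtype, out_name):
    """Build a BLAST+ makeblastdb command line as an argument list."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", out_name]


# Placeholder inputs (hypothetical file names): gene sequences,
# RefSeq proteins, and a decompressed RefSeq assembly.
inputs = [
    ("Drosha_genes.fasta", "nucl", "Drosha_genes"),
    ("Drosha_refseq_prot.fasta", "prot", "Drosha_refseq_prot"),
    ("refseq_assembly.fna", "nucl", "refseq_assembly"),
]

for fasta_path, dbtype, out_name in inputs:
    # Print the command for inspection instead of running it.
    print(shlex.join(makeblastdb_cmd(fasta_path, dbtype, out_name)))
```

With BLAST+ installed, each argument list can be passed to `subprocess.run(cmd, check=True)` to actually build the database.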

9. Retrieve non-RefSeq Genome and Protein accessions for the Taxonomy subset of interest. Compare their meta-stats to the "Reference" sequence SD values to find sequences that fall outside the Reference ranges or have divergent BLAST results.

  • Retrieve child taxa from a parent node using ETE3 (use ChildTaxaByParent.py).
  • Build a comparative heuristic of Protein Slen (sequence length) vs. Assembly quality: between and within parent taxon groups, RefSeq vs. non-RefSeq, and Assembly vs. Protein stats.
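The reference-range comparison in step 9 might look like the sketch below. It is an assumption-laden illustration, not the repository's method: the function names, the cutoff of `k` standard deviations, and the use of protein length as the single statistic are all choices made here for the example.

```python
from statistics import mean, stdev


def reference_range(ref_values, k=2.0):
    """Return (low, high) as mean +/- k sample standard deviations."""
    centre = mean(ref_values)
    spread = stdev(ref_values)  # needs at least two reference values
    return centre - k * spread, centre + k * spread


def flag_outliers(candidates, ref_values, k=2.0):
    """List candidate ids whose value falls outside the reference range.

    candidates maps an identifier (for example a non-RefSeq accession)
    to the statistic being compared, such as protein length.
    """
    low, high = reference_range(ref_values, k)
    return [name for name, value in candidates.items()
            if value < low or value > high]
```

Here `ref_values` would be the RefSeq protein lengths for the ortholog group and `candidates` the non-RefSeq proteins from the same taxon. The same pattern applies to any other column of the stats tables.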

10. Visualize sequence comparisons (NCBI Genome Workbench).
