Scripts and a framework for evaluating annotation errors in user-selected gene families within a chosen taxonomic group. RefSeq Genomes and RefSeq Proteins serve as a "Standard Mean Reference" for identifying outlying annotation parameters in orthologous non-RefSeq genes of interest.
1. Retrieve a table (.csv) of Assembly stats for a specified higher taxon: (AssemblyStatsFromTaxa.sh)
bash AssemblyStatsFromTaxa.sh <NCBI tax id>
Example: 9443 (Primates)
Output Example: (AssemblyStats.csv)
2. Assembly Stats analysis (AssemblyStatsCompare.R)
Rscript AssemblyStatsCompare.R
Produces a viewable .pdf called Rplots.pdf
Output Example: (AssemblyStatsGraphs.md)
3. Retrieve a table (.csv) of Protein stats for a specified gene ortholog group: (ProtStatsFromGeneID.sh)
bash ProtStatsFromGeneID.sh <NCBI Gene Ortholog Id> <NCBI tax id>
Example: gene 29102 (Drosha), taxon 9989 (Rodents)
Output Example: (ProtStats.csv)
4. Protein Stats analysis (ProtStatsCompare.R, reads output from ProtStatsFromGeneID.sh)
Rscript ProtStatsCompare.R <txid 1> <txid 2> <NCBI Gene Ortholog Id>
** Takes the output for two different taxa (assumed to belong to the same orthology group) and compares them.
Output Example (Rodents/Primates): graphs (ProtStatsResults.md) and a list of Protein seqs outside the standard deviation ranges (Prot_Abnormals.csv).
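The idea of listing protein sequences that fall outside a standard deviation range can be sketched in a few lines of Python. This is only an illustration, not the logic of ProtStatsCompare.R: the function name `flag_abnormal_lengths`, the `n_sd` cutoff of 2, and all accessions and lengths are hypothetical.

```python
import statistics

def flag_abnormal_lengths(reference_lengths, query_lengths, n_sd=2.0):
    """Return (accession, length, z) for queries more than n_sd SDs from the reference mean.

    reference_lengths: list of protein lengths taken as the reference set
                       (needs at least two values for a sample SD).
    query_lengths:     dict mapping accession -> protein length.
    n_sd:              assumed cutoff; the real scripts may use another value.
    """
    mean = statistics.mean(reference_lengths)
    sd = statistics.stdev(reference_lengths)  # sample standard deviation
    flagged = []
    for accession, length in query_lengths.items():
        # With zero spread in the reference no z-score is defined; treat as 0.
        z = (length - mean) / sd if sd > 0 else 0.0
        if abs(z) > n_sd:
            flagged.append((accession, length, round(z, 2)))
    return flagged

# Hypothetical lengths, for illustration only.
reference = [1370, 1372, 1374, 1376, 1378]
queries = {"QUERY_A": 1375, "QUERY_B": 980, "QUERY_C": 1390}

abnormal = flag_abnormal_lengths(reference, queries)
for accession, length, z in abnormal:
    print(accession, length, z)
```

With these toy values, QUERY_A sits inside the range while QUERY_B and QUERY_C are flagged. A tight reference set gives a small SD, so even modest length differences get flagged; the cutoff should be chosen with that in mind.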
5. Retrieve Gene .fastas for a given Homologene uid; pulls the gene sequence from the Assembly using the chr_start, chr_stop positions. (GeneFastaFromHomologene.sh)
bash GeneFastaFromHomologene.sh <Family name> <NCBI Homologene uid>
Example: bash GeneFastaFromHomologene.sh Drosha 8293
** Note: Gene Orthologs only extends through vertebrates. Homologene has some limited coverage of invertebrate model organisms.
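Cutting a gene out of a chromosome sequence by chr_start and chr_stop can be sketched as follows. This is a minimal Python illustration, not the code in GeneFastaFromHomologene.sh. It assumes 0-based inclusive coordinates and that chr_start greater than chr_stop marks a minus-strand gene; both assumptions should be checked against the actual records. The function names and the toy sequence are hypothetical.

```python
# Translation table for complementing DNA (upper and lower case, N kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_gene(chrom_seq, chr_start, chr_stop):
    """Slice a gene out of chrom_seq.

    Assumes 0-based inclusive coordinates, and that chr_start > chr_stop
    means the gene lies on the minus strand (an assumption to verify).
    """
    lo, hi = sorted((chr_start, chr_stop))
    region = chrom_seq[lo:hi + 1]
    if chr_start > chr_stop:
        region = reverse_complement(region)
    return region

def to_fasta(header, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"

# Toy chromosome, for illustration only.
chrom = "ATGCCGTAAGGT"
plus_gene = extract_gene(chrom, 2, 6)    # plus strand
minus_gene = extract_gene(chrom, 6, 2)   # minus strand, reverse-complemented
print(to_fasta("toy_gene_plus", plus_gene), end="")
print(to_fasta("toy_gene_minus", minus_gene), end="")
```

If the coordinates turn out to be 1-based, the slice becomes `chrom_seq[lo - 1:hi]`; an off-by-one here shifts every extracted gene by a base, so it is worth confirming on a gene with a known sequence.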
6. Retrieve Protein .fastas of given GeneIDs with associated RefSeq genomes. (ProtFastaFromGene.sh)
bash ProtFastaFromGene.sh <NCBI Gene uid>
7. Retrieve RefSeq Assembly .gz files for taxa of interest. (AssemblyRefseqFastasByTax.sh)
bash AssemblyRefseqFastasByTax.sh <NCBI taxid>
9. Retrieve non-RefSeq Genome and Protein accessions from a Taxonomy subset of interest. Compare their meta-stats to the "Reference" sequence SD values to find sequences that fall outside the Reference ranges or that have divergent BLAST results.
- Retrieve child taxa from a parent node using ETE3 (use ChildTaxaByParent.py).
- Build a comparative heuristic of Protein Slen (sequence length) vs. Assembly quality: between and within taxon parent groups, RefSeq vs. non-RefSeq, and Assembly vs. Protein stats.
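Collecting the child taxa under a parent node amounts to a tree traversal. The sketch below shows that traversal in plain Python over a hand-written parent-to-children dictionary; it does not call ETE3 and is not the contents of ChildTaxaByParent.py. The function name `descendant_taxa` and the toy taxon ids are hypothetical, and a real run would take the tree from the NCBI taxonomy rather than from a literal.

```python
from collections import deque

def descendant_taxa(children_of, parent_id):
    """Return every taxon id below parent_id, in breadth-first order.

    children_of: dict mapping a taxon id to a list of its direct children.
                 Ids missing from the dict are treated as leaves.
    """
    found = []
    queue = deque(children_of.get(parent_id, []))
    while queue:
        taxid = queue.popleft()
        found.append(taxid)
        queue.extend(children_of.get(taxid, []))
    return found

# Toy taxonomy with made-up ids, for illustration only.
toy_tree = {1: [10, 20], 10: [11, 12], 20: [21]}

all_children = descendant_taxa(toy_tree, 1)
# Keep only terminal taxa (those with no children of their own).
leaf_taxa = [t for t in all_children if t not in toy_tree]
print(all_children)
print(leaf_taxa)
```

The leaf list is the part that would feed the per-taxon retrieval scripts, since assemblies and proteins are attached to terminal taxa rather than to internal nodes.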