This repository contains the New York City Public Health Laboratory local datasets for the paper, "Lineage assignments using phylogenetic placement/UShER is superior to machine learning methods." The SARS-CoV-2 samples were collected between August 1, 2021 and November 30, 2021.
- pipeline.txt - Overview of pipeline and associated scripts
Data files:
Fasta files can be passed directly to any software that accepts multi-fasta input, such as pangolin or Nextclade. A multi-fasta file is not to be confused with a multiple sequence alignment (MSA), in which the sequences are aligned against each other rather than just listed one after another.
- nyc_failed_aug-nov2021.fasta - Multi-Fasta file containing 469 genome consensus sequences for SARS-CoV-2 that had N >10%.
- nyc_passed_aug-nov2021.fasta.xz - Compressed fasta file containing genome consensus sequences for SARS-CoV-2 that had N <10%.
- To uncompress with the xz-utils package, the command is
unxz nyc_passed_aug-nov2021.fasta.xz
- ca_nyc_mafft.alignment.fasta.xz - Compressed MSA created by MAFFT
- To uncompress with the xz-utils package, the command is
unxz ca_nyc_mafft.alignment.fasta.xz
- pango_consensus_ca_nyc.aligned_sept8_2023_masked_maple034inferenceJC_noAncertaintyAssignment_reRooted_nexusTree.tree - MAPLE tree used for lineage assignment validation
- 60k_public_meta.tsv - NCBI metadata for 2021 global dataset
- 2022-global-episet.pdf - GISAID supplemental table to access 2022 global dataset
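As a rough illustration of how the fasta files above can be inspected, the following Python sketch reads a multi-fasta file (plain or xz-compressed) and reports the fraction of N bases per sequence. It is not part of the paper's pipeline: the helper names `open_fasta`, `read_fasta`, and `n_fraction` are hypothetical, and counting N over the full sequence length is an assumption about how the 10% cutoff might be computed.

```python
import io
import lzma
from pathlib import Path


def open_fasta(path):
    """Open a fasta file as text, decompressing on the fly if it ends in .xz."""
    path = Path(path)
    if path.suffix == ".xz":
        return lzma.open(path, "rt")
    return open(path, "rt")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-fasta text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq):
    """Fraction of N characters in a sequence (assumed definition, case-insensitive)."""
    if not seq:
        return 1.0  # an empty sequence has no usable bases
    return seq.upper().count("N") / len(seq)


# Small in-memory demo so the sketch runs without the repository files.
demo = io.StringIO(">sample_a\nACGTNNNN\nACGT\n>sample_b\nACGTACGT\n")
for name, seq in read_fasta(demo):
    print(name, len(seq), round(n_fraction(seq), 3))
```

To apply this to the repository data, pass the handle returned by `open_fasta("nyc_passed_aug-nov2021.fasta.xz")` to `read_fasta` instead of the in-memory demo; no prior `unxz` step is needed in that case.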
Script files:
- compareLineages.py - Python script to compare pangolin lineages to MAPLE tree
- comparison_script_w_ami.py - Python script to calculate Adjusted Mutual Information
- snp_scorpio-comparisons.sh - Bash script to preprocess SNP distance matrix
- snp_scorpio-comparisons.Rmd - R script to analyze SNP distance matrix data and scorpio
- tables_and_violin_plots.R - R script to create visualization to compare genome coverage to reassignment and other tables
- sankey_plots.R - R script to create visualization to look at lineage stability
- Files cataloguing new lineages introduced between pangolin versions during the study period, which we considered permitted changes
- expected.13.14.tsv
- expected.14.15.tsv
- expected.15.16.tsv
- expected.2021-11-09_v1.2.133.tsv
Supplemental files:
- Supplementary_table_1.csv
- Supplementary_table_2.csv
- Supplementary_table_3.csv
- Supplementary_table_4.csv
- Supplementary_file_public_60k_pusher_nohash.html - Interactive HTML showing the lineage reassignments across different versions of pUShER
- Supplementary_file_public_60k_plearn_nohash.html - Interactive HTML showing the lineage reassignments across different versions of pangoLEARN
Authors:
- Adriano de Bernardi Schneider
- Michelle Su
- Angie S. Hinrichs
- Jade Wang
- Helly Amin
- John Bell
- Debra A. Wadford
- Áine O'Toole
- Emily Scher
- Marc D. Perry
- Yatish Turakhia
- Nicola De Maio
- Andrew Rambaut
- Scott Hughes
- Russ Corbett-Detig