Large sequence archives like NCBI SRA can be unruly for beginners. They tend not to provide much overview of what is there, or how to contextualize those data (where and when they were collected).
The scripts are written in Python and R. Below, I detail how I made:
- plot of the most common species in the NCBI trace assembly archive, the archive of assembled transcriptomes (dominated by the insect transcriptome project)
- plot of the most common samples in NCBI SRA (human and mouse samples dominate)
- plot of total species diversity, by taxonomic rank (bacteria dominate)
- plot of most common sources of metagenomes (human and gut samples dominate, then soil, then water)
- map of global distribution of metagenomic samples (can be modified to show any category)
- interactive map app of the distribution of metagenomic samples, using RShiny and leaflet
The master list of the transcriptome shotgun assembly archive can be found at the NCBI Trace archive. There, select only the TSA projects and download the file wgs_selector.csv
(the default name); rename it as desired, I usually add the date.
At this same site, gzipped FASTA-format files can also be downloaded, by species name (e.g. Hormiphora californensis) or by the 4-letter accession (like GGLO01).
This currently contains transcriptomes of:
To get the kingdom and phylum, the NCBI Taxonomy files can be downloaded from the taxdump.tar.gz
file at the FTP:
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
The current, full link can be captured in the command line with:
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
This should be unzipped, and it may be useful to rename this folder by date as well.
Using those databases, the kingdom, phylum, and class can be added to each species:
parse_ncbi_taxonomy.py -n ~/db/taxonomy_20210518/names.dmp -o ~/db/taxonomy_20210518/nodes.dmp --csv -i wgs_selector_tsa_only_20220606.csv --numbers > wgs_selector_tsa_only_20220606.w_king.tsv
Then generate the summary barplot using the R script:
Rscript taxon_barplot.R wgs_selector_tsa_only_20220606.w_king.tsv
As of June 2022, NCBI SRA has over 16 million samples. Previously, in May 2021, the NCBI Trace Archive contained over 10M entries, accounting for 22 petabases (nearly a quadrupling from 6 petabases at the end of 2017).
Chordates (mostly human samples, or mouse) account for over 3.8 million of those, and "uncategorized" samples (probably environmental metagenomic samples) account for over 2.6 million.
The entire metadata library of SRA can be downloaded, and then parsed directly from the .tar.gz
file (which is several Gb). In general, the folder structure can be viewed from the tarball with:
tar -tzf NCBI_SRA_Metadata_Full_20180402.tar.gz | more
Reading from the archive took a long time (several days with 1 CPU). As the file grew rapidly over the years, and I was extracting more data from it, the run time approached a month and still demanded around 20Gb of memory. Thus, a new version using glob on the unzipped archive was implemented. This trades time and memory for hard drive space: the zipped archive (April 2021) is 4.8Gb, while the unzipped archive is 169Gb (97% compression). It should be noted that gzip -l
will not give accurate measures of compression rate for files this size.
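The reason is the gzip format itself: each member's trailer stores the uncompressed size only modulo 2^32, so any file larger than 4 GB wraps around. A minimal demonstration in Python:

```python
import gzip
import struct

def isize_from_gzip(data):
    """Read the ISIZE field from a gzip member's trailer: the last 4 bytes,
    little-endian, hold the uncompressed size modulo 2**32. This is why
    gzip -l misreports the compression rate of files larger than 4 GB."""
    return struct.unpack("<I", data[-4:])[0]

payload = b"x" * 1000
gz = gzip.compress(payload)
print(isize_from_gzip(gz))  # 1000; a 169Gb file would report its size mod 2**32
```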
touch starttime
tar -zxpf NCBI_SRA_Metadata_Full_20220117.tar.gz -C Full_20220117/
touch endtime
# -rw-rw-r-- 1 user user 0 Jun 22 15:38 starttime
# -rw-rw-r-- 1 user user 0 Jun 23 15:13 endtime
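The glob-based pass over the unpacked archive can be sketched as below, assuming the one-folder-per-submission layout of the unpacked tarball, each folder holding a *.sample.xml file (SCIENTIFIC_NAME is a standard tag in SRA sample XML; the demo builds a mock folder rather than requiring the real 169Gb archive):

```python
import glob, os, tempfile
import xml.etree.ElementTree as ET

def sample_names_from_dir(unpacked_dir):
    """Glob per-submission folders for *.sample.xml files and yield
    (accession, scientific name) pairs for each sample found."""
    for path in sorted(glob.glob(os.path.join(unpacked_dir, "*", "*.sample.xml"))):
        accession = os.path.basename(os.path.dirname(path))
        for name in ET.parse(path).iter("SCIENTIFIC_NAME"):
            yield accession, name.text

# demo on a mock folder mimicking one submission in the unpacked archive
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "ERA000001"))
    with open(os.path.join(d, "ERA000001", "ERA000001.sample.xml"), "w") as f:
        f.write("<SAMPLE_SET><SAMPLE><SAMPLE_NAME>"
                "<SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME>"
                "</SAMPLE_NAME></SAMPLE></SAMPLE_SET>")
    result = list(sample_names_from_dir(d))
print(result)  # [('ERA000001', 'Homo sapiens')]
```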
This first version generated a 4-column table containing: sample name, the SRA number, the NCBI Taxonomy number, and the scientific name (species or environment). For downstream analyses, it is now preferable to use another script to produce a longer table of 12 columns. Based on the xml files present, a large number of folders do not have a sample.xml
file, which creates a long list of warnings in the script. An example STDERR is shown for the command below.
./parse_long_sra_metadata.py NCBI_SRA_Metadata_Full_20210104.tar.gz > NCBI_SRA_Metadata_Full_20210104.samples_ext.tab 2> NCBI_SRA_Metadata_Full_20210104.samples.log
# parsing metadata from NCBI_SRA_Metadata_Full_20190914.tar.gz Wed Dec 11 12:24:59 2019
# 100 WARNINGS, WILL NOT DISPLAY MORE Wed Dec 11 12:42:42 2019
# 100000 folders Thu Dec 12 20:32:08 2019
# 200000 folders Sat Dec 14 02:31:20 2019
# 300000 folders Sun Dec 15 06:28:39 2019
# 400000 folders Mon Dec 16 08:23:30 2019
# 500000 folders Tue Dec 17 08:14:53 2019
# 600000 folders Wed Dec 18 05:59:39 2019
# 700000 folders Thu Dec 19 02:02:55 2019
# 800000 folders Thu Dec 19 19:42:26 2019
# 900000 folders Fri Dec 20 11:20:12 2019
# 1000000 folders Sat Dec 21 00:52:02 2019
# 1100000 folders Sat Dec 21 12:21:58 2019
# 1200000 folders Sat Dec 21 21:50:16 2019
# 1300000 folders Sun Dec 22 05:14:40 2019
# Last folder was 1391232, ERA542143/ERA542143.sample.xml Sun Dec 22 10:09:16 2019
# Process completed in 15704.3 minutes
# Found 1391232 folders, and 6298073 samples
# Found 4222631 members not in folders
# Could not find samples for 182338 folders
The long table (from parse_long_sra_metadata.py) is then processed with the taxonomy database to make a 4-column table of species name, kingdom, phylum, and class. This can be added to any existing spreadsheet or processed into the barplot. Here, the numbers are used as input by adding the option --numbers to parse_ncbi_taxonomy.py. This requires the NCBI Taxonomy data (see above).
./parse_ncbi_taxonomy.py -i NCBI_SRA_Metadata_Full_20210104.samples_ext.tab -n ~/db/taxonomy-2021-04-22/names.dmp -o ~/db/taxonomy-2021-04-22/nodes.dmp --numbers --samples --header > NCBI_SRA_Metadata_Full_20210104.w_kingdom.tab
Rscript taxon_barplot.R NCBI_SRA_Metadata_Full_20210104.w_kingdom.tab
As the above command counted each sample separately, species can instead be combined to give a sense of the species diversity. This is done by adding the --unique option to the parse_ncbi_taxonomy.py script.
./parse_ncbi_taxonomy.py -i NCBI_SRA_Metadata_Full_20210104.samples_ext.tab -n ~/db/taxonomy-2021-04-22/names.dmp -o ~/db/taxonomy-2021-04-22/nodes.dmp --numbers --samples --header --unique > NCBI_SRA_Metadata_Full_20210104.w_kingdom_unique.tab
The Rscript then creates the graph, displaying a similar pattern to the number of samples.
Rscript taxon_barplot.R NCBI_SRA_Metadata_Full_20210104.w_kingdom_unique.tab
Almost a quarter of the samples are metagenomic, i.e. those in the "None" category for kingdom, etc. These can be parsed out of the samples file, as names.dmp contains specific taxonomy numbers for many environmental or biological categories. The option --metagenomes-only restricts the analysis to the 333 current metagenomic numbers.
./parse_ncbi_taxonomy.py -i NCBI_SRA_Metadata_Full_20210104.samples_ext.tab -n ~/db/taxonomy-2021-04-22/names.dmp -o ~/db/taxonomy-2021-04-22/nodes.dmp --numbers --samples --metagenomes-only > NCBI_SRA_Metadata_Full_20210104.metagenomes.tab
This is again used as input for the Rscript, to generate another barplot. Unsurprisingly, human samples account for a major part, though soil apparently took the lead from 2018 to 2019, before being overtaken again in 2020.
Rscript metagenomes_barplot.R NCBI_SRA_Metadata_Full_20210104.metagenomes.tab
The 12-column table (from parse_long_sra_metadata.py) includes scientific name, lat-lon, date, source, and location (as best given).
./parse_long_sra_metadata.py NCBI_SRA_Metadata_Full_20210104.tar.gz > NCBI_SRA_Metadata_Full_20210104.samples_ext.tab 2> NCBI_SRA_Metadata_Full_20210104.samples.log
This is converted into the organized 13-column table including the category as a final column:
./parse_ncbi_taxonomy.py -i NCBI_SRA_Metadata_Full_20210104.samples_ext.tab -n ~/db/taxonomy-2021-04-22/names.dmp -o ~/db/taxonomy-2021-04-22/nodes.dmp --numbers --samples --metagenomes-only > NCBI_SRA_Metadata_Full_20210104.metagenomes.tab
The lat-lon information contains many errors due to differing formats or missing data, and must be fixed. The same goes for the dates, which were supposed to be in one of a few formats, but can also be missing or contain errors.
./polish_metagenome_table.py -i NCBI_SRA_Metadata_Full_20210104.metagenomes.tab > NCBI_SRA_Metadata_Full_20210104.metagenomes_latlon-fixed.tab
# Reading NCBI_SRA_Metadata_Full_20210104.metagenomes.tab
# Counted 2487285 entries, wrote 1179990 entries
# 818175 entries had 'VOID' as lat-lon from previous steps, removed
# 483825 entries did not include lat-lon (missing, not collected, etc.), removed
# 106 entries had other values as lat-lon, removed
# 5189 entries had an unknown format of lat-lon, removed
# 50 entries had lat-lon as deg-min-sec format, fixed
# 15 entries had lat-lon as a range, fixed
# 2461 entries had unusual formats, fixed
# 1084214 entries had acceptable date format, 95776 were missing date
# 455 entries had improbable sample dates (before 1990)
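The core of the cleanup is converting the free-text lat-lon field into signed decimal degrees. A simplified sketch of the most common format (parse_latlon is a hypothetical helper; the real script handles many more variants, as the log above shows):

```python
import re

def parse_latlon(text):
    """Parse an SRA-style lat_lon value like '36.9513 N 122.0733 W' into
    signed decimal degrees (N/E positive, S/W negative).
    Returns None for missing or unparseable values."""
    if not text or text.lower() in ("missing", "not collected", "void"):
        return None
    m = re.match(r"^\s*([\d.]+)\s*([NS])[, ]+\s*([\d.]+)\s*([EW])\s*$", text, re.I)
    if m is None:
        return None
    lat = float(m.group(1)) * (1 if m.group(2).upper() == "N" else -1)
    lon = float(m.group(3)) * (1 if m.group(4).upper() == "E" else -1)
    return lat, lon

print(parse_latlon("36.9513 N 122.0733 W"))  # (36.9513, -122.0733)
print(parse_latlon("not collected"))         # None
```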
The output is an extended 16-column tabular file, where the lat-lon has been split into 2 columns and the date into 3 (year, month, day). This makes it much easier to sort by location or year in R.
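The date split can be sketched as below for ISO-style and partial dates (split_date is a hypothetical helper; the actual script deals with more formats and the improbable dates noted above):

```python
def split_date(text):
    """Split a collection date like '2015-07-30' into (year, month, day).
    Partial dates such as '2015-07' or '2015' leave missing fields as None."""
    parts = (text or "").strip().split("-")
    fields = [None, None, None]
    for i, p in enumerate(parts[:3]):
        if p.isdigit():
            fields[i] = int(p)
    return tuple(fields)

print(split_date("2015-07-30"))  # (2015, 7, 30)
print(split_date("2015-07"))     # (2015, 7, None)
print(split_date(""))            # (None, None, None)
```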
The v1 filtered tabular data can be downloaded here. This may be updated later to include approximate locations when the location tag is given (for cities, parks, rock formations, et cetera).
This is used within the R script metagenomes_map.R. Due to the large number of points, it is better to explore them interactively, with the version below.
I attempted two versions of an interactive app, one with the base shiny package and the other using the fancier leaflet. The leaflet version is far better, with easy scrolling, sample popups, satellite view, and most of the transparency handled by the app. However, it lacks the brushedPoints() feature of the base plotting, so the sample table just shows all samples within the current view, which might be a lot.
This is a merged screenshot of the leaflet version:
With so many packages, there are often compatibility issues, so here is the sessionInfo():
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DT_0.17 dplyr_1.0.5 leaflet_2.0.4.1 shiny_1.6.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 pillar_1.5.1 compiler_3.6.3
[4] bslib_0.2.4 later_1.1.0.1 jquerylib_0.1.3
[7] tools_3.6.3 digest_0.6.27 jsonlite_1.7.2
[10] lifecycle_1.0.0 tibble_3.1.0 pkgconfig_2.0.3
[13] rlang_0.4.10 rstudioapi_0.13 cli_2.3.1
[16] DBI_1.1.1 crosstalk_1.1.1 yaml_2.2.1
[19] fastmap_1.1.0 withr_2.4.1 leaflet.providers_1.9.0
[22] generics_0.1.0 vctrs_0.3.6 htmlwidgets_1.5.3
[25] sass_0.3.1 tidyselect_1.1.0 glue_1.4.2
[28] R6_2.5.0 fansi_0.4.2 purrr_0.3.4
[31] magrittr_2.0.1 promises_1.2.0.1 ellipsis_0.3.1
[34] htmltools_0.5.1.1 assertthat_0.2.1 mime_0.10
[37] xtable_1.8-4 httpuv_1.5.5 utf8_1.2.1
[40] cachem_1.0.4 crayon_1.4.1
Among the most common user-entered errors is swapping N/S or E/W. Longitude 0 runs through London, so most of Europe and Asia is E, meaning +X, while the Americas are W, meaning -X. Here are a few samples that should be E but are given a longitude of W.
Another obvious error is the presence of stripes of samples, almost looking like oceanic transects. Each sample's latitude increases by exactly 1 unit over the one before, likely due to drag-copy incrementing introduced by Excel. The submitters probably intended to copy some of the metadata fields in the spreadsheet before submitting to NCBI, not realizing that Excel copies the text but increments numbers wherever it can.
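Such runs can be flagged with a simple heuristic (a sketch, not the script's actual method): mark stretches of consecutive samples whose latitude steps up by exactly 1 degree.

```python
def flag_excel_runs(lats, min_run=4):
    """Return indices of samples in runs where latitude increases by exactly
    1.0 degree per consecutive entry, a signature of Excel drag-copy
    incrementing. min_run sets how long a run must be to count."""
    flagged = set()
    run = [0]
    for i in range(1, len(lats)):
        if round(lats[i] - lats[i - 1], 6) == 1.0:
            run.append(i)
        else:
            if len(run) >= min_run:
                flagged.update(run)
            run = [i]
    if len(run) >= min_run:
        flagged.update(run)
    return sorted(flagged)

lats = [42.5, 43.5, 44.5, 45.5, 46.5, 10.0, 11.2]
print(flag_excel_runs(lats))  # [0, 1, 2, 3, 4]
```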
These two below are examples that appear to be errors, but it is not clear where the (likely) typo occurred. For Banyuls-sur-mer, it is likely that the 47.27 N was supposed to be 42.7 N, as the longitude appears to be correct.
The given geographic place may also be quite offset from the given lat-lon.
Assemblies can be downloaded directly from the NCBI FTP using wget, which can be called through the script download_ncbi_tsa.py. The only input requirement (-i) is a file of the accession numbers.
download_ncbi_tsa.py -i download_codes.txt
The download codes are the 4-letter accessions, with one accession per line. Only the first 4 characters are used, so the names can be directly copied out of the table, and pasted into a text file like:
GAUS.gz
GAUU.gz
GAVC.gz
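The truncation to the 4-letter prefix can be sketched as (accession_prefixes is a hypothetical helper, not the script's actual function):

```python
def accession_prefixes(lines):
    """Keep only the first 4 characters of each pasted accession,
    e.g. 'GAUS.gz' -> 'GAUS', skipping blank lines."""
    return [ln.strip()[:4] for ln in lines if ln.strip()]

print(accession_prefixes(["GAUS.gz", "GAUU.gz", "GAVC.gz"]))  # ['GAUS', 'GAUU', 'GAVC']
```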
With each download, the FASTA headers are changed and written into a new file, in the format Genus_species_XXXX01.renamed.fasta, where XXXX is the accession number. For example, from GFGY01.1.fsa_nt.gz, a new file would be made named Paramacrobiotus_richtersi_GFGY01.renamed.fasta, and the FASTA header of the first sequence:
$ gzip -dc GFGY01.1.fsa_nt.gz | head
>GFGY01000001.1 TSA: Paramacrobiotus richtersi comp116965_c2 transcribed RNA sequence
would be changed to:
$ head Paramacrobiotus_richtersi_GFGY01.renamed.fasta
>Paramacrobiotus_richtersi_comp116965_c2
to preserve the information from Trinity components and allow better downstream identification of splice variants (perhaps from BLAST hits). This works for the vast majority of transcriptomes, which are assembled with Trinity, though it may be necessary to confirm for each sample.
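The header rewrite can be sketched as below, assuming the usual "TSA: Genus species component ..." layout of Trinity-based TSA entries (rename_header is a hypothetical helper, not the script's actual function):

```python
def rename_header(header):
    """Rewrite a TSA FASTA header like
    '>GFGY01000001.1 TSA: Genus species comp116965_c2 transcribed RNA sequence'
    to '>Genus_species_comp116965_c2', keeping the Trinity component ID."""
    tokens = header.lstrip(">").split()
    # tokens: [accession.version, 'TSA:', genus, species, component, ...]
    genus, species, component = tokens[2], tokens[3], tokens[4]
    return ">{}_{}_{}".format(genus, species, component)

h = (">GFGY01000001.1 TSA: Paramacrobiotus richtersi "
     "comp116965_c2 transcribed RNA sequence")
print(rename_header(h))  # >Paramacrobiotus_richtersi_comp116965_c2
```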
Kingdom, phylum, and class ranks can be added to this table so it can be more easily searched. Copy the organism column to a text file (here named sra_trace_species_list_2018-04-05.txt). Then run:
parse_ncbi_taxonomy.py -n names.dmp -o nodes.dmp -i sra_trace_species_list_2018-04-05.txt > sra_trace_species_w_phyla_2018-04-05.txt
In the older method, the names had to be extracted with cut, using grep to remove the header line organism_an. The names and the taxonomy files are then used to regenerate the table including kingdom.
cut -f 5 -d , wgs_selector_tsa_only_2018-08-23.csv | grep -v organism_an > wgs_selector_tsa_only_2018-08-23.names_only
parse_ncbi_taxonomy.py -i wgs_selector_tsa_only_2018-08-23.names_only -n ~/db/taxonomy/names.dmp -o ~/db/taxonomy/nodes.dmp --header > wgs_selector_tsa_only_2018-08-23.w_kingdom.tab
Rscript taxon_barplot.R wgs_selector_tsa_only_2018-08-23.w_kingdom.tab
Thus, it is better to use the methods above, which can read directly from the NCBI TSA csv file.