Microbial sequences

Keiran Raine edited this page Apr 18, 2016 · 3 revisions

The base file can be downloaded and extracted as follows:

curl -sSL ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz | tar -zvx -O > all_bacteria.fa
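
Because the download is streamed straight into tar, a truncated transfer can leave a partial file behind. As a quick sanity check, the number of FASTA records can be counted. This is a minimal sketch; the helper name count_fasta_records is illustrative and is not part of BRASS.

```shell
# Count FASTA records in a file: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

Running `count_fasta_records all_bacteria.fa` after the download should print a non-zero count.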

I found the faSplit tool to be unreliable in how it split files, so I put a simple Perl script in place to handle this:

wget https://raw.githubusercontent.com/cancerit/BRASS/master/perl/bin/utils/brassSplitFasta.pl
chmod u+x brassSplitFasta.pl
mkdir -p out
./brassSplitFasta.pl all_bacteria.fa out/

Next, convert these to 2bit files:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
chmod u+x faToTwoBit
mkdir -p 2bit
./faToTwoBit -noMask out/all_ncbi_bacteria.1.fa 2bit/all_ncbi_bacteria.1.fa.2bit
./faToTwoBit -noMask out/all_ncbi_bacteria.2.fa 2bit/all_ncbi_bacteria.2.fa.2bit
./faToTwoBit -noMask out/all_ncbi_bacteria.3.fa 2bit/all_ncbi_bacteria.3.fa.2bit
./faToTwoBit -noMask out/all_ncbi_bacteria.4.fa 2bit/all_ncbi_bacteria.4.fa.2bit
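
The four conversions above can also be written as a loop. This is a minimal sketch, assuming the split step produced out/all_ncbi_bacteria.1.fa through out/all_ncbi_bacteria.4.fa as in the commands above. The function name convert_to_2bit and its optional third argument (path to the converter binary) are illustrative and not part of BRASS or the UCSC tools.

```shell
# Convert the four split FASTA files to 2bit, stopping at the first failure.
# $1 = directory holding the split FASTA files
# $2 = directory for the 2bit output
# $3 = path to the faToTwoBit binary (hypothetical parameter, defaults to ./faToTwoBit)
convert_to_2bit() {
    in_dir=$1
    out_dir=$2
    converter=${3:-./faToTwoBit}
    mkdir -p "$out_dir" || return 1
    for n in 1 2 3 4; do
        "$converter" -noMask \
            "$in_dir/all_ncbi_bacteria.$n.fa" \
            "$out_dir/all_ncbi_bacteria.$n.fa.2bit" || return 1
    done
}
```

With the layout used on this page, the call would be `convert_to_2bit out 2bit`.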

BRASS expects 4 of these files, with 'N.fa.2bit' as the extension, so if you subset the input, ensure you still have 4 files.
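
Before pointing BRASS at the directory, it may be worth confirming that all four files are present. This is a minimal sketch, assuming N runs from 1 to 4 and that the files keep the all_ncbi_bacteria prefix used in the commands above (BRASS itself may only care about the extension). The function name check_2bit_files is illustrative.

```shell
# Check that the four expected 2bit files exist and are non-empty.
# $1 = directory holding the 2bit files (defaults to ./2bit)
# Returns 0 if all four are present, 1 otherwise.
check_2bit_files() {
    dir=${1:-2bit}
    missing=0
    for n in 1 2 3 4; do
        f="$dir/all_ncbi_bacteria.$n.fa.2bit"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_2bit_files 2bit && echo "all 4 files present"`.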

NOTE: These instructions assume you are running on x86_64 Linux. If not, you may need to build faToTwoBit from source, but first check whether pre-compiled binaries are available for your platform here.
