-
Notifications
You must be signed in to change notification settings - Fork 23
Tutorial
So you've heard that the BuddySuite toolkits can make your life easier and you've installed it on your machine, but now what? There are over 100 tools spread across four separate programs (SeqBuddy, AlignBuddy, PhyloBuddy, and DatabaseBuddy), which is a bit much to just dive right into.
To get you started, the following tutorial is a walk-through of a full workflow that begins by downloading a collection of RefSeq mRNAs and ends with a phylogenetic tree. This should give you a nice sampling of the BuddySuite commands, familiarize you with the general syntax and scope of the tools available, and hopefully whet your appetite for exploring the rest of the suite.
Note that we will be using short-form aliases of each toolkit throughout the tutorial (alb=AlignBuddy, db=DatabaseBuddy, pb=PhyloBuddy, sb=SeqBuddy), while tool flags will be given in long-form in the description and short-form in the actual example (e.g., --pull_records, -pr).
So let's begin...
Download mRNAs for the BRICK1 gene (BRK1) from GenBank using DatabaseBuddy
Below is a screen capture of a DatabaseBuddy session where RefSeq mRNAs are retrieved from GenBank for the cytoskeleton protein BRK1. As you replicate each step yourself, take a moment to familiarize yourself with the commands being used by typing help
and then the keyword before you run the actual example command. For instance, before running the first command database ncbi_nuc
, do the following:
DbBuddy> help database
>> Reset the database(s) to be searched. Separate multiple databases with spaces.
>> Currently set to: ncbi_nuc, ncbi_prot, uniprot, ensembl
>> Valid choices: all, ncbi_nuc, ncbi_prot, uniprot, ensembl
The output will be green in your terminal window, and explains what the database
command does. You can also type the word help
on it's own to get a full list of available commands.
One common 'gotcha' with DatabaseBuddy happens during the final write
step. You MUST first specify the format you want to write your fetched sequences in (in this case, we selected GenBank format), otherwise you will create a summary file with no actual sequence data.
Analyze and tidy up BRK1 sequences using SeqBuddy
Approximately 150 mRNAs should have been retrieved during the DatabaseBuddy session. To get an exact count, use the --num_seqs
command:
$: sb brk1_mRNA.gb -ns
>> 153
You may get slightly different results because GenBank is being continuously updated, but you shouldn't have a wildly different number than 153. If you wish to work with the exact sequences used to create this tutorial, they can be found here.
Now let's have a look at what was actually downloaded by listing all of the sequence IDs with the --list-ids
command. By passing a number into the optional columns
argument, the output will be organized into that many columns (in this case, five):
$: sb brk1_mRNA.gb -li 5
>> XM_015069351.1 XM_002925023.3 XM_014607950.2 XM_003217549.3 XM_012471870.1
>> XM_012441180.1 NM_129400.3 NM_127829.3 XM_007254551.2 XM_014006305.1
...
>> XM_008697894.1 XM_006206815.2 NM_001171700.1 NM_001171693.1 NM_001015994.2
>> XM_018093034.1 XM_005796913.1 NM_001111720.1
It looks like many sequences have not been curated (i.e., they are labelled with the XM designation). Let's see exactly how many of the sequences are actually curated by piping the results of the --pull_records
command into the --num_seqs
command:
$: sb brk1_mRNA.gb -pr "NM_" | sb -ns
>> 14
Piping data to and from BuddySuite tools is a core feature of the suite. It doesn't matter whether the piped data comes from a BuddySuite tool or not, so long as the data is sent to stdout in any of the sequence, alignment, or tree formats supported by the receiving tool.
Be careful though! Not all BuddySuite tools actually return sequence, alignment, or tree data. We just saw this in the previous example, because --num_seqs
returns a number. If we tried to pipe this back into --list-ids
, it would fail:
$: sb brk1_mRNA.gb -pr "NM_" | sb -ns | sb -li 5
>> GuessError: Could not determine format from input file-like object
>> --> 14
>> ...
>> Try explicitly setting with -f flag.
The term 'file-like' object is a bit cryptic, as it refers to how SeqBuddy is trying to read the data from the terminal, but if you run into an error message like this check to see if the output from one of your steps is non-sequence data.
Fourteen records is a manageable number to visually inspect, so have a quick look at them:
$: sb brk1_mRNA.gb -pr "NM_"
>> LOCUS NM_166648 349 bp DNA linear INV 13-DEC-2016
>> DEFINITION Drosophila melanogaster haematopoietic stem/progenitor cell protein
>> 300, transcript variant A (HSPC300), mRNA.
...
>> 4801 ctatgtatgt gtatttgaat acatacatag atatgtattg tgtgttctca atcatttgca
>> 4861 gtgtcctttg actgtatatt tggtc
>> //
There is quite a bit of meta data associated with these GenBank records, so for clarity we will often pull out a single record ($: sb brk1_mRNA.gb -pr 'NM_001200626'
) in the following steps as a sanity check:
LOCUS NM_001200626 439 bp DNA linear VRT 10-JUL-2016
DEFINITION Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
subunit (brk1), mRNA.
ACCESSION NM_001200626
VERSION NM_001200626.1
KEYWORDS RefSeq.
SOURCE Ictalurus punctatus (channel catfish)
...
FEATURES Location/Qualifiers
source 1..439
/chromosome="21"
/db_xref="taxon:7998"
/map="21"
/mol_type="mRNA"
/organism="Ictalurus punctatus"
gene 1..439
/db_xref="GeneID:100528356"
/gene="brk1"
/note="BRICK1, SCAR/WAVE actin nucleating complex subunit"
CDS 16..243
/codon_start=1
/db_xref="GeneID:100528356"
/gene="brk1"
/note="probable protein brick1"
/product="protein BRICK1"
/protein_id="NP_001187555.1"
/translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
RLATLNEKLTALERRIEYIEARVTKGETLT"
ORIGIN
1 gcagctggag ggaaaatggc cggccaggag gatccagtgc agagagaaat ccaccaggat
61 tgggcgaatc gcgagtatat tgaagtgatt acaagcagca tcaagaaaat agccgatttc
121 ctcaactcgt ttgatatgtc ctgcaggtcc cgtttagcca cgctgaacga gaagctcacc
181 gccctggaga gaaggatcga gtacatcgaa gcccgggtta caaaagggga aacgctgacc
241 taaccatccc gacgctgacg actggacaac atggccacac tcctgtttct ttttcttttc
301 ttcccctcta tcctcattca ccttaaataa ccaagcgact gtttttcaac cagtctctta
361 aagccccagt gaggaactgc tgatacagtg ctacgctgcg ttaggactta aaaaaggaac
421 atttaaaaaa aaaaaaaaa
//
This is a pretty standard looking messenger RNA, with 3' and 5' untranslated regions and a classic poly-A tail. Notice that the actual coding sequence (CDS) is listed as an annotation, spanning residues 16-243. Every other record in our subset of 14 curated sequences also has a CDS annotation, which is what we want to focus on for our forthcoming phylogenetic analysis. We should still confirm that all 153 sequences in our dataset have a CDS annotation, however, so let's use the --pull_records_with_feature
command:
$: sb brk1_mRNA.gb -prf CDS | sb -ns
>> 153
All 153 sequences were returned, meaning all are appropriately annotated. Now we should pull each CDS out of its record using --extract_feature_sequences
.
$: sb brk1_mRNA.gb -efs CDS
>> LOCUS XM_009569415 114 bp DNA UNK 01-JAN-1980
>> DEFINITION PREDICTED: Cuculus canorus BRICK1, SCAR/WAVE actin-nucleating
...
>> 181 atagagtaca ttgaagcacg ggtgacaaaa ggtgagaccc tcacctag
>> //
Again, let's just look at NM_001200626:
LOCUS NM_001200626 228 bp DNA linear VRT 10-JUL-2016
DEFINITION Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
subunit (brk1), mRNA.
ACCESSION NM_001200626
VERSION NM_001200626.1
KEYWORDS RefSeq.
SOURCE Ictalurus punctatus (channel catfish)
...
FEATURES Location/Qualifiers
source 1..228
/chromosome="21"
/db_xref="taxon:7998"
/map="21"
/mol_type="mRNA"
/organism="Ictalurus punctatus"
gene 1..228
/db_xref="GeneID:100528356"
/gene="brk1"
/note="BRICK1, SCAR/WAVE actin nucleating complex subunit"
CDS 1..228
/codon_start=1
/db_xref="GeneID:100528356"
/gene="brk1"
/note="probable protein brick1"
/product="protein BRICK1"
/protein_id="NP_001187555.1"
/translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
RLATLNEKLTALERRIEYIEARVTKGETLT"
ORIGIN
1 atggccggcc aggaggatcc agtgcagaga gaaatccacc aggattgggc gaatcgcgag
61 tatattgaag tgattacaag cagcatcaag aaaatagccg atttcctcaa ctcgtttgat
121 atgtcctgca ggtcccgttt agccacgctg aacgagaagc tcaccgccct ggagagaagg
181 atcgagtaca tcgaagcccg ggttacaaaa ggggaaacgc tgacctaa
//
Notice how all of the annotations now span the same 228 residues? When SeqBuddy pulled out the subsequence, it also shifted the annotations as needed. Part of the 'gene' and 'source' sequence was deleted, but what remains is still annotated; of course this information is no longer useful, so we can clean the records up a bit by removing those two annotations with the --delete_features
command:
$: sb brk1_mRNA.gb -efs CDS | sb -df 'gene' 'source'
LOCUS NM_001200626 228 bp DNA linear VRT 10-JUL-2016
DEFINITION Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
subunit (brk1), mRNA.
ACCESSION NM_001200626
VERSION NM_001200626.1
KEYWORDS RefSeq.
SOURCE Ictalurus punctatus (channel catfish)
...
FEATURES Location/Qualifiers
CDS 1..228
/codon_start=1
/db_xref="GeneID:100528356"
/gene="brk1"
/note="probable protein brick1"
/product="protein BRICK1"
/protein_id="NP_001187555.1"
/translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
RLATLNEKLTALERRIEYIEARVTKGETLT"
ORIGIN
1 atggccggcc aggaggatcc agtgcagaga gaaatccacc aggattgggc gaatcgcgag
61 tatattgaag tgattacaag cagcatcaag aaaatagccg atttcctcaa ctcgtttgat
121 atgtcctgca ggtcccgttt agccacgctg aacgagaagc tcaccgccct ggagagaagg
181 atcgagtaca tcgaagcccg ggttacaaaa ggggaaacgc tgacctaa
//
Great! So we can isolate the coding sequences for each of our sequences and tidy up the record a little bit. For the phylogeny we're going to use amino acid alignments instead of nucleotide, and the --translate
tool can handle the conversion:
$: sb brk1_mRNA.gb -efs CDS | sb -df 'gene' 'source' | sb -tr
>> Warning: size mismatch between aa and nucl seqs for XM_010176471.1 --> 146, 48
>> Warning: size mismatch between aa and nucl seqs for XM_005063074.1 --> 199, 66
>> LOCUS XM_009569415 38 aa UNK 01-JAN-1980
>> DEFINITION PREDICTED: Cuculus canorus BRICK1, SCAR/WAVE actin-nucleating
...
>> ORIGIN
>> 1 magqedlvqr eidqdwanre yievitssik kiadflnsfd mscrsrlatl nekltalerr
>> 61 ieyiearvtk getlt*
>> //
Uh-oh... There's a warning message telling us that there's something a little weird with records XM_010176471.1 and XM_005063074.1. Use --pull_records
to check them out:
sb brk1_mRNA.gb -pr 'XM_005063074.1' 'XM_010176471.1'
LOCUS XM_005063074 199 bp DNA linear VRT 20-APR-2016
DEFINITION PREDICTED: Ficedula albicollis BRICK1, SCAR/WAVE actin nucleating
complex subunit (BRK1), partial mRNA.
...
COMMENT MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NW_004792212.1) annotated using gene prediction method: Gnomon.
Also see:
Documentation of NCBI's Annotation Process
Release 101
pipeline
COMPLETENESS: incomplete on both ends.
FEATURES Location/Qualifiers
source 1..199
/chromosome="Unknown"
/collection_date="2009"
/country="Sweden: Oland"
/db_xref="taxon:59894"
/isolate="OC2"
/mol_type="mRNA"
/organism="Ficedula albicollis"
/sex="male"
gene <1..>199
/db_xref="GeneID:101821110"
/gene="BRK1"
/note="Derived by automated computational analysis using
gene prediction method: Gnomon. Supporting evidence
includes similarity to: 11 Proteins, and 84% coverage of
the annotated genomic feature by RNAseq alignments,
including 18 samples with support for all annotated
introns"
CDS <1..>199
/codon_start=1
/db_xref="GeneID:101821110"
/gene="BRK1"
/product="protein BRICK1"
/protein_id="XP_005063131.1"
/translation="EIHQDWANREYIEVITSSIKKIADFLNSFDMSCRSRLATLNEKLT
ALERRIEYIEARVSSGGWAGP"
ORIGIN
1 gagatccacc aggactgggc caaccgcgag tacatcgagg tgatcaccag ctccatcaag
61 aagatcgcgg acttcctcaa ctccttcgac atgtcgtgcc ggtcccgact ggccaccctg
121 aacgagaagc tgacggcgct ggagcggagg atcgagtaca tcgaggcccg ggtgagcagc
181 ggggggtggg ccgggccgg
//
It would seem that these are both partial records... We should check that more partials are not lurking in our dataset, but didn't throw a warning message because they happened to be the proper length for translation. To do so, we can again use --pull_records
command, but instead of restricting ourselves to the sequence IDs we'll pass in the optional parameter 'full' to search through the record comments as well.
sb brk1_mRNA.gb -pr 'partial' full | sb -ns
>> 2
So it's just those two records; probably best to delete them from our analysis with the --delete_records
command. By combining this command with the --in_place
modifier, the file will be overwritten with the appropriate records removed:
sb brk1_mRNA.gb -dr 'XM_005063074.1' 'XM_010176471.1' -i
>> # ####################### Deleted records ######################## #
>> XM_005063074.1
>> XM_010176471.1
>> # ################################################################ #
>> File overwritten at:
>> ~/BuddySuite/brk1_mRNA.gb
One other thing that seems to be missing from our records are annotations for phosphorylation sites. As we are all experts in BRK1 (*wink*), we know that it contains highly conserved binding motifs for both Protein Kinase C and Casein Kinase II. We can fetch this information using the --prosite_scan
command, which will send our sequences off to EMBL-EBI's implementation of PROSITE scan and automatically process the results into GenBank annotations. Let's also redirect the final output into a new file, because the Prosite Scan tool can take a few minutes to run:
$: sb brk1_mRNA.gb -efs CDS | sb -df 'gene' 'source' | sb -tr | sb -psc > brk1_mRNA_psc.gb
>> Running function _mc_run_prosite() on 21 cores
>> DONE: 151 jobs in 4 min, 15 sec
$: sb brk1_mRNA_psc.gb -pr 'NM_001200626'
LOCUS NM_001200626.1 76 aa linear VRT 10-JUL-2016
DEFINITION Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
subunit (brk1), mRNA.
ACCESSION NM_001200626
VERSION NM_001200626.1
KEYWORDS RefSeq.
SOURCE Ictalurus punctatus (channel catfish)
...
FEATURES Location/Qualifiers
CDS 1..76
/codon_start=1
/db_xref="GeneID:100528356"
/gene="brk1"
/note="probable protein brick1"
/product="protein BRICK1"
/protein_id="NP_001187555.1"
/translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
RLATLNEKLTALERRIEYIEARVTKGETLT"
PKC_PHOSPHO_SIT 28..30
PKC_PHOSPHO_SIT 42..44
CK2_PHOSPHO_SIT 49..52
CK2_PHOSPHO_SIT 55..58
CK2_PHOSPHO_SIT 69..72
ORIGIN
1 magqedpvqr eihqdwanre yievitssik kiadflnsfd mscrsrlatl nekltalerr
61 ieyiearvtk getlt*
//
Creating and manipulating multiple sequence alignment with AlignBuddy
Now it's time to bring AlignBuddy into the example to help us create a multiple sequence alignment. Very smart people have spent a good deal of time creating alignment tools, so AlignBuddy does not calculate alignments on its own. Instead, its --generate_alignment
tool handles all the pre-processing required to pass your sequences into third-party software like Clustal Omega or MAFFT, then it post-processes the results to retain sequence annotations.
$: alb brk1_mRNA_psc.gb -ga clustalo
>> AttributeError: clustalo is not a recognized alignment tool. Please check your spelling (case sensitive)
Whoops. Clustal Omega hasn't been installed on our computer! Go to the Clustal website to get the software, then make sure that 'clustalo' is visible in your PATH. If you have named your binary something else, like 'clustal_omega', or 'super_duper_aligner', simply pass in that name; AlignBuddy will figure out what it's supposed to do from there (you can also pass in the full path to the executable if you can't put it directly in your PATH).
Once you get the --generate_alignment
command working, redirect the output into a new alignment file. The use SeqBuddy to pull out our example record again.
$: alb brk1_mRNA_psc.gb -ga clustalomega > brk1_aln.gb
>> Using 24 threads
>> Read 151 sequences (type: Protein) from /Volumes/Zippy/.sysTemp/tmpneqazna1/tmp.fa
>> Using 52 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 151 sequences)
...
>> Alignment written to .sysTemp/tmpneqazna1/result
>> Returning to AlignBuddy...
$: sb brk1_aln.gb -pr 'NM_001200626.1'
LOCUS NM_001200626.1 1465 aa linear VRT 10-JUL-2016
DEFINITION Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
subunit (brk1), mRNA.
ACCESSION NM_001200626
VERSION NM_001200626.1
KEYWORDS RefSeq.
SOURCE Ictalurus punctatus (channel catfish)
...
FEATURES Location/Qualifiers
CDS 82..161
/codon_start=1
/db_xref="GeneID:100528356"
/gene="brk1"
/note="probable protein brick1"
/product="protein BRICK1"
/protein_id="NP_001187555.1"
/translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
RLATLNEKLTALERRIEYIEARVTKGETLT"
PKC_PHOSPHO_SIT 111..113
PKC_PHOSPHO_SIT 126..128
CK2_PHOSPHO_SIT 133..136
CK2_PHOSPHO_SIT 140..143
CK2_PHOSPHO_SIT 154..157
ORIGIN
1 ---------- ---------- ---------- ---------- ---------- ----------
61 ---------- ---------- -magqedpvq reih--qdwa nreyievits sikkiadfln
121 sfdms-crsr latlneklt- alerrieyie arvtkgetlt *--------- ----------
181 ---------- ---------- ---------- ---------- ---------- ----------
241 ---------- ---------- ---------- ---------- ---------- ----------
301 ---------- ---------- ---------- ---------- ---------- ----------
361 ---------- ---------- ---------- ---------- ---------- ----------
421 ---------- ---------- ---------- ---------- ---------- ----------
481 ---------- ---------- ---------- ---------- ---------- ----------
541 ---------- ---------- ---------- ---------- ---------- ----------
601 ---------- ---------- ---------- ---------- ---------- ----------
661 ---------- ---------- ---------- ---------- ---------- ----------
721 ---------- ---------- ---------- ---------- ---------- ----------
781 ---------- ---------- ---------- ---------- ---------- ----------
841 ---------- ---------- ---------- ---------- ---------- ----------
901 ---------- ---------- ---------- ---------- ---------- ----------
961 ---------- ---------- ---------- ---------- ---------- ----------
1021 ---------- ---------- ---------- ---------- ---------- ----------
1081 ---------- ---------- ---------- ---------- ---------- ----------
1141 ---------- ---------- ---------- ---------- ---------- ----------
1201 ---------- ---------- ---------- ---------- ---------- ----------
1261 ---------- ---------- ---------- ---------- ---------- ----------
1321 ---------- ---------- ---------- ---------- ---------- ----------
1381 ---------- ---------- ---------- ---------- ---------- ----------
1441 ---------- ---------- -----
//
So there are a few things to look at here. First, notice that the GenBank annotations have all been adjusted to account for the gaps; we'll be using this property to further assess our alignment in a little bit. We can also see that something is not quite right, because BRK1 is a highly conserved protein around 100 amino acids long, yet our alignment is almost 1500 amino acids long (most of which is gaps in this record). It might be easier to assess the situation if we use --screw_formats
to switch to a more conventional alignment format like CLUSTAL.
$: alb brk1_aln.gb -sf clustal
The output can be viewed here. It looks like record NM_129400.3 might be causing the problem, so let's take a quick peak at that record to see what's up:
$: alb brk1_aln.gb -pr 'NM_129400.3'
>> LOCUS NM_129400.3 1400 aa linear PLN 30-SEP-2016
>> DEFINITION Arabidopsis thaliana SCAR homolog 2 (SCAR2), mRNA.
>> ACCESSION NM_129400
>> VERSION NM_129400.3
...
>> 1321 vsemvhppik skqddkdsll aqirnksvnl kpavttrpsi qtgprtdlrv aailekanti
>> 1381 rmamagsded edsdswsds*
>> //
Ah ha! It's not a BRK1 gene. It's actually SCAR2, which is a BRK1 interacting protein. It must have been pulled into our DatabaseBuddy search because it's mentioned somewhere in its comments... It would have been better if we had of used the search term 'brk1[Gene Name]' (NCBI-specific syntax) instead of just 'brk1' when we searched GenBank, but no big deal. Let's go ahead and remove the problem record with AlignBuddy's version of --delete_records
; which, unlike the SeqBuddy version, will remove any columns that are 100% gaps after deleting records:
$: alb brk1_aln.gb -dr 'NM_129400.3' -i
>> # ####################### Deleted records ######################## #
>> # Alignment 1
>> NM_129400.3
>> # ################################################################ #
>> File over-written at:
>> ~/BuddySuite/brk1_aln.gb
And look at our example record again:
$: sb brk1_aln.gb -pr 'NM_001200626.1'
LOCUS NM_001200626.1 200 aa linear VRT 10-JUL-2016
DEFINITION Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
subunit (brk1), mRNA.
ACCESSION NM_001200626
VERSION NM_001200626.1
KEYWORDS RefSeq.
SOURCE Ictalurus punctatus (channel catfish)
...
FEATURES Location/Qualifiers
CDS 82..161
/codon_start=1
/db_xref="GeneID:100528356"
/gene="brk1"
/note="probable protein brick1"
/product="protein BRICK1"
/protein_id="NP_001187555.1"
/translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
RLATLNEKLTALERRIEYIEARVTKGETLT"
PKC_PHOSPHO_SIT 111..113
PKC_PHOSPHO_SIT 126..128
CK2_PHOSPHO_SIT 133..136
CK2_PHOSPHO_SIT 140..143
CK2_PHOSPHO_SIT 154..157
ORIGIN
1 ---------- ---------- ---------- ---------- ---------- ----------
61 ---------- ---------- -magqedpvq reih--qdwa nreyievits sikkiadfln
121 sfdms-crsr latlneklt- alerrieyie arvtkgetlt *--------- ----------
181 ---------- ----------
//
This seems much more reasonable. Let's say that BRK1 requires at least one PKC binding site between residues 25 and 45, which corresponds to aligned positions 109 through 129 in our example sequence. We can confirm that all 150 of our aligned sequences have this crucial motif by extracting the target range with --extract_regions
, and then using --pull_records_with_feature
to confirm a PKC site:
$: alb brk1_aln.gb -er '109:129' | sb -prf 'PKC_PHOSPHO_SIT' | alb -ns
>> 150
Looks good.
A common processing step before inferring a phylogenetic tree is to strip out poorly aligned columns. The TrimAl 'gappyout' algorithm has been reimplemented in AlignBuddy, so let's use it to clean up the alignment even further (--trimal
).
$: alb brk1_aln.gb -trm gappyout -o clustal
>> CLUSTAL X (1.81) multiple sequence alignment
>>
>>
>> NM_018462.4 MAGQEDPVQREIHQDWANREYIEIITSSIKKIADFLNSFDMSCRSRLATL
>> NM_127829.3 ---NAVNVGIAVQADWENREFISHISLNVRRLFEFLVQFESTTKSKLASL
>> NM_133937.1 MAGQEDPVQREIHQDWANREYIEIITSSIKKISDFLNSFDMSCRSRLATL
...
>> XM_019824977.1 NEKLTALERRIEYIEARVTKLPDSL
>> XM_019881410.1 NEKLTALERRIEYIEARVTKETLT*
>> XM_019984799.1 NEKLTALERRIEYIEARVTKETLT*
Doing a visual inspection of this alignment in CLUSTAL format convinces us we're in good shape to proceed with phylogenetic inference, so let's save the 'gappyout' version in GenBank format before moving on.
$: alb brk1_aln.gb -trm gappyout -i
>> File over-written at:
>> ~/BuddySuite/brk1_aln.gb
Creating and manipulating phylogenetic trees with PhyloBuddy
As was discussed for AlignBuddy, PhyloBuddy does not reinvent the wheel when it comes to inferring phylogenetic trees. Instead, its --generate_tree
command wraps third-party software like FastTree and RAxML. And just like AlignBuddy, all of the necessary format conversion will happen behind the scenes so you don't have to worry about it.
By default, FastTree uses the Jones-Taylor-Thorton model for amino acid alignments. For our example we'll override this and set the Whelan-And-Goldman model with the FastTree specific -wag
flag. Note that this parameter is enclosed by quotes, so PhyloBuddy knows to treat it as a FastTree flag and not try to interpret it itself. Also, remember that the fasttree
must be visible in your PATH (your executable may be called FastTree
; case matters).
$: pb brk1_aln.gb -gt fasttree "-wag"
>> FastTree Version 2.1.9 Double precision (No SSE3)
>> Alignment: /Volumes/Zippy/.sysTemp/tmp8iytnecj/pb_input.aln
>> Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
>> Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
>> TopHits: 1.00*sqrtN close=default refresh=0.80
>> ML Model: Whelan-And-Goldman, CAT approximation with 20 rate categories
>> Initial topology in 0.01 seconds
...
>> Total time: 2.23 seconds Unique: 41/150 Bad splits: 5/38 Worst delta-LogLk 1.085
>> Returning to PhyloBuddy...
>> (((('XM_005870077.2':0.015843995,(((('XM_003464080.3':0.0,... 'XM_008982139.2':0.01950925)0.204:0.000280801);
The result has been returned in Newick format, but you can change this to Nexus or NeXML if desired with the --output
modifier (we'll also suppress the output from FastTree using the --quiet
flag):
$: pb brk1_aln.gb -gt fasttree "-wag" -o nexus -q
>> #NEXUS
>>
>> BEGIN TAXA;
>> DIMENSIONS NTAX=150;
>> TAXLABELS
>> 'XM_005870077.2'
...
>> BEGIN TREES;
>> TREE 1 = (((('XM_005870077.2':0.015843995 ... 'XM_008982139.2':0.01950925)0.204:0.000280801);
>> END;
Or perhaps you'd like to see the tree represented graphically? Pipe the results into the --display_trees
command:
$: pb brk1_aln.gb -gt fasttree "-wag" -q | pb -dt
>>
Let's finish by rooting the tree on the three plant sequences in our dataset (Zea mays "NM_001111720.1", Arabidopsis "NM_127829.3", and Physcomitrella "XM_001758951.1") with the --root
command:
$: pb brk1_aln.gb -gt fasttree "-wag" -q | pb -rt "NM_001111720.1" "NM_127829.3" "XM_001758951.1" | pb -dt
>>
And that's that! Most of what we've done here could be cobbled together with little bash scripts, 3rd party software, and web tools, but hopefully you can appreciate the power of having all of this functionality available from the same place. Once the sequences were retrieved from GenBank, the entire workflow could be executed as a single line:
$: sb brk1_mRNA.gb -dr 'XM_005063074.1' 'XM_010176471.1' 'NM_129400.3' | sb -efs CDS | sb -tr | alb -ga clustalomega | pb -gt fasttree "-wag" | pb -rt "NM_001111720.1" "NM_127829.3" "XM_001758951.1" | pb -dt
>> # ####################### Deleted records ######################## #
>> XM_005063074.1
>> XM_010176471.1
>> NM_129400.3
>> # ################################################################ #
>> Using 24 threads
>> Read 150 sequences (type: Protein) from /Volumes/Zippy/.sysTemp/tmpk3rh4pe2/tmp.fa
>> Using 52 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 150 sequences)
>> Calculating pairwise ktuple-distances...
>> Ktuple-distance calculation progress done. CPU time: 0.16u 0.07s 00:00:00.23 Elapsed: 00:00:00
>> mBed created 5 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)
>> Distance calculation within sub-clusters done. CPU time: 0.12u 0.07s 00:00:00.19 Elapsed: 00:00:00
>> Guide-tree computation (mBed) done.
>> Progressive alignment progress done. CPU time: 0.35u 0.02s 00:00:00.37 Elapsed: 00:00:01
>> Alignment written to /Volumes/Zippy/.sysTemp/tmpk3rh4pe2/result
>> Returning to AlignBuddy...
>>
>> FastTree Version 2.1.9 Double precision (No SSE3)
>> Alignment: /Volumes/Zippy/.sysTemp/tmpcc5qwumv/pb_input.aln
>> Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
>> Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
>> TopHits: 1.00*sqrtN close=default refresh=0.80
>> ML Model: Whelan-And-Goldman, CAT approximation with 20 rate categories
>> Initial topology in 0.01 seconds
>> Refining topology: 21 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
>> Total branch-length 3.157 after 0.11 sec1, 1 of 39 splits
>> ML-NNI round 1: LogLk = -2387.099 NNIs 15 max delta 5.07 Time 0.53
>> Switched to using 20 rate categories (CAT approximation)19 of 20
>> Rate categories were divided by 0.803 so that average rate = 1.0
>> CAT-based log-likelihoods may not be comparable across runs
>> Use -gamma for approximate but comparable Gamma(20) log-likelihoods
>> ML-NNI round 2: LogLk = -2259.258 NNIs 8 max delta 1.94 Time 1.03
>> ML-NNI round 3: LogLk = -2256.921 NNIs 7 max delta 0.11 Time 1.29
>> ML-NNI round 4: LogLk = -2256.886 NNIs 6 max delta 0.01 Time 1.57
>> Turning off heuristics for final round of ML NNIs (converged)
>> ML-NNI round 5: LogLk = -2256.343 NNIs 6 max delta 0.00 Time 1.97 (final)
>> Optimize all lengths: LogLk = -2256.338 Time 2.06
>> Total time: 2.37 seconds Unique: 41/150 Bad splits: 0/38
>> Returning to PhyloBuddy...
>>