Skip to content
Steve Bond edited this page Apr 12, 2017 · 22 revisions

Getting to know BuddySuite

So you've heard that the BuddySuite toolkits can make your life easier and you've installed it on your machine, but now what? There are over 100 tools spread across four separate programs (SeqBuddy, AlignBuddy, PhyloBuddy, and DatabaseBuddy), which is a bit much to just dive right into.

To get you started, the following tutorial is a walk-through of a full workflow that begins by downloading a collection of RefSeq mRNAs and ends with a phylogenetic tree. This should give you a nice sampling of the BuddySuite commands, familiarize you with the general syntax and scope of the tools available, and hopefully whet your appetite for exploring the rest of the suite.

Note that we will be using short-form aliases of each toolkit throughout the tutorial (alb=AlignBuddy, db=DatabaseBuddy, pb=PhyloBuddy, sb=SeqBuddy), while tool flags will be given in long-form in the description and short-form in the actual example (e.g., --pull_records, -pr).

So let's begin...

Download mRNAs for the BRICK1 gene (BRK1) from GenBank using DatabaseBuddy

Below is a screen capture of a DatabaseBuddy session where RefSeq mRNAs are retrieved from GenBank for the cytoskeleton protein BRK1. As you replicate each step yourself, take a moment to familiarize yourself with the commands being used by typing help and then the keyword before you run the actual example command. For instance, before running the first command database ncbi_nuc, do the following:

DbBuddy> help database

>> Reset the database(s) to be searched. Separate multiple databases with spaces.
>> Currently set to: ncbi_nuc, ncbi_prot, uniprot, ensembl
>> Valid choices: all, ncbi_nuc, ncbi_prot, uniprot, ensembl

The output will be green in your terminal window, and explains what the database command does. You can also type the word help on it's own to get a full list of available commands.

(Click to enlarge)

One common 'gotcha' with DatabaseBuddy happens during the final write step. You MUST first specify the format you want to write your fetched sequences in (in this case, we selected GenBank format), otherwise you will create a summary file with no actual sequence data.

Analyze and tidy up BRK1 sequences using SeqBuddy

Approximately 150 mRNAs should have been retrieved during the DatabaseBuddy session. To get an exact count, use the --num_seqs command:

$: sb brk1_mRNA.gb -ns

>> 153

You may get slightly different results because GenBank is being continuously updated, but you shouldn't have a wildly different number than 153. If you wish to work with the exact sequences used to create this tutorial, they can be found here.

Now let's have a look at what was actually downloaded by listing all of the sequence IDs with the --list-ids command. By passing a number into the optional columns argument, the output will be organized into that many columns (in this case, five):

$: sb brk1_mRNA.gb -li 5

>> XM_015069351.1       XM_002925023.3  XM_014607950.2  XM_003217549.3  XM_012471870.1
>> XM_012441180.1       NM_129400.3     NM_127829.3     XM_007254551.2  XM_014006305.1
...
>> XM_008697894.1       XM_006206815.2  NM_001171700.1  NM_001171693.1  NM_001015994.2
>> XM_018093034.1       XM_005796913.1  NM_001111720.1

It looks like many sequences have not been curated (i.e., they are labelled with the XM designation). Let's see exactly how many of the sequences are actually curated by piping the results of the --pull_records command into the --num_seqs command:

$: sb brk1_mRNA.gb -pr "NM_" | sb -ns

>> 14

Piping data to and from BuddySuite tools is a core feature of the suite. It doesn't matter whether the piped data comes from a BuddySuite tool or not, so long as the data is sent to stdout in any of the sequence, alignment, or tree formats supported by the receiving tool.

Be careful though! Not all BuddySuite tools actually return sequence, alignment, or tree data. We just saw this in the previous example, because --num_seqs returns a number. If we tried to pipe this back into --list-ids, it would fail:

$: sb brk1_mRNA.gb -pr "NM_" | sb -ns | sb -li 5

>> GuessError: Could not determine format from input file-like object
>>  --> 14
>>  ...
>> Try explicitly setting with -f flag.

The term 'file-like' object is a bit cryptic, as it refers to how SeqBuddy is trying to read the data from the terminal, but if you run into an error message like this check to see if the output from one of your steps is non-sequence data.

Fourteen records is a manageable number to visually inspect, so have a quick look at them:

$: sb brk1_mRNA.gb -pr "NM_"

>> LOCUS       NM_166648                349 bp    DNA     linear   INV 13-DEC-2016
>> DEFINITION  Drosophila melanogaster haematopoietic stem/progenitor cell protein
>>             300, transcript variant A (HSPC300), mRNA.
...
>>      4801 ctatgtatgt gtatttgaat acatacatag atatgtattg tgtgttctca atcatttgca
>>      4861 gtgtcctttg actgtatatt tggtc
>> //

There is quite a bit of meta data associated with these GenBank records, so for clarity we will often pull out a single record ($: sb brk1_mRNA.gb -pr 'NM_001200626') in the following steps as a sanity check:

LOCUS       NM_001200626             439 bp    DNA     linear   VRT 10-JUL-2016
DEFINITION  Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
            subunit (brk1), mRNA.
ACCESSION   NM_001200626
VERSION     NM_001200626.1
KEYWORDS    RefSeq.
SOURCE      Ictalurus punctatus (channel catfish)
...
FEATURES             Location/Qualifiers
     source          1..439
                     /chromosome="21"
                     /db_xref="taxon:7998"
                     /map="21"
                     /mol_type="mRNA"
                     /organism="Ictalurus punctatus"
     gene            1..439
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="BRICK1, SCAR/WAVE actin nucleating complex subunit"
     CDS             16..243
                     /codon_start=1
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="probable protein brick1"
                     /product="protein BRICK1"
                     /protein_id="NP_001187555.1"
                     /translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
                     RLATLNEKLTALERRIEYIEARVTKGETLT"
ORIGIN
        1 gcagctggag ggaaaatggc cggccaggag gatccagtgc agagagaaat ccaccaggat
       61 tgggcgaatc gcgagtatat tgaagtgatt acaagcagca tcaagaaaat agccgatttc
      121 ctcaactcgt ttgatatgtc ctgcaggtcc cgtttagcca cgctgaacga gaagctcacc
      181 gccctggaga gaaggatcga gtacatcgaa gcccgggtta caaaagggga aacgctgacc
      241 taaccatccc gacgctgacg actggacaac atggccacac tcctgtttct ttttcttttc
      301 ttcccctcta tcctcattca ccttaaataa ccaagcgact gtttttcaac cagtctctta
      361 aagccccagt gaggaactgc tgatacagtg ctacgctgcg ttaggactta aaaaaggaac
      421 atttaaaaaa aaaaaaaaa
//

This is a pretty standard looking messenger RNA, with 3' and 5' untranslated regions and a classic poly-A tail. Notice that the actual coding sequence (CDS) is listed as an annotation, spanning residues 16-243. Every other record in our subset of 14 curated sequences also has a CDS annotation, which is what we want to focus on for our forthcoming phylogenetic analysis. We should still confirm that all 153 sequences in our dataset have a CDS annotation, however, so let's use the --pull_records_with_feature command:

$: sb brk1_mRNA.gb -prf CDS | sb -ns

>> 153

All 153 sequences were returned, meaning all are appropriately annotated. Now we should pull each CDS out of its record using --extract_feature_sequences.

$: sb brk1_mRNA.gb -efs CDS

>> LOCUS       XM_009569415             114 bp    DNA              UNK 01-JAN-1980
>> DEFINITION  PREDICTED: Cuculus canorus BRICK1, SCAR/WAVE actin-nucleating
...
>>       181 atagagtaca ttgaagcacg ggtgacaaaa ggtgagaccc tcacctag
>> //

Again, let's just look at NM_001200626:

LOCUS       NM_001200626             228 bp    DNA     linear   VRT 10-JUL-2016
DEFINITION  Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
            subunit (brk1), mRNA.
ACCESSION   NM_001200626
VERSION     NM_001200626.1
KEYWORDS    RefSeq.
SOURCE      Ictalurus punctatus (channel catfish)
...
FEATURES             Location/Qualifiers
     source          1..228
                     /chromosome="21"
                     /db_xref="taxon:7998"
                     /map="21"
                     /mol_type="mRNA"
                     /organism="Ictalurus punctatus"
     gene            1..228
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="BRICK1, SCAR/WAVE actin nucleating complex subunit"
     CDS             1..228
                     /codon_start=1
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="probable protein brick1"
                     /product="protein BRICK1"
                     /protein_id="NP_001187555.1"
                     /translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
                     RLATLNEKLTALERRIEYIEARVTKGETLT"
ORIGIN
        1 atggccggcc aggaggatcc agtgcagaga gaaatccacc aggattgggc gaatcgcgag
       61 tatattgaag tgattacaag cagcatcaag aaaatagccg atttcctcaa ctcgtttgat
      121 atgtcctgca ggtcccgttt agccacgctg aacgagaagc tcaccgccct ggagagaagg
      181 atcgagtaca tcgaagcccg ggttacaaaa ggggaaacgc tgacctaa
//

Notice how all of the annotations now span the same 228 residues? When SeqBuddy pulled out the subsequence, it also shifted the annotations as needed. Part of the 'gene' and 'source' sequence was deleted, but what remains is still annotated; of course this information is no longer useful, so we can clean the records up a bit by removing those two annotations with the --delete_features command:

$: sb brk1_mRNA.gb -efs CDS | sb -df 'gene' 'source'
LOCUS       NM_001200626             228 bp    DNA     linear   VRT 10-JUL-2016
DEFINITION  Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
            subunit (brk1), mRNA.
ACCESSION   NM_001200626
VERSION     NM_001200626.1
KEYWORDS    RefSeq.
SOURCE      Ictalurus punctatus (channel catfish)
...
FEATURES             Location/Qualifiers
     CDS             1..228
                     /codon_start=1
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="probable protein brick1"
                     /product="protein BRICK1"
                     /protein_id="NP_001187555.1"
                     /translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
                     RLATLNEKLTALERRIEYIEARVTKGETLT"
ORIGIN
        1 atggccggcc aggaggatcc agtgcagaga gaaatccacc aggattgggc gaatcgcgag
       61 tatattgaag tgattacaag cagcatcaag aaaatagccg atttcctcaa ctcgtttgat
      121 atgtcctgca ggtcccgttt agccacgctg aacgagaagc tcaccgccct ggagagaagg
      181 atcgagtaca tcgaagcccg ggttacaaaa ggggaaacgc tgacctaa
//

Great! So we can isolate the coding sequences for each of our sequences and tidy up the record a little bit. For the phylogeny we're going to use amino acid alignments instead of nucleotide, and the --translate tool can handle the conversion:

$: sb brk1_mRNA.gb -efs CDS | sb -df 'gene' 'source' | sb -tr

>> Warning: size mismatch between aa and nucl seqs for XM_010176471.1 --> 146, 48
>> Warning: size mismatch between aa and nucl seqs for XM_005063074.1 --> 199, 66
>> LOCUS       XM_009569415              38 aa                     UNK 01-JAN-1980
>> DEFINITION  PREDICTED: Cuculus canorus BRICK1, SCAR/WAVE actin-nucleating
...
>> ORIGIN
>>         1 magqedlvqr eidqdwanre yievitssik kiadflnsfd mscrsrlatl nekltalerr
>>        61 ieyiearvtk getlt*
>> //

Uh-oh... There's a warning message telling us that there's something a little weird with records XM_010176471.1 and XM_005063074.1. Use --pull_records to check them out:

sb brk1_mRNA.gb -pr 'XM_005063074.1' 'XM_010176471.1'
LOCUS       XM_005063074             199 bp    DNA     linear   VRT 20-APR-2016
DEFINITION  PREDICTED: Ficedula albicollis BRICK1, SCAR/WAVE actin nucleating
            complex subunit (BRK1), partial mRNA.
...
COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
            analysis. This record is derived from a genomic sequence
            (NW_004792212.1) annotated using gene prediction method: Gnomon.
            Also see:
            Documentation of NCBI's Annotation Process
            Release 101
            pipeline
            COMPLETENESS: incomplete on both ends.
FEATURES             Location/Qualifiers
     source          1..199
                     /chromosome="Unknown"
                     /collection_date="2009"
                     /country="Sweden: Oland"
                     /db_xref="taxon:59894"
                     /isolate="OC2"
                     /mol_type="mRNA"
                     /organism="Ficedula albicollis"
                     /sex="male"
     gene            <1..>199
                     /db_xref="GeneID:101821110"
                     /gene="BRK1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 11 Proteins, and 84% coverage of
                     the annotated genomic feature by RNAseq alignments,
                     including 18 samples with support for all annotated
                     introns"
     CDS             <1..>199
                     /codon_start=1
                     /db_xref="GeneID:101821110"
                     /gene="BRK1"
                     /product="protein BRICK1"
                     /protein_id="XP_005063131.1"
                     /translation="EIHQDWANREYIEVITSSIKKIADFLNSFDMSCRSRLATLNEKLT
                     ALERRIEYIEARVSSGGWAGP"
ORIGIN
        1 gagatccacc aggactgggc caaccgcgag tacatcgagg tgatcaccag ctccatcaag
       61 aagatcgcgg acttcctcaa ctccttcgac atgtcgtgcc ggtcccgact ggccaccctg
      121 aacgagaagc tgacggcgct ggagcggagg atcgagtaca tcgaggcccg ggtgagcagc
      181 ggggggtggg ccgggccgg
//

It would seem that these are both partial records... We should check that more partials are not lurking in our dataset, but didn't throw a warning message because they happened to be the proper length for translation. To do so, we can again use --pull_records command, but instead of restricting ourselves to the sequence IDs we'll pass in the optional parameter 'full' to search through the record comments as well.

sb brk1_mRNA.gb -pr 'partial' full | sb -ns

>> 2

So it's just those two records; probably best to delete them from our analysis with the --delete_records command. By combining this command with the --in_place modifier, the file will be overwritten with the appropriate records removed:

sb brk1_mRNA.gb -dr 'XM_005063074.1' 'XM_010176471.1' -i

>> # ####################### Deleted records ######################## #
>> XM_005063074.1
>> XM_010176471.1
>> # ################################################################ #
>> File overwritten at:
>> ~/BuddySuite/brk1_mRNA.gb

One other thing that seems to be missing from our records are annotations for phosphorylation sites. As we are all experts in BRK1 (*wink*), we know that it contains highly conserved binding motifs for both Protein Kinase C and Casein Kinase II. We can fetch this information using the --prosite_scan command, which will send our sequences off to EMBL-EBI's implementation of PROSITE scan and automatically process the results into GenBank annotations. Let's also redirect the final output into a new file, because the Prosite Scan tool can take a few minutes to run:

$: sb brk1_mRNA.gb -efs CDS | sb -df 'gene' 'source' | sb -tr | sb -psc > brk1_mRNA_psc.gb

>> Running function _mc_run_prosite() on 21 cores
>>     DONE: 151 jobs in 4 min, 15 sec

$: sb brk1_mRNA_psc.gb -pr 'NM_001200626'
LOCUS       NM_001200626.1            76 aa            linear   VRT 10-JUL-2016
DEFINITION  Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
            subunit (brk1), mRNA.
ACCESSION   NM_001200626
VERSION     NM_001200626.1
KEYWORDS    RefSeq.
SOURCE      Ictalurus punctatus (channel catfish)
...
FEATURES             Location/Qualifiers
     CDS             1..76
                     /codon_start=1
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="probable protein brick1"
                     /product="protein BRICK1"
                     /protein_id="NP_001187555.1"
                     /translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
                     RLATLNEKLTALERRIEYIEARVTKGETLT"
     PKC_PHOSPHO_SIT 28..30
     PKC_PHOSPHO_SIT 42..44
     CK2_PHOSPHO_SIT 49..52
     CK2_PHOSPHO_SIT 55..58
     CK2_PHOSPHO_SIT 69..72
ORIGIN
        1 magqedpvqr eihqdwanre yievitssik kiadflnsfd mscrsrlatl nekltalerr
       61 ieyiearvtk getlt*
//

Creating and manipulating multiple sequence alignment with AlignBuddy

Now it's time to bring AlignBuddy into the example to help us create a multiple sequence alignment. Very smart people have spent a good deal of time creating alignment tools, so AlignBuddy does not calculate alignments on its own. Instead, its --generate_alignment tool handles all the pre-processing required to pass your sequences into third-party software like Clustal Omega or MAFFT, then it post-processes the results to retain sequence annotations.

$: alb brk1_mRNA_psc.gb -ga clustalo

>> AttributeError: clustalo is not a recognized alignment tool. Please check your spelling (case sensitive)

Whoops. Clustal Omega hasn't been installed on our computer! Go to the Clustal website to get the software, then make sure that 'clustalo' is visible in your PATH. If you have named your binary something else, like 'clustal_omega', or 'super_duper_aligner', simply pass in that name; AlignBuddy will figure out what it's supposed to do from there (you can also pass in the full path to the executable if you can't put it directly in your PATH).

Once you get the --generate_alignment command working, redirect the output into a new alignment file. The use SeqBuddy to pull out our example record again.

$: alb brk1_mRNA_psc.gb -ga clustalomega > brk1_aln.gb

>> Using 24 threads
>> Read 151 sequences (type: Protein) from /Volumes/Zippy/.sysTemp/tmpneqazna1/tmp.fa
>> Using 52 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 151 sequences)
...
>> Alignment written to .sysTemp/tmpneqazna1/result
>> Returning to AlignBuddy...

$: sb brk1_aln.gb -pr 'NM_001200626.1'
LOCUS       NM_001200626.1          1465 aa            linear   VRT 10-JUL-2016
DEFINITION  Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
            subunit (brk1), mRNA.
ACCESSION   NM_001200626
VERSION     NM_001200626.1
KEYWORDS    RefSeq.
SOURCE      Ictalurus punctatus (channel catfish)
...
FEATURES             Location/Qualifiers
     CDS             82..161
                     /codon_start=1
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="probable protein brick1"
                     /product="protein BRICK1"
                     /protein_id="NP_001187555.1"
                     /translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
                     RLATLNEKLTALERRIEYIEARVTKGETLT"
     PKC_PHOSPHO_SIT 111..113
     PKC_PHOSPHO_SIT 126..128
     CK2_PHOSPHO_SIT 133..136
     CK2_PHOSPHO_SIT 140..143
     CK2_PHOSPHO_SIT 154..157
ORIGIN
        1 ---------- ---------- ---------- ---------- ---------- ----------
       61 ---------- ---------- -magqedpvq reih--qdwa nreyievits sikkiadfln
      121 sfdms-crsr latlneklt- alerrieyie arvtkgetlt *--------- ----------
      181 ---------- ---------- ---------- ---------- ---------- ----------
      241 ---------- ---------- ---------- ---------- ---------- ----------
      301 ---------- ---------- ---------- ---------- ---------- ----------
      361 ---------- ---------- ---------- ---------- ---------- ----------
      421 ---------- ---------- ---------- ---------- ---------- ----------
      481 ---------- ---------- ---------- ---------- ---------- ----------
      541 ---------- ---------- ---------- ---------- ---------- ----------
      601 ---------- ---------- ---------- ---------- ---------- ----------
      661 ---------- ---------- ---------- ---------- ---------- ----------
      721 ---------- ---------- ---------- ---------- ---------- ----------
      781 ---------- ---------- ---------- ---------- ---------- ----------
      841 ---------- ---------- ---------- ---------- ---------- ----------
      901 ---------- ---------- ---------- ---------- ---------- ----------
      961 ---------- ---------- ---------- ---------- ---------- ----------
     1021 ---------- ---------- ---------- ---------- ---------- ----------
     1081 ---------- ---------- ---------- ---------- ---------- ----------
     1141 ---------- ---------- ---------- ---------- ---------- ----------
     1201 ---------- ---------- ---------- ---------- ---------- ----------
     1261 ---------- ---------- ---------- ---------- ---------- ----------
     1321 ---------- ---------- ---------- ---------- ---------- ----------
     1381 ---------- ---------- ---------- ---------- ---------- ----------
     1441 ---------- ---------- -----
//

So there are a few things to look at here. First, notice that the GenBank annotations have all been adjusted to account for the gaps; we'll be using this property to further assess our alignment in a little bit. We can also see that something is not quite right, because BRK1 is a highly conserved protein around 100 amino acids long, yet our alignment is almost 1500 amino acids long (most of which is gaps in this record). It might be easier to assess the situation if we use --screw_formats to switch to a more conventional alignment format like CLUSTAL.

$: alb brk1_aln.gb -sf clustal

The output can be viewed here. It looks like record NM_129400.3 might be causing the problem, so let's take a quick peak at that record to see what's up:

$: alb brk1_aln.gb -pr 'NM_129400.3'

>> LOCUS       NM_129400.3             1400 aa            linear   PLN 30-SEP-2016
>> DEFINITION  Arabidopsis thaliana SCAR homolog 2 (SCAR2), mRNA.
>> ACCESSION   NM_129400
>> VERSION     NM_129400.3
...
>>      1321 vsemvhppik skqddkdsll aqirnksvnl kpavttrpsi qtgprtdlrv aailekanti
>>      1381 rmamagsded edsdswsds*
>> //

Ah ha! It's not a BRK1 gene. It's actually SCAR2, which is a BRK1 interacting protein. It must have been pulled into our DatabaseBuddy search because it's mentioned somewhere in its comments... It would have been better if we had of used the search term 'brk1[Gene Name]' (NCBI-specific syntax) instead of just 'brk1' when we searched GenBank, but no big deal. Let's go ahead and remove the problem record with AlignBuddy's version of --delete_records; which, unlike the SeqBuddy version, will remove any columns that are 100% gaps after deleting records:

$: alb brk1_aln.gb -dr 'NM_129400.3' -i

>> # ####################### Deleted records ######################## #
>> # Alignment 1
>> NM_129400.3
>> # ################################################################ #
>> File over-written at:
>> ~/BuddySuite/brk1_aln.gb

And look at our example record again:

$: sb brk1_aln.gb -pr 'NM_001200626.1'
LOCUS       NM_001200626.1           200 aa            linear   VRT 10-JUL-2016
DEFINITION  Ictalurus punctatus BRICK1, SCAR/WAVE actin nucleating complex
            subunit (brk1), mRNA.
ACCESSION   NM_001200626
VERSION     NM_001200626.1
KEYWORDS    RefSeq.
SOURCE      Ictalurus punctatus (channel catfish)
...
FEATURES             Location/Qualifiers
     CDS             82..161
                     /codon_start=1
                     /db_xref="GeneID:100528356"
                     /gene="brk1"
                     /note="probable protein brick1"
                     /product="protein BRICK1"
                     /protein_id="NP_001187555.1"
                     /translation="MAGQEDPVQREIHQDWANREYIEVITSSIKKIADFLNSFDMSCRS
                     RLATLNEKLTALERRIEYIEARVTKGETLT"
     PKC_PHOSPHO_SIT 111..113
     PKC_PHOSPHO_SIT 126..128
     CK2_PHOSPHO_SIT 133..136
     CK2_PHOSPHO_SIT 140..143
     CK2_PHOSPHO_SIT 154..157
ORIGIN
        1 ---------- ---------- ---------- ---------- ---------- ----------
       61 ---------- ---------- -magqedpvq reih--qdwa nreyievits sikkiadfln
      121 sfdms-crsr latlneklt- alerrieyie arvtkgetlt *--------- ----------
      181 ---------- ----------
//

This seems much more reasonable. Let's say that BRK1 requires at least one PKC binding site between residues 25 and 45, which corresponds to aligned positions 109 through 129 in our example sequence. We can confirm that all 150 of our aligned sequences have this crucial motif by extracting the target range with --extract_regions, and then using --pull_records_with_feature to confirm a PKC site:

$: alb brk1_aln.gb -er '109:129' | sb -prf 'PKC_PHOSPHO_SIT' | alb -ns

>> 150

Looks good.

A common processing step before inferring a phylogenetic tree is to strip out poorly aligned columns. The TrimAl 'gappyout' algorithm has been reimplemented in AlignBuddy, so let's use it to clean up the alignment even further (--trimal).

$: alb brk1_aln.gb -trm gappyout -o clustal

>> CLUSTAL X (1.81) multiple sequence alignment
>> 
>> 
>> NM_018462.4                         MAGQEDPVQREIHQDWANREYIEIITSSIKKIADFLNSFDMSCRSRLATL
>> NM_127829.3                         ---NAVNVGIAVQADWENREFISHISLNVRRLFEFLVQFESTTKSKLASL
>> NM_133937.1                         MAGQEDPVQREIHQDWANREYIEIITSSIKKISDFLNSFDMSCRSRLATL
...
>> XM_019824977.1                      NEKLTALERRIEYIEARVTKLPDSL
>> XM_019881410.1                      NEKLTALERRIEYIEARVTKETLT*
>> XM_019984799.1                      NEKLTALERRIEYIEARVTKETLT*

Doing a visual inspection of this alignment in CLUSTAL format convinces us we're in good shape to proceed with phylogenetic inference, so let's save the 'gappyout' version in GenBank format before moving on.

$: alb brk1_aln.gb -trm gappyout -i

>> File over-written at:
>> ~/BuddySuite/brk1_aln.gb

Creating and manipulating phylogenetic trees with PhyloBuddy

As was discussed for AlignBuddy, PhyloBuddy does not reinvent the wheel when it comes to inferring phylogenetic trees. Instead, its --generate_tree command wraps third-party software like FastTree and RAxML. And just like AlignBuddy, all of the necessary format conversion will happen behind the scenes so you don't have to worry about it.

By default, FastTree uses the Jones-Taylor-Thorton model for amino acid alignments. For our example we'll override this and set the Whelan-And-Goldman model with the FastTree specific -wag flag. Note that this parameter is enclosed by quotes, so PhyloBuddy knows to treat it as a FastTree flag and not try to interpret it itself. Also, remember that the fasttree must be visible in your PATH (your executable may be called FastTree; case matters).

$: pb brk1_aln.gb -gt fasttree "-wag"

>> FastTree Version 2.1.9 Double precision (No SSE3)
>> Alignment: /Volumes/Zippy/.sysTemp/tmp8iytnecj/pb_input.aln
>> Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
>> Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
>> TopHits: 1.00*sqrtN close=default refresh=0.80
>> ML Model: Whelan-And-Goldman, CAT approximation with 20 rate categories
>> Initial topology in 0.01 seconds
...
>> Total time: 2.23 seconds Unique: 41/150 Bad splits: 5/38 Worst delta-LogLk 1.085
>> Returning to PhyloBuddy...

>> (((('XM_005870077.2':0.015843995,(((('XM_003464080.3':0.0,... 'XM_008982139.2':0.01950925)0.204:0.000280801);

The result has been returned in Newick format, but you can change this to Nexus or NeXML if desired with the --output modifier (we'll also suppress the output from FastTree using the --quiet flag):

$: pb brk1_aln.gb -gt fasttree "-wag" -o nexus -q

>> #NEXUS
>> 
>> BEGIN TAXA;
>>     DIMENSIONS NTAX=150;
>>     TAXLABELS
>>         'XM_005870077.2'
...
>> BEGIN TREES;
>>     TREE 1 = (((('XM_005870077.2':0.015843995 ... 'XM_008982139.2':0.01950925)0.204:0.000280801);
>> END;

Or perhaps you'd like to see the tree represented graphically? Pipe the results into the --display_trees command:

$: pb brk1_aln.gb -gt fasttree "-wag" -q | pb -dt

>>

(Click to enlarge)

Let's finish by rooting the tree on the three plant sequences in our dataset (Zea mays "NM_001111720.1", Arabidopsis "NM_127829.3", and Physcomitrella "XM_001758951.1") with the --root command:

$: pb brk1_aln.gb -gt fasttree "-wag" -q | pb -rt "NM_001111720.1" "NM_127829.3" "XM_001758951.1" | pb -dt

>>

(Click to enlarge)

And that's that! Most of what we've done here could be cobbled together with little bash scripts, 3rd party software, and web tools, but hopefully you can appreciate the power of having all of this functionality available from the same place. Once the sequences were retrieved from GenBank, the entire workflow could be executed as a single line:

$: sb brk1_mRNA.gb -dr 'XM_005063074.1' 'XM_010176471.1' 'NM_129400.3' | sb -efs CDS | sb -tr | alb -ga clustalomega | pb -gt fasttree "-wag" | pb -rt "NM_001111720.1" "NM_127829.3" "XM_001758951.1" | pb -dt

>> # ####################### Deleted records ######################## #
>> XM_005063074.1
>> XM_010176471.1
>> NM_129400.3
>> # ################################################################ #
>> Using 24 threads
>> Read 150 sequences (type: Protein) from /Volumes/Zippy/.sysTemp/tmpk3rh4pe2/tmp.fa
>> Using 52 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 150 sequences)
>> Calculating pairwise ktuple-distances...
>> Ktuple-distance calculation progress done. CPU time: 0.16u 0.07s 00:00:00.23 Elapsed: 00:00:00
>> mBed created 5 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)
>> Distance calculation within sub-clusters done. CPU time: 0.12u 0.07s 00:00:00.19 Elapsed: 00:00:00
>> Guide-tree computation (mBed) done.
>> Progressive alignment progress done. CPU time: 0.35u 0.02s 00:00:00.37 Elapsed: 00:00:01
>> Alignment written to /Volumes/Zippy/.sysTemp/tmpk3rh4pe2/result
>> Returning to AlignBuddy...
>> 
>> FastTree Version 2.1.9 Double precision (No SSE3)
>> Alignment: /Volumes/Zippy/.sysTemp/tmpcc5qwumv/pb_input.aln
>> Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
>> Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
>> TopHits: 1.00*sqrtN close=default refresh=0.80
>> ML Model: Whelan-And-Goldman, CAT approximation with 20 rate categories
>> Initial topology in 0.01 seconds
>> Refining topology: 21 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
>> Total branch-length 3.157 after 0.11 sec1, 1 of 39 splits
>> ML-NNI round 1: LogLk = -2387.099 NNIs 15 max delta 5.07 Time 0.53
>> Switched to using 20 rate categories (CAT approximation)19 of 20
>> Rate categories were divided by 0.803 so that average rate = 1.0
>> CAT-based log-likelihoods may not be comparable across runs
>> Use -gamma for approximate but comparable Gamma(20) log-likelihoods
>> ML-NNI round 2: LogLk = -2259.258 NNIs 8 max delta 1.94 Time 1.03
>> ML-NNI round 3: LogLk = -2256.921 NNIs 7 max delta 0.11 Time 1.29
>> ML-NNI round 4: LogLk = -2256.886 NNIs 6 max delta 0.01 Time 1.57
>> Turning off heuristics for final round of ML NNIs (converged)
>> ML-NNI round 5: LogLk = -2256.343 NNIs 6 max delta 0.00 Time 1.97 (final)
>> Optimize all lengths: LogLk = -2256.338 Time 2.06
>> Total time: 2.37 seconds Unique: 41/150 Bad splits: 0/38
>> Returning to PhyloBuddy...
>> 

(Click to enlarge)

Main Toolkit Pages





Further Reading

Clone this wiki locally