AB Concatenate Alignments

Steve Bond edited this page Jun 20, 2016 · 5 revisions

--concat_alignments, -cta

Description

Concatenates two or more alignments into a single alignment.

Records from each alignment are grouped together based on a shared identifier in their record IDs (e.g., an organism name), and each identifier may appear at most once in each alignment. As explained below, there are three ways to specify how sequences are grouped: auto-detection, a fixed-length prefix or suffix, or a regular expression.

Arguments

If you pass in no arguments, this tool will analyze the IDs of each sequence and select a prefix with the minimum length necessary to ensure unique identification within each alignment, and then use these prefixes to group records among alignments (see example 1).
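
The auto-detection idea can be illustrated with a minimal Python sketch. This is not AlignBuddy's actual implementation; the function name `shortest_unique_prefix_length` is hypothetical, and it assumes each alignment is represented simply as a list of record IDs.

```python
def shortest_unique_prefix_length(alignments):
    """Return the smallest prefix length that keeps IDs unique within every alignment.

    `alignments` is a list of lists of record IDs (one inner list per alignment).
    """
    longest_id = max(len(rec_id) for ids in alignments for rec_id in ids)
    for length in range(1, longest_id + 1):
        # A length works if no alignment has two IDs sharing the same prefix.
        if all(len({rec_id[:length] for rec_id in ids}) == len(ids) for ids in alignments):
            return length
    raise ValueError("Record IDs are not unique within at least one alignment")


# Record IDs from the three alignments in Panx_C-term.physr
alignments = [
    ["Bfo-Panxα1", "Hca-Panxα1", "Mle-Panxα1"],
    ["Bfo-Panxα4", "Hca-Panxα4", "Mle-Panxα4"],
    ["Bfo-Panxα8", "Hca-Panxα8", "Mle-Panxα8"],
]
length = shortest_unique_prefix_length(alignments)
prefixes = sorted({rec_id[:length] for ids in alignments for rec_id in ids})
print(length, prefixes)  # 1 ['B', 'H', 'M']
```

For the example input used on this page, a single character is enough, which matches the "B", "H", and "M" identifiers in example 1.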

Grouping pattern ( regex or int )

Optional. Passing in a positive integer will use a fixed-length prefix from each record ID to group sequences among alignments (see example 2). If the defining string is at the end of each sequence ID, pass in a negative number to specify a fixed-length suffix. Alternatively, passing in a regular expression allows for very precise control of the groupings in cases where a simple prefix/suffix is insufficient.
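
As a rough illustration of the three kinds of grouping pattern, here is a small Python sketch. The helper `group_key` is hypothetical and only mimics the behavior described above; returning `None` when a regular expression finds no match is an assumption made for this example.

```python
import re


def group_key(record_id, pattern):
    """Return the grouping key for a record ID, or None if a regex finds no match."""
    if isinstance(pattern, int):
        if pattern > 0:
            return record_id[:pattern]   # fixed-length prefix
        if pattern < 0:
            return record_id[pattern:]   # fixed-length suffix
        raise ValueError("An integer pattern must be non-zero")
    match = re.search(pattern, record_id)
    return match.group(0) if match else None


print(group_key("Bfo-Panxα1", 3))                # Bfo
print(group_key("Bfo-Panxα1", -2))               # α1
print(group_key("Bfo-Panxα1", "[a-z]{2}-Panx"))  # fo-Panx
```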

Alignment names ( regex or int )

Optional. AlignBuddy annotates the position of each subsequence onto the final concatenated sequence, and this information is written to rich formats like GenBank and EMBL. By default the original record ID is used as the annotation, but this can be overridden with a sub-identifier (specified by integer or regular expression) if you prefer. The alignment name is specified in the same way as the grouping pattern described above, except that the sign of an integer has the opposite meaning (see example 6). The order of the arguments matters, so you cannot specify an alignment name without first specifying a grouping pattern.

Examples

Input file: Panx_C-term.physr

 3 62
Bfo-Panxα1   DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL--
Hca-Panxα1   --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS
Mle-Panxα1  DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--

 3 68
Bfo-Panxα4  -----EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--
Hca-Panxα4  -------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG
Mle-Panxα4  GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---

 3 61
Bfo-Panxα8  GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca-Panxα8  -DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle-Panxα8  ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV

Usage example 1

Pass in zero arguments and AlignBuddy will detect the shortest possible identifier for each new concatenated sequence (in this case, "B", "H", and "M").

$: alb Panx_C-term.physr -cta

Output

 3 191
B  DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
H  --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
M  DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV

Usage example 2

Group records by the three-letter prefix found in each ID by passing in a positive integer as the first argument.

$: alb Panx_C-term.physr -cta 3

Output

 3 191
Bfo  DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca  --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle  DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV

Usage example 3

Use a regular expression to group records instead of a fixed-length prefix. Here, the pattern matches two lowercase letters followed by "-Panx", and the matched text is the component of each ID that the groups are based on.

$: alb Panx_C-term.physr -cta "[a-z]{2}-Panx"

Output

 3 191
fo-Panx  DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
ca-Panx  --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
le-Panx  DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV

Usage example 4

If the group pattern does not find a match in a given alignment, then gaps are filled in for that component of the concatenated alignment.

$: alb Panx_C-term.physr -cta ".*1|..."

Output

 6 191
Bfo-Panxα1  DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-----------------------------------------------------------------------------------------------------------------------------------
Hca-Panxα1  --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS---------------------------------------------------------------------------------------------------------------------------------
Mle-Panxα1  DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL-----------------------------------------------------------------------------------------------------------------------------------
Bfo         -------------------------------------------------------------------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca         ---------------------------------------------------------------------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle         --------------------------------------------------------------GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
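
The gap-filling behavior can be sketched in a few lines of Python. This is an illustration under simple assumptions rather than AlignBuddy's own code: the function `concatenate` is hypothetical, each alignment is a non-empty dict of record ID to aligned sequence, and the toy IDs and sequences below are made up for brevity.

```python
def concatenate(alignments, key_func, gap="-"):
    """Concatenate alignments, padding with gaps where a group has no record.

    `alignments` is a list of dicts mapping record ID -> aligned sequence.
    `key_func` maps a record ID to its grouping key.
    """
    keyed = []
    order = []  # grouping keys in order of first appearance
    for aln in alignments:
        by_key = {}
        for rec_id, seq in aln.items():
            key = key_func(rec_id)
            if key in by_key:
                raise ValueError(f"Key {key!r} occurs more than once in one alignment")
            by_key[key] = seq
            if key not in order:
                order.append(key)
        keyed.append(by_key)

    result = {key: "" for key in order}
    for by_key in keyed:
        width = len(next(iter(by_key.values())))  # all rows in one alignment share a width
        for key in order:
            # Use the record's sequence if present, otherwise a run of gaps.
            result[key] += by_key.get(key, gap * width)
    return result


# Toy data: "Hca" is missing from the second alignment, "Mle" from the first.
alignments = [
    {"Bfo-1": "ACG", "Hca-1": "A-G"},
    {"Bfo-2": "TT", "Mle-2": "TA"},
]
for key, seq in concatenate(alignments, lambda rec_id: rec_id[:3]).items():
    print(key, seq)
# Bfo ACGTT
# Hca A-G--
# Mle ---TA
```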

Usage example 5

The location of each component of the concatenated alignment is stored when using this tool, and will be annotated as a feature if outputting to a rich format like GenBank or EMBL.

$: alb Panx_C-term.physr -cta 3 -o genbank

Output

LOCUS       Bfo                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Bfo
VERSION     Bfo
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Bfo-Panxα1      1..62
     Bfo-Panxα4      63..130
     Bfo-Panxα8      131..191
ORIGIN
        1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
       61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
      121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
      181 hlkqtk-emp v
//
LOCUS       Hca                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Hca
VERSION     Hca
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Hca-Panxα1      1..62
     Hca-Panxα4      63..130
     Hca-Panxα8      131..191
ORIGIN
        1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
       61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
      121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
      181 hlkkaegean v
//
LOCUS       Mle                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Mle
VERSION     Mle
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Mle-Panxα1      1..62
     Mle-Panxα4      63..130
     Mle-Panxα8      131..191
ORIGIN
        1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
       61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
      121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
      181 hlkgsk-dil v
//
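
The feature coordinates above follow directly from the widths of the input alignments (62, 68, and 61 columns). A short Python sketch of that arithmetic, using a hypothetical helper `component_ranges` that is not part of AlignBuddy:

```python
def component_ranges(widths):
    """Return 1-based, inclusive (start, end) ranges for consecutive blocks."""
    ranges = []
    start = 1
    for width in widths:
        end = start + width - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Column counts of the three input alignments in Panx_C-term.physr
for start, end in component_ranges([62, 68, 61]):
    print(f"{start}..{end}")
# 1..62
# 63..130
# 131..191
```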

Usage example 6

Note in the above example that each component is annotated with the original sequence ID. You can shorten this by passing in an integer as the second argument, in which case that number of trailing characters is used as the alignment name. Passing in a negative number takes the characters from the front of the sequence ID instead (the opposite of the grouping pattern argument).

$: alb Panx_C-term.physr -cta ".{3}" 6 -o genbank

Output

LOCUS       Bfo                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Bfo
VERSION     Bfo
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Panxα1          1..62
     Panxα4          63..130
     Panxα8          131..191
ORIGIN
        1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
       61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
      121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
      181 hlkqtk-emp v
//
LOCUS       Hca                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Hca
VERSION     Hca
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Panxα1          1..62
     Panxα4          63..130
     Panxα8          131..191
ORIGIN
        1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
       61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
      121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
      181 hlkkaegean v
//
LOCUS       Mle                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Mle
VERSION     Mle
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Panxα1          1..62
     Panxα4          63..130
     Panxα8          131..191
ORIGIN
        1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
       61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
      121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
      181 hlkgsk-dil v
//
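
Because the sign convention for integer alignment names is the reverse of the grouping pattern, a small Python sketch may help. The function `name_from_int` is hypothetical and only mirrors the rule described in this example.

```python
def name_from_int(record_id, length):
    """Positive: take trailing characters. Negative: take leading characters."""
    if length > 0:
        return record_id[-length:]
    if length < 0:
        return record_id[:-length]
    raise ValueError("An integer alignment name must be non-zero")


print(name_from_int("Bfo-Panxα1", 6))   # Panxα1
print(name_from_int("Bfo-Panxα1", -3))  # Bfo
```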

Usage example 7

Alignment names can also be specified with a regular expression. If no match is found, then the name reverts to the whole record ID.

$: alb Panx_C-term.physr -cta 3 "Panxα[1-5]" -o gb

Output

LOCUS       Bfo                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Bfo
VERSION     Bfo
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Panxα1          1..62
     Panxα4          63..130
     Bfo-Panxα8      131..191
ORIGIN
        1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
       61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
      121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
      181 hlkqtk-emp v
//
LOCUS       Hca                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Hca
VERSION     Hca
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Panxα1          1..62
     Panxα4          63..130
     Hca-Panxα8      131..191
ORIGIN
        1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
       61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
      121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
      181 hlkkaegean v
//
LOCUS       Mle                      191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   Mle
VERSION     Mle
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Panxα1          1..62
     Panxα4          63..130
     Mle-Panxα8      131..191
ORIGIN
        1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
       61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
      121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
      181 hlkgsk-dil v
//
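
The fallback to the whole record ID can be sketched in Python as follows. The helper `name_from_regex` is hypothetical; it uses the standard `re` module to reproduce the naming seen in the output above.

```python
import re


def name_from_regex(record_id, pattern):
    """Return the matched text, or the whole record ID if nothing matches."""
    match = re.search(pattern, record_id)
    return match.group(0) if match else record_id


print(name_from_regex("Bfo-Panxα1", "Panxα[1-5]"))  # Panxα1
print(name_from_regex("Bfo-Panxα8", "Panxα[1-5]"))  # Bfo-Panxα8
```

The second ID ends in 8, which falls outside `[1-5]`, so the full ID is kept.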

Usage example 8

If you want to get really fancy with your regular expressions, include parenthesized groups. Only the text matched inside the parentheses is used in the final names, and multiple groups are joined together in order.

$: alb Panx_C-term.physr -cta "^(.).{3}([^0-9]+)" "(P)anx(α.*)" -o genbank

Output

LOCUS       BPanxα                   191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   BPanxα
VERSION     BPanxα
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Pα1             1..62
     Pα4             63..130
     Pα8             131..191
ORIGIN
        1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
       61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
      121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
      181 hlkqtk-emp v
//
LOCUS       HPanxα                   191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   HPanxα
VERSION     HPanxα
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Pα1             1..62
     Pα4             63..130
     Pα8             131..191
ORIGIN
        1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
       61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
      121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
      181 hlkkaegean v
//
LOCUS       MPanxα                   191 aa                     UNK 01-JAN-1980
DEFINITION  .
ACCESSION   MPanxα
VERSION     MPanxα
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     Pα1             1..62
     Pα4             63..130
     Pα8             131..191
ORIGIN
        1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
       61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
      121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
      181 hlkgsk-dil v
//
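
The way the parenthesized groups combine into the names above can be reproduced with Python's `re` module. This is a sketch only: `joined_groups` is a hypothetical helper, and falling back to the whole match when the pattern has no groups is an assumption.

```python
import re


def joined_groups(record_id, pattern):
    """Join the text captured by each parenthesized group, in order."""
    match = re.search(pattern, record_id)
    if match is None:
        return None
    groups = match.groups()
    if not groups:
        return match.group(0)  # no parentheses in the pattern: use the whole match
    return "".join(g for g in groups if g is not None)


print(joined_groups("Bfo-Panxα1", "^(.).{3}([^0-9]+)"))  # BPanxα
print(joined_groups("Bfo-Panxα1", "(P)anx(α.*)"))        # Pα1
```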
