AB Concatenate Alignments
Concatenates two or more alignments into a single alignment.
Records from each alignment are grouped together based on a shared identifier in their record IDs (e.g., an organism name), and each identifier may appear at most once in each alignment. As explained further below, the grouping can be specified in three ways: auto-detection, a fixed-length prefix or suffix, or a regular expression.
If you pass in no arguments, this tool will analyze the IDs of each sequence and select a prefix with the minimum length necessary to ensure unique identification within each alignment, and then use these prefixes to group records among alignments (see example 1).
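The auto-detection idea can be sketched in a few lines of Python. This is only an illustration of the rule stated above (the shortest prefix length that keeps IDs unique within every alignment), not AlignBuddy's actual implementation; the function name `shortest_unique_prefix_length` is hypothetical.

```python
def shortest_unique_prefix_length(alignments):
    """Return the smallest k for which the first k characters of the
    record IDs are unique within every alignment (hypothetical helper)."""
    longest = max(len(rec_id) for ids in alignments for rec_id in ids)
    for k in range(1, longest + 1):
        # Prefixes are unique in an alignment when no two IDs collapse together
        if all(len({rec_id[:k] for rec_id in ids}) == len(ids) for ids in alignments):
            return k
    raise ValueError("Record IDs are not unique within at least one alignment")


# Record IDs taken from the example alignments on this page
alignments = [
    ["Bfo-Panxα1", "Hca-Panxα1", "Mle-Panxα1"],
    ["Bfo-Panxα4", "Hca-Panxα4", "Mle-Panxα4"],
    ["Bfo-Panxα8", "Hca-Panxα8", "Mle-Panxα8"],
]
k = shortest_unique_prefix_length(alignments)
prefixes = sorted({rec_id[:k] for ids in alignments for rec_id in ids})
print(k, prefixes)
```

For these IDs a single character is enough, which matches the "B", "H", and "M" groups shown in the first example.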
Group pattern (optional). Passing in a positive integer uses a fixed-length prefix from each record ID to group sequences among alignments (see example 2). If the defining string is at the end of each sequence ID, pass in a negative number to specify a fixed-length suffix. Alternatively, passing in a regular expression allows very precise control of the groupings in cases where a simple prefix or suffix is insufficient.
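As a rough sketch of how the three kinds of group pattern differ, the hypothetical Python helper below derives a group key from a record ID. It mirrors the description above rather than AlignBuddy's source code; `group_key` and its handling of non-matching regular expressions (returning `None`) are assumptions made for illustration.

```python
import re


def group_key(record_id, pattern):
    """Derive a grouping key from a record ID (hypothetical helper).

    pattern > 0  -> fixed-length prefix
    pattern < 0  -> fixed-length suffix
    pattern str  -> first regular expression match
    """
    if isinstance(pattern, int):
        if pattern > 0:
            return record_id[:pattern]
        if pattern < 0:
            return record_id[pattern:]
        raise ValueError("An integer pattern must be non-zero")
    match = re.search(pattern, record_id)
    # Assumption: no match means the record has no key under this pattern
    return match.group(0) if match else None


print(group_key("Bfo-Panxα1", 3))                # prefix
print(group_key("Bfo-Panxα1", -2))               # suffix
print(group_key("Bfo-Panxα1", "[a-z]{2}-Panx"))  # regular expression
```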
Alignment name (optional). AlignBuddy annotates the position of each subsequence onto the final concatenated sequence, and this information is written to certain rich formats like GenBank and EMBL. By default the original record ID is used as the annotation, but this can be overridden with a sub-identifier (specified by integer or regular expression) if you prefer. This works much like the grouping pattern described above, but note that the order of the arguments matters: you cannot specify an alignment name without first specifying a grouping pattern.
3 62
Bfo-Panxα1 DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL--
Hca-Panxα1 --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS
Mle-Panxα1 DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--
3 68
Bfo-Panxα4 -----EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--
Hca-Panxα4 -------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG
Mle-Panxα4 GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---
3 61
Bfo-Panxα8 GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca-Panxα8 -DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle-Panxα8 ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Pass in zero arguments and AlignBuddy will detect the shortest possible identifier for each new concatenated sequence (in this case, "B", "H", and "M").
$: alb Panx_C-term.physr -cta
3 191
B DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
H --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
M DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Group records by the three-letter prefix found in each ID by passing in a positive integer as the first argument.
$: alb Panx_C-term.physr -cta 3
3 191
Bfo DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Use a regular expression to group records instead of a fixed-length prefix. Here, the two-letter species code is the unique component of the IDs that the groups are based on.
$: alb Panx_C-term.physr -cta "[a-z]{2}-Panx"
3 191
fo-Panx DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
ca-Panx --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
le-Panx DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
If the group pattern does not find a match in a given alignment, then gaps are filled in for that component of the concatenated alignment.
$: alb Panx_C-term.physr -cta ".*1|..."
6 191
Bfo-Panxα1 DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-----------------------------------------------------------------------------------------------------------------------------------
Hca-Panxα1 --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS---------------------------------------------------------------------------------------------------------------------------------
Mle-Panxα1 DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL-----------------------------------------------------------------------------------------------------------------------------------
Bfo -------------------------------------------------------------------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca ---------------------------------------------------------------------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle --------------------------------------------------------------GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
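The gap-filling behaviour can be illustrated with a small Python sketch. It is not AlignBuddy's code: `concatenate` is a hypothetical function, and the sequences are shortened toy fragments used only to keep the output readable. The group pattern is the same `".*1|..."` expression used in the command above.

```python
import re


def concatenate(alignments, key_func):
    """Concatenate alignments (dicts of record ID -> aligned sequence),
    padding with gaps wherever a group is absent (hypothetical helper)."""
    keys = []
    for aln in alignments:
        for rec_id in aln:
            key = key_func(rec_id)
            if key not in keys:
                keys.append(key)  # preserve first-seen order

    merged = {key: "" for key in keys}
    for aln in alignments:
        width = len(next(iter(aln.values())))  # all rows share one width
        by_key = {key_func(rec_id): seq for rec_id, seq in aln.items()}
        for key in keys:
            merged[key] += by_key.get(key, "-" * width)
    return merged


def key_func(rec_id):
    return re.search(".*1|...", rec_id).group(0)


# Toy fragments, not the full sequences from the example file
toy = [
    {"Bfo-Panxα1": "DPHY", "Hca-Panxα1": "--HY"},
    {"Bfo-Panxα4": "EIIQV", "Hca-Panxα4": "-LQVL"},
]
result = concatenate(toy, key_func)
for key, seq in result.items():
    print(key, seq)
```

IDs ending in "1" match `.*1` in full, so they form their own groups, while the remaining IDs fall back to the three-character alternative. Each group is therefore missing from at least one alignment and receives a run of gaps there.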
The location of each component of the concatenated alignment is stored when using this tool, and will be annotated as a feature if outputting to a rich format like GenBank or EMBL.
$: alb Panx_C-term.physr -cta 3 -o genbank
LOCUS Bfo 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Bfo
VERSION Bfo
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Bfo-Panxα1 1..62
Bfo-Panxα4 63..130
Bfo-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS Hca 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Hca
VERSION Hca
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Hca-Panxα1 1..62
Hca-Panxα4 63..130
Hca-Panxα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS Mle 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Mle
VERSION Mle
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Mle-Panxα1 1..62
Mle-Panxα4 63..130
Mle-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
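The feature coordinates in this output follow directly from the widths of the input alignments (62, 68, and 61 columns). A minimal Python sketch of that arithmetic, using a hypothetical `feature_spans` helper and 1-based inclusive coordinates as in the GenBank records above:

```python
def feature_spans(block_lengths):
    """Return 1-based, inclusive (start, end) spans for consecutive blocks."""
    spans = []
    start = 1
    for length in block_lengths:
        end = start + length - 1
        spans.append((start, end))
        start = end + 1  # next block begins right after this one
    return spans


spans = feature_spans([62, 68, 61])
print(spans)
```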
Note in the above example that each component is annotated with the original sequence ID. You can shorten this by passing in an integer as the second argument, in which case that number of trailing characters is used as the alignment name. Passing in a negative number takes the characters from the front of the sequence ID instead (this is the opposite of the Group Pattern argument).
$: alb Panx_C-term.physr -cta ".{3}" 6
LOCUS Bfo 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Bfo
VERSION Bfo
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS Hca 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Hca
VERSION Hca
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Panxα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS Mle 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Mle
VERSION Mle
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
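The sign convention for an integer alignment name is easy to mix up with the group pattern, so here is a small Python sketch of it. The helper `alignment_name` is hypothetical and only restates the rule described above; returning the whole ID for zero is an assumption, since that case is not covered here.

```python
def alignment_name(record_id, n):
    """Positive n keeps the last n characters, negative n keeps the
    first |n| characters (hypothetical helper)."""
    if n > 0:
        return record_id[-n:]
    if n < 0:
        return record_id[:-n]
    return record_id  # assumption: zero leaves the ID unchanged


print(alignment_name("Bfo-Panxα1", 6))
print(alignment_name("Bfo-Panxα1", -3))
```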
Alignment names can also be specified with a regular expression. If no match is found, then the name reverts to the whole record ID.
$: alb temp.del -cta 3 "Panxα[1-5]" -o gb
LOCUS Bfo 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Bfo
VERSION Bfo
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Bfo-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS Hca 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Hca
VERSION Hca
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Hca-Panxα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS Mle 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Mle
VERSION Mle
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Mle-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
If you want to get really fancy with your regular expressions, include parenthesized capture groups. Only the text matched within the parentheses is used, joined together, in the final names.
$: alb Panx_C-term.physr -cta "^(.).{3}([^0-9]+)" "(P)anx(α.*)"
LOCUS BPanxα 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION BPanxα
VERSION BPanxα
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Pα1 1..62
Pα4 63..130
Pα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS HPanxα 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION HPanxα
VERSION HPanxα
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Pα1 1..62
Pα4 63..130
Pα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS MPanxα 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION MPanxα
VERSION MPanxα
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Pα1 1..62
Pα4 63..130
Pα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
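The regular expression behaviour shown in the last two examples (fall back to the whole ID when nothing matches, and join capture groups when they are present) can be approximated with Python's `re` module. This is a sketch of the behaviour as described on this page, not AlignBuddy's implementation, and `name_from_regex` is a hypothetical name.

```python
import re


def name_from_regex(record_id, pattern):
    """Build a name from a regular expression (hypothetical helper)."""
    match = re.search(pattern, record_id)
    if match is None:
        return record_id  # no match: revert to the whole record ID
    if match.groups():
        # Capture groups present: keep only their contents, joined in order
        return "".join(g for g in match.groups() if g is not None)
    return match.group(0)


print(name_from_regex("Bfo-Panxα4", "Panxα[1-5]"))
print(name_from_regex("Bfo-Panxα8", "Panxα[1-5]"))
print(name_from_regex("Bfo-Panxα1", "(P)anx(α.*)"))
print(name_from_regex("Bfo-Panxα1", "^(.).{3}([^0-9]+)"))
```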