forked from mptrsen/mobilome
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathREADME
More file actions
271 lines (159 loc) · 19.8 KB
/
README
File metadata and controls
271 lines (159 loc) · 19.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
This set of Perl utilities is designed to extract transposable element (TE) sequences and quantitative data from the RepeatMasker .out files.
INSTALL NOTES
All codes are in Perl, and therefore no installation is necessary other that copying the files on your disk. If you run Windows, it may be necessary to install Perl before being able to run the code; see e.g.
http://learn.perl.org/installing/windows.html
for a way to do it.
If you want to make one code executable by itself, just change its permissions (on Mac OS /Linux/ Unix):
chmod u+x build_dictionary.pl
chmod u+x one_code_to_find_them_all.pl
If you want to invoke these codes from any directory on your disk, you should either edit your $PATH variable and add the directry where the codes are copied into it, or copy directly the codes in, e.g., /usr/local/bin/
RUNNING
### build_dictionary.pl ###
This code infers correspondencies between LTR subparts of LTR retrotranposons and internal subparts, based on name similarity. All input files (which have to be RepeatMasker output files) are scanned for their "matching repeat" column, and all names are then assembled into pairs of internal and LTR subparts if they are sufficiently close. It outputs a file which can then be used as input by once_code_to_find_them.pl in order to reconstitute complete copies of LTR retrotransposon.
Usage:
./build_dictionary.pl --rm filename [--unknown] [--fuzzy] > output
Usage is reminded to you if you type
./build_dictionary.pl --help
*** Mandatory options ***
--rm filename
Indicates the code to run on RepeatMasker output file filename. If filename is a directory, all .out files inside this directory and all sub-directories (recursively) will be scanned.
*** Facultative options ***
--unknown
Indicates to the code that the RepeatMasker output file passed in the --rm option may contain transposons of "Unkown" class/family, which is the case in particular when a local library was used in running RepeatMasker. If this option is passed, these "Unknown" elements will be included in the analysis, if not, they are discarded.
--fuzzy
Indicates to the code to be less stringent in the criterions used to match names between corresponding subparts. May be useful to reconstruct all internal-LTR pairs in the data, but with a much higher proportion of false positives, i.e. pairs that should not match but are reported.
*** Output format ***
Lines of the form:
BLASTOPIA_I BLASTOPIA_LTR
Internal subpart is always written first, LTR subpart second. When no match was found for an element, it is reported alone on its line. On the contrary, when multiple matches were found, they are reported separated by ":", as follows:
HERV-Fc1-int HERV-Fc1_LTR1:HERV-Fc1_LTR2:HERV-Fc1_LTR3
There is always one line per internal element; if multiple internal subparts match multiple LTR subparts, there will be one line per internal subpart, and matching LTR subparts will be separated by ":". Columns are always separated by tabulations.
### one_code_to_find_them_all.pl ###
Usage:
./one_code_to_find_them_all.pl --rm filename --ltr file_dictionary [--length file_length] [--strict] [--choice] [--dry-run] [--fasta file_fasta [--flanking X]] [--insert Y]
Usage is reminded to you if you type
./one_code_to_find_them_all.pl --help
*** Mandatory options ***
--rm filename
Indicates the code to run on RepeatMasker output file filename. If filename is a directory, all .out files inside this directory and all sub-directories (recursively) will be scanned.
--ltr file_dictionary
Indicates the code to run with the list of internal/LTR pairs indicated in the file file_dictionary. This file must have a format such as the one given in output of build_dictionary.pl .
*** Facultative options ***
--length file_length
If a file is passed, the element lengths given in the file file_length will be used as reference length for the elements, in the computation of the conservation of the elements. In this case, file_length must have the following format:
BLASTOPIA_I 4481
BLASTOPIA_LTR 275
Columns are always separated by tabulations. Only the elements present in file_length will be studied, so this option is a good way to restrict the code to only some elements of interest.
If this option is not given to the code, it will first scan all RepeatMasker output files to create this file, either in the same directory as the file under study, or, if a directory is passed in the --rm option, in the directory right above it. E.g. a ./one_code_to_find_them.pl --rm Drosophila/ will create a file Drosophila.length in the directory containing Drosophila/ . In this case, all elements will be studied.
The reference length of each element is computed as follows: for each RM line, the corresponding reference element length is computed as the sum of the last position of the element on the reference and the total number of remaining bases in the reference.
In the case of a "+" oriented element, this means adding columns 12 and columns 13 (numbering starting at 0) of the RM line; in the case of a "C" oriented element, columns 12 and 11 are summed.
Rarely, due to the way RM detects elements, this computation does not give the same result for all occurences of the same element (by example when multiple consensus are used for different subparts of the same element, such as for some LINEs); then, the reference length written in the .length file is the one most commonly found for each element.
--strict
This option activates the 80-80 rule, i.e. only copies with more than 80% identity to the reference and more than 80 bp long are reported. If not given, all copies passing the code tests are reported (see manuscript for further details).
--choice
This option puts the code in interactive mode. In this case, every time that multiple solutions (i.e. multiple ways of assembling together transposons fragments) are found in a run, the code will ask the user for a choice to be made. Choices are presented in this way:
Cannot decide between
Solution 0
###1499/399 14.619 18.387 3.399 chr2L 22823462 22824083 674 C DMCR1A LINE/CR1 (343) 4127 910 11977/11978 2
1499 13.2 23.5 0.2 chr2L 22823462 22823915 560 C DMCR1A LINE/CR1 (343) 4127 3568 11977 1
399 19.5 0.8 14.4 chr2L 22823952 22824083 114 C DMCR1A LINE/CR1 (3447) 1023 910 11978 1
Solution 1
###1499/2326 16.218 15.653 0.804 chr2L 22823462 22824648 1156 C DMCR1A LINE/CR1 (343) 4127 157 11977/11978 2
1499 13.2 23.5 0.2 chr2L 22823462 22823915 560 C DMCR1A LINE/CR1 (343) 4127 3568 11977 1
2326 18.7 9.2 1.3 chr2L 22824097 22824648 596 C DMCR1A LINE/CR1 (3718) 752 157 11978 1
Solution 2
###1499 13.2 23.5 0.2 chr2L 22823462 22823915 560 C DMCR1A LINE/CR1 (343) 4127 3568 11977 1
Context (type 'm' to see a broader context if necessary) :
1499 13.2 23.5 0.2 chr2L 22823462 22823915 560 C DMCR1A LINE/CR1 (343) 4127 3568 11977 1
399 19.5 0.8 14.4 chr2L 22823952 22824083 114 C DMCR1A LINE/CR1 (3447) 1023 910 11978 1
2326 18.7 9.2 1.3 chr2L 22824097 22824648 596 C DMCR1A LINE/CR1 (3718) 752 157 11978 1
455 7.1 0.0 1.4 chr2L 22844530 22844600 70 C DMCR1A LINE/CR1 (3209) 1261 1192 12000 1
530 20.3 0.8 9.3 chr2L 22844895 22845111 276 C DMCR1A LINE/CR1 (3285) 1185 910 12000 1
2314 18.9 9.2 1.3 chr2L 22845125 22845676 596 C DMCR1A LINE/CR1 (3718) 752 157 12000 1
1767 4.7 15.8 0.6 chr2L 22891580 22891939 415 + DMCR1A LINE/CR1 3867 4281 (189) 12055 1
Which solution do you prefer?
Please type corresponding number
Multiple solutions are presented (here 3, numbered 0,1 and 2). Each solution is written with a line starting with ### and showing the final element assembled, followed by the lines corresponding to the fragments assembled. The last solution always consist in the element under consideration alone. Below the proposed solutions, the context, ie all surrounding RepeatMasker lines containing fragments of the considered element (or its internal/LTR coutnerpart) are written, in order to help the user decide which combination is the best one. If you want to see a broader context, type "m", and an extract from the RepeatMasker file starting at the position of the element under consideration, and filtered for non transposable element lines, will be given instead (if you type "m" multiple times, the shown contextx grows each time).
The code always assemble elements by pairs first, i.e., to assemble four ambiguous fragments A, B, C and D together, one has to decide first to assemble A and B, and then assemble A-B -- which is now considered as one element -- with C. It may happen that the final assembly -- here A-B-C and D -- is not ambiguous, and the code will automatically report it without asking the user; if an ambiguity still exists, the user will be asked upon.
In order to minimize the number of choices to be made, the code uses RM block IDs (RM outptu last column, see RM help) to disambiguate cases; an ambiguous case where one combination at least would match two fragments with the same RM Block ID is automatically chosen. When this happens, it appears on the terminal such as that:
Choice bypassed -- Assembling fragments based on RM Block ID
###3599/8904 2.468 4.203 0.000 chr2L 22768720 22770384 1750 + DMCR1A LINE/CR1 1292 4470 (0) 11926/11926 2
3599 2.1 6.9 0.0 chr2L 22768720 22769369 695 + DMCR1A LINE/CR1 1292 1986 (2484) 11926 1
8904 2.7 2.5 0.0 chr2L 22769356 22770384 1055 + DMCR1A LINE/CR1 3416 4470 (0) 11926 1
There may be many choices in a single chromosome, and they must all be done before the code stops; see --dry-run below to estimate the number of choices beforehand. To quit choice mode in the middle of a run, type "q": in this case, the last choice asked and all further ones will be considered as 0.
The first one (solution 0) is chosen by default when the code runs without the --choice option. Care must be taken: running the code without --choice, and with --choice, but choosing 0 each time, will not give exactly the same results: in the second case RM Block IDs will have been used to assemble some fragments, while they are ignored in the first case. If you want to use RM Block IDs everywhere, but do not want to have to choose anywhere, one solution is to run the code in choice mode, and to type "q" at the first choice, answering automatically 0 to all others, but using RM block IDs in the process.
--unknown
Indicates to the code that the RepeatMasker output file passed in the --rm option may contain transposons of "Unknown" or "Unspecified" class/family, which is the case in particular when a non annotated local library was used in running RepeatMasker. If this option is passed, these "Unknown" elements will be included in the analysis, if not, they are discarded.
--fasta
This option enables the code to retrieve fasta sequences corresponding to the TE copies generated in output. Fasta sequences are identified in the RepeatMasker output by column 4, the scaffold name, which should be identical to the sequence name in the fasta files (e.g. >chr2L). This option can be used in various ways:
--fasta looks for .fa files next to the .rm files passed to the --rm argument (it can search recursively into directories).
--fasta filename only look for fasta sequences in filename
--fasta directory search recursively for all .fa files in the directory and its subdirectories.
This option outputs one file per contig in the RepeatMasker outptut. When copies are assembled form fragments, the fasta sequence given in output is the concatenation of the fragments, whose coordinates are reported in the sequence name.
--flanking X
where X must be a positive number (it is replaced by 0 if a negative number is passed, thus having no effect).
This option can only be used in conjonction with --fasta. When used, the fasta sequences reported also contains the X bases flanking the fasta sequence on the side of each TE fragment. Then, a single fragment copy with 100 bp flanking will be 200bp longer than without the flanking sequences, but a two-fragment copy in the same circonstances will be 400bp longer -- supposing there is no overlap in the flanking sequences.
--insert Y
where Y must be a positive number (the option has no effect if Y is negative or if no value is passed)
This option changes the way the fragments are asembled. Without it, two fragments are assembled if the distance separing their furthest extremities are not separated by more than twice the reference length of the element (see manuscript for other conditions). With this option enabled, the distance between the closest extremities of the fragments must not be separated by more than Y bases, allowing to tune the the fragment assembly process, and to filter nested elements by the size of the insertion. When Y=0, the fragments have to be in contact (or overlapping) in order to be assembled as a copy.
--dry-run
This option do a complete run of the code on the given data, but do not write any outfiles at the end. It is mainly useful for knowing beforehand the number of choices that would have been made if the --choice option is activated.
*** Output format ***
For each RepeatMasker input file filename, this code outputs, in the same directory as filename:
- filename.log.txt
This file contains a summary of all operations done during the execution of the code, in particular everything that has been written in the terminal. All informations are also reported in other files but the log may be useful in case of unexpected behavior.
Moreover, for each scaffold "scaf", i.e. different values in column "Query" of filename, there are 4 output files, in the same directory as filename:
- filename_scaf.transposons.csv
- filename_scaf.ltr.csv
- filename_scaf.elem_sorted.csv
- filename_scaf.copynumber.csv
The first 2 files (.ltr.csv and .transposons.csv) report the copies found, respectively for non-LTR retrotransposons and for LTR retrotransposons. The first line of each file indicates the columns content, in order:
Score the score reported by RepeatMasker in the input file.
%_Div the percentage of divergence of the copy from the reference, computed by averaging the fragments divergences
%_Del similar to %_Div for deletion
%_Ins similar to %_Div for insertions
Query scaffold on which the copy was found
Beg. start position, on the scaffold, of the copy
End end position, on the scaffold, of the copy
Length the element length in the genomic sequence
Sense + if the transposon is inserted in 5'->3', C for a 3'->5' insertion
Element the transposable element name
Family the transposable element family or class
Pos_Repeat_Beg the start of the actual sequence relative to the reference
Pos_Repeat_End the end of the actual sequence relative to the reference
Pos_Repeat_End the count of bases in the reference located after the end of the matching actual sequence
ID the RepeatMasker fragment ID
Num_Assembled the number of fragments merged or assembled into this copy
%_of_Ref the fraction of the reference covered by matching bases in the copy (solo-LTR are a special case, see below)
Lines starting with ### correspond to merged copies (for non-LTR) or merged and assembled copies (for LTR-retrotransposon). The corresponding fragments are given in the lines below them, with their RepeatMasker line (where column 7 has been modified to contain the element length instead of the remaining scaffold size). Sometimes a single fragment could not be merged/assembled with antyhing, and is considered a copy by itself, therefore appearing alone.
**Example for transposons:
*A well conserved PROTOP element:
###5090/23954 2.103 0.614 0.414 chr2L 19737411 19740819 3425 + PROTOP DNA/P 1 4480 (0) 9628/9628 2 0.765
5090 2.6 0.2 0.0 chr2L 19737411 19737996 587 + PROTOP DNA/P 1 587 (3893) 9628 1
23954 2.0 0.7 0.5 chr2L 19737988 19740819 2838 + PROTOP DNA/P 1643 4480 (0) 9628 1
*A PROTOP fragment considered as a (poor quality) copy itself:
###499 11.5 3.1 0.0 chr2L 20155636 20155731 99 + PROTOP DNA/P 3521 3619 (861) 9822 1 0.022
**Example for LTR:
*A perfect, fully assembled Copia element:
###2442/41712/2442 0.089 0.000 0.000 chr2L 2294097 2299243 5145 C Copia_LTR LTR/Copia NA NA NA 845/846/847 3 1.000
2442 0.0 0.0 0.0 chr2L 2294097 2294372 276 C Copia_LTR LTR/Copia (0) 276 1 845 1
41712 0.1 0.0 0.0 chr2L 2294373 2298967 4593 C Copia_I LTR/Copia (0) 4593 1 846 1
2442 0.0 0.0 0.0 chr2L 2298968 2299243 276 C Copia_LTR LTR/Copia (0) 276 1 847 1
*A solo-LTR:
###1731 1.0 0.0 0.0 chr2L 21560590 21560787 198 C Copia2_LTR_DM LTR/Copia (0) 198 1 10726 1 No_ref_available
For LTR retrotransposons, the reference length (and then the %_of_Ref) are computed as the reference lenght of the internal part plus twice the length of the LTR part. When a LTR is detected alone and cannot be assembled with a matching internal part, it is considered as a solo-LTR, and the reference length (and then the %_of_Ref) is only the LTR part reference length, in order not to bias the %_of_Ref value.
The .elem_sorted.csv file contains the same information, but elements are sorted by position and not by family, allowing to study them in their genomic context.
The .copynumber.csv contains a summary of the number of fragments and copies found for each element. The columns are the following:
Family the transposable element family or class
Element the transposable element name (internal part only for LTR retrotransposons)
Length the reference length of the element, as written in file_length. For LTR-retrotransposon, this is one internal part plus two LTR parts lengths. NA means that no reference length could be computed (usually because of a LTR or internal part lacking for retrotransposon LTR)
Fragments the number of fragments, i.e. RM file lines, that have been found for the element
Copies the number of assembled copies of the element.
Solo_LTR for LTR retrotransposons, this is the number of copies which are solo-LTR (i.e, solo-LTR are counted both in the "Copies" column and in this one). For non LTR retrotransposons elements, this is NA.
Total_Bp the total number of bases covered by the element.
Cover the scaffold cover, given in percentage, i.e. this is Total_Bp/Scaffold_size*100
For each class, a summary is given after the list of all elements, starting with ###. Global summaries are also given at the end of the file for DNA, LINE, SINE and LTR elements, starting with ######. Summaries for all transposable elements (#########Type:EVERYTHING_TE) and for all non transposable elements masked sequences (Low_complexity, Satellite, Simple_repeat, Unknown) are given at the end of the file.
If you find any bug, please report it to:
emmanuelle.lerat@univ-lyon1.fr
marc.bailly-bechet@univ-lyon1.fr
File last modified in March 2014.