OneCodeToFindThemAll/README at master · AdriTara/OneCodeToFindThemAll · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
This set of Perl utilities is designed to extract transposable element (TE) sequences and quantitative data from the RepeatMasker .out files.

INSTALL NOTES

All codes are in Perl, and therefore no installation is necessary other that copying the files on your disk. If you run Windows, it may be necessary to install Perl before being able to run the code; see e.g.

http://learn.perl.org/installing/windows.html

for a way to do it.

If you want to make one code executable by itself, just change its permissions (on Mac OS /Linux/ Unix):

chmod u+x build_dictionary.pl
chmod u+x one_code_to_find_them_all.pl

If you want to invoke these codes from any directory on your disk, you should either edit your $PATH variable and add the directry where the codes are copied into it, or copy directly the codes in, e.g.,  /usr/local/bin/

RUNNING

### build_dictionary.pl ###

This code infers correspondencies between LTR subparts of LTR retrotranposons and internal subparts, based on name similarity. All input files (which have to be RepeatMasker output files) are scanned for their "matching repeat" column, and all names are then assembled into pairs of internal and LTR subparts if they are sufficiently close. It outputs a file which can then be used as input by once_code_to_find_them.pl in order to reconstitute complete copies of LTR retrotransposon.

Usage:

./build_dictionary.pl --rm filename [--unknown] [--fuzzy] > output

Usage is reminded to you if you type
./build_dictionary.pl --help


*** Mandatory options ***

--rm filename

Indicates the code to run on RepeatMasker output file filename. If filename is a directory, all .out files inside this directory and all sub-directories (recursively) will be scanned.


*** Facultative options ***

--unknown

Indicates to the code that the RepeatMasker output file passed in the --rm option may contain transposons of "Unkown" class/family, which is the case in particular when a local library was used in running RepeatMasker. If this option is passed, these "Unknown" elements will be included in the analysis, if not, they are discarded.

--fuzzy

Indicates to the code to be less stringent in the criterions used to match names between corresponding subparts. May be useful to reconstruct all internal-LTR pairs in the data, but with a much higher proportion of false positives, i.e. pairs that should not match but are reported.


*** Output format ***

Lines of the form:
BLASTOPIA_I     BLASTOPIA_LTR

Internal subpart is always written first, LTR subpart second. When no match was found for an element, it is reported alone on its line. On the contrary, when multiple matches were found, they are reported  separated by ":", as follows:

HERV-Fc1-int    HERV-Fc1_LTR1:HERV-Fc1_LTR2:HERV-Fc1_LTR3

There is always one line per internal element; if multiple internal subparts match multiple LTR subparts, there will be one line per internal subpart, and matching LTR subparts will be separated by ":". Columns are always separated by tabulations.


### one_code_to_find_them_all.pl ###

Usage:
./one_code_to_find_them_all.pl --rm filename --ltr file_dictionary [--length file_length] [--strict] [--choice] [--dry-run]  [--fasta file_fasta [--flanking X]] [--insert Y]

Usage is reminded to you if you type
./one_code_to_find_them_all.pl --help


*** Mandatory options ***

--rm  filename

Indicates the code to run on RepeatMasker output file filename. If filename is a directory, all .out files inside this directory and all sub-directories (recursively) will be scanned.

--ltr file_dictionary

Indicates the code to run with the list of internal/LTR pairs indicated in the file file_dictionary. This file must have a format such as the one given in output of build_dictionary.pl .


*** Facultative options ***

--length file_length

If a file is passed, the element lengths given in the file file_length will be used as reference length for the elements, in the computation of the conservation of the elements. In this case, file_length must have the following format:

BLASTOPIA_I     4481
BLASTOPIA_LTR   275

Columns are always separated by tabulations. Only the elements present in file_length will be studied, so this option is a good way to restrict the code to only some elements of interest.

If this option is not given to the code, it will first scan all RepeatMasker output files to create this file, either in the same directory as the file under study, or, if a directory is passed in the --rm option, in the directory right above it. E.g. a ./one_code_to_find_them.pl --rm Drosophila/ will create a file Drosophila.length in the directory containing Drosophila/ . In this case, all elements will be studied.

The reference length of each element is computed as follows: for each RM line, the corresponding reference element length is computed as the sum of the last position of the element on the reference and the total number of remaining bases in the reference.
In the case of a "+" oriented element, this means adding columns 12 and columns 13 (numbering starting at 0) of the RM line; in the case of a "C" oriented element, columns 12 and 11 are summed.

Rarely, due to the way RM detects elements, this computation does not give the same result for all occurences of the same element (by example when multiple consensus are used for different subparts of the same element, such as for some LINEs); then, the reference length written in the .length file is the one most commonly found for each element.

--strict

This option activates the 80-80 rule, i.e. only copies with more than 80% identity to the reference and more than 80 bp long are reported. If not given, all copies passing the code tests are reported (see manuscript for further details).

--choice

This option puts the code in interactive mode. In this case, every time that multiple solutions (i.e. multiple ways of assembling together transposons fragments) are found in a run, the code will ask the user for a choice to be made. Choices are presented in this way:


Cannot decide between

Solution 0
###1499/399     14.619  18.387  3.399   chr2L   22823462        22824083        674     C       DMCR1A  LINE/CR1 (343)    4127    910     11977/11978     2
1499    13.2    23.5    0.2     chr2L   22823462        22823915        560     C       DMCR1A  LINE/CR1        (343)     4127    3568    11977   1
399     19.5    0.8     14.4    chr2L   22823952        22824083        114     C       DMCR1A  LINE/CR1        (3447)    1023    910     11978   1

Solution 1
###1499/2326    16.218  15.653  0.804   chr2L   22823462        22824648        1156    C       DMCR1A  LINE/CR1 (343)    4127    157     11977/11978     2
1499    13.2    23.5    0.2     chr2L   22823462        22823915        560     C       DMCR1A  LINE/CR1        (343)     4127    3568    11977   1
2326    18.7    9.2     1.3     chr2L   22824097        22824648        596     C       DMCR1A  LINE/CR1        (3718)    752     157     11978   1

Solution 2
###1499 13.2    23.5    0.2     chr2L   22823462        22823915        560     C       DMCR1A  LINE/CR1        (343)     4127    3568    11977   1

Context (type 'm' to see a broader context if necessary) :
1499    13.2    23.5    0.2     chr2L   22823462        22823915        560     C       DMCR1A  LINE/CR1        (343)     4127    3568    11977   1
399     19.5    0.8     14.4    chr2L   22823952        22824083        114     C       DMCR1A  LINE/CR1        (3447)    1023    910     11978   1
2326    18.7    9.2     1.3     chr2L   22824097        22824648        596     C       DMCR1A  LINE/CR1        (3718)    752     157     11978   1
455     7.1     0.0     1.4     chr2L   22844530        22844600        70      C       DMCR1A  LINE/CR1        (3209)    1261    1192    12000   1
530     20.3    0.8     9.3     chr2L   22844895        22845111        276     C       DMCR1A  LINE/CR1        (3285)    1185    910     12000   1
2314    18.9    9.2     1.3     chr2L   22845125        22845676        596     C       DMCR1A  LINE/CR1        (3718)    752     157     12000   1
1767    4.7     15.8    0.6     chr2L   22891580        22891939        415     +       DMCR1A  LINE/CR1        3867      4281    (189)   12055   1

Which solution do you prefer?
Please type corresponding number


Multiple solutions are presented (here 3, numbered 0,1 and 2). Each solution is written with a line starting with ### and showing the final element assembled, followed by the lines corresponding to the fragments assembled. The last solution always consist in the element under consideration alone. Below the proposed solutions, the context, ie all surrounding RepeatMasker lines containing fragments of the considered element (or its internal/LTR coutnerpart) are written, in order to help the user decide which combination is the best one. If you want to see a broader context, type "m", and an extract from the RepeatMasker file starting at the position of the element under consideration, and filtered for non transposable element lines, will be given instead (if you type "m" multiple times, the shown contextx grows each time).

The code always assemble elements by pairs first, i.e., to assemble four ambiguous fragments  A, B, C and D together, one has to decide first to assemble A and B, and then assemble A-B -- which is now considered as one element -- with C. It may happen that the final assembly -- here A-B-C and D -- is not ambiguous, and the code will automatically report it without asking the user; if an ambiguity still exists, the user will be asked upon.

In order to minimize the number of choices to be made, the code uses RM block IDs (RM outptu last column, see RM help) to disambiguate cases; an ambiguous case where one combination at least would match two fragments with the same RM Block ID is automatically chosen. When this happens, it appears on the terminal such as that:


Choice bypassed -- Assembling fragments based on RM Block ID
###3599/8904    2.468   4.203   0.000   chr2L   22768720        22770384        1750    +       DMCR1A  LINE/CR1 1292     4470    (0)     11926/11926     2
3599    2.1     6.9     0.0     chr2L   22768720        22769369        695     +       DMCR1A  LINE/CR1        1292    1986    (2484)  11926   1
8904    2.7     2.5     0.0     chr2L   22769356        22770384        1055    +       DMCR1A  LINE/CR1        3416   4470    (0)     11926   1


There may be many choices in a single chromosome, and they must all be done before the code stops; see --dry-run below to estimate the number of choices beforehand. To quit choice mode in the middle of a run, type "q": in this case, the last choice asked and all further ones will be considered as 0.

The first one (solution 0) is chosen by default when the code runs without the --choice option. Care must be taken: running the code without --choice, and with --choice, but choosing 0 each time, will not give exactly the same results: in the second case RM Block IDs will have been used to assemble some fragments, while they are ignored in the first case. If you want to use RM Block IDs everywhere, but do not want to have to choose anywhere, one solution is to run the code in choice mode, and to type "q" at the first choice, answering automatically 0 to all others, but using RM block IDs in the process.

--unknown

Indicates to the code that the RepeatMasker output file passed in the --rm option may contain transposons of "Unknown" or "Unspecified" class/family, which is the case in particular when a non annotated local library was used in running RepeatMasker. If this option is passed, these "Unknown" elements will be included in the analysis, if not, they are discarded.

--fasta

This option enables the code to retrieve fasta sequences corresponding to the TE copies generated in output. Fasta sequences are identified in the RepeatMasker output by column 4, the scaffold name, which should be identical to the sequence name in the fasta files (e.g. >chr2L). This option can be used in various ways:
--fasta looks for .fa files next to the .rm files passed to the --rm argument (it can search recursively into directories).
--fasta filename only look for fasta sequences in filename
--fasta directory search recursively for all .fa files in the directory and its subdirectories.

This option outputs one file per contig in the RepeatMasker outptut. When copies are assembled form fragments, the fasta sequence given in output is the concatenation of the fragments, whose coordinates are reported in the sequence name.

--flanking X

where X must be a positive number (it is replaced by 0 if a negative number is passed, thus having no effect).
This option can only be used in conjonction with --fasta. When used, the fasta sequences reported also contains the X bases flanking the fasta sequence on the side of each TE fragment. Then, a single fragment copy with 100 bp flanking will be 200bp longer than without the flanking sequences, but a two-fragment copy in the same circonstances will be 400bp longer -- supposing there is no overlap in the flanking sequences.

--insert Y

where Y must be a positive number (the option has no effect if Y is negative or if no value is passed)
This option changes the way the fragments are asembled. Without it, two fragments are assembled if the distance separing their furthest extremities are not separated by more than twice the reference length of the element (see manuscript for other conditions). With this option enabled, the distance between the closest extremities of the fragments must not be separated by more than Y bases, allowing to tune the the fragment assembly process, and to filter nested elements by the size of the insertion. When Y=0, the fragments have to be in contact (or overlapping) in order to be assembled as a copy.

--dry-run

This option do a complete run of the code on the given data, but do not write any outfiles at the end. It is mainly useful for knowing beforehand the number of choices that would have been made if the --choice option is activated.


*** Output format ***

For each RepeatMasker input file filename, this code outputs, in the same directory as filename:
- filename.log.txt

This file contains a summary of all operations done during the execution of the code, in particular everything that has been written in the terminal. All informations are also reported in other files but the log may be useful in case of unexpected behavior.

Moreover, for each scaffold "scaf", i.e. different values in column "Query" of filename, there are 4 output files, in the same directory as filename:

- filename_scaf.transposons.csv
- filename_scaf.ltr.csv
- filename_scaf.elem_sorted.csv
- filename_scaf.copynumber.csv


The first 2 files (.ltr.csv and .transposons.csv) report the copies found, respectively for non-LTR retrotransposons and for LTR retrotransposons. The first line of each file indicates the columns content, in order:

Score			the score reported by RepeatMasker in the input file.
%_Div			the percentage of divergence of the copy from the reference, computed by averaging the fragments divergences
%_Del			similar to %_Div for deletion
%_Ins			similar to %_Div for insertions
Query			scaffold on which the copy was found
Beg.				start position, on the scaffold, of the copy
End				end position, on the scaffold, of the copy
Length			the element length in the genomic sequence
Sense			+ if the transposon is inserted in 5'->3', C for a 3'->5' insertion
Element			the transposable element name
Family			the transposable element family or class
Pos_Repeat_Beg	the start of the actual sequence relative to the reference
Pos_Repeat_End	the end of the actual sequence relative to the reference
Pos_Repeat_End	the count of bases in the reference located after the end of the matching actual sequence
ID				the RepeatMasker fragment ID
Num_Assembled	the number of fragments merged or assembled into this copy
%_of_Ref		the fraction of the reference covered by matching bases in the copy (solo-LTR are a special case, see below)

Lines starting with ### correspond to merged copies (for non-LTR) or merged and assembled copies (for LTR-retrotransposon). The corresponding fragments are given in the lines below them, with their RepeatMasker line (where column 7 has been modified to contain the element length instead of the remaining scaffold size). Sometimes a single fragment could not be merged/assembled with antyhing, and is considered a copy by itself, therefore appearing alone.


**Example for transposons:

*A well conserved PROTOP element:

###5090/23954   2.103   0.614   0.414   chr2L   19737411        19740819        3425    +       PROTOP  DNA/P   1       4480    (0)     9628/9628     2       0.765
5090    2.6     0.2     0.0     chr2L   19737411        19737996        587     +       PROTOP  DNA/P   1       587     (3893)  9628    1
23954   2.0     0.7     0.5     chr2L   19737988        19740819        2838    +       PROTOP  DNA/P   1643    4480    (0)     9628    1

*A PROTOP fragment considered as a (poor quality) copy itself:

###499  11.5    3.1     0.0     chr2L   20155636        20155731        99      +       PROTOP  DNA/P   3521    3619    (861)   9822    1    0.022


**Example for LTR:

*A perfect, fully assembled Copia element:

###2442/41712/2442      0.089   0.000   0.000   chr2L   2294097 2299243 5145    C       Copia_LTR       LTR/Copia       NA      NA      NA   845/846/847      3       1.000
2442    0.0     0.0     0.0     chr2L   2294097 2294372 276     C       Copia_LTR       LTR/Copia       (0)     276     1       845     1
41712   0.1     0.0     0.0     chr2L   2294373 2298967 4593    C       Copia_I LTR/Copia       (0)     4593    1       846     1
2442    0.0     0.0     0.0     chr2L   2298968 2299243 276     C       Copia_LTR       LTR/Copia       (0)     276     1       847     1

*A solo-LTR:

###1731 1.0     0.0     0.0     chr2L   21560590        21560787        198     C       Copia2_LTR_DM   LTR/Copia       (0)     198     1    10726    1       No_ref_available

For LTR retrotransposons, the reference length (and then the %_of_Ref) are computed as the reference lenght of the internal part plus twice the length of the LTR part. When a LTR is detected alone and cannot be assembled with a matching internal part, it is considered as a solo-LTR, and the reference length  (and then the %_of_Ref) is only the LTR part reference length, in order not to bias the %_of_Ref value.


The .elem_sorted.csv file contains the same information, but elements are sorted by position and not by family, allowing to study them in their genomic context.

The .copynumber.csv contains a summary of the number of fragments and copies found for each element. The columns are the following:

Family			the transposable element family or class
Element			the transposable element name (internal part only for LTR retrotransposons)
Length			the reference length of the element, as written in file_length. For LTR-retrotransposon, this is one internal part plus two LTR parts lengths. NA means that no reference length could be computed (usually because of a LTR or internal part lacking for retrotransposon LTR)
Fragments		the number of fragments, i.e. RM file lines, that have been found for the element
Copies			the number of assembled copies of the element.
Solo_LTR		for LTR retrotransposons, this is the number of copies which are solo-LTR (i.e, solo-LTR are counted both in the "Copies" column and in this one). For non LTR retrotransposons elements, this is NA.
Total_Bp			the total number of bases covered by the element.
Cover			the scaffold cover, given in percentage, i.e. this is Total_Bp/Scaffold_size*100

For each class, a summary is given after the list of all elements, starting with ###. Global summaries are also given at the end of the file for DNA, LINE, SINE and LTR elements, starting with ######. Summaries for all transposable elements (#########Type:EVERYTHING_TE) and for all non transposable elements masked sequences (Low_complexity, Satellite, Simple_repeat, Unknown) are given at the end of the file.


If you find any bug, please report it to:
emmanuelle.lerat@univ-lyon1.fr
marc.bailly-bechet@univ-lyon1.fr

File last modified in March 2014.