- The stLFR2Supernova is a pipeline to de novo assemble the stLFR raw reads using Supernova Assembler.
- stLFR raw reads refer to the raw reads generated by single-tube Long Fragment Read (stLFR) platform from BGI [1].
- Supernova Assembler refers to the de-novo software from 10X Genomes [2].
git clone --recursive https://github.com/BGI-Qingdao/stlfr2supernova_pipeline.git YOUR-INSTALL-DIR
or
git clone https://github.com/BGI-Qingdao/stlfr2supernova_pipeline.git YOUR-INSTALL-DIR
cd YOUR-INSTALL-DIR
git submodule init
git submodule update
Written by Perl and Shell languages. Unnecessary to do extra compilation or installation. No other dependencies.
├── README.md
├── bin
│ ├── SOAPfilter_v2.2 # filters duplication & adaptors.
│ ├── split_barcode.pl # splits raw reads, get barcode sequences, and convert stLFR raw reads to stLFR reads.
│ ├── merge_barcodes.pl # randomly merges each 8 stLFR barcodes to 1 10X-format barcode.
│ └── fake_10x.pl # converts stLFR reads to 10X reads.
├── data
│ ├── barcode.list # barcode list of stLFR.
│ └── lane.lst # config for SOAPfilter.
├── profile # default profile.
├── run.sh # 1-step executable script, which calls all four substeps 0 ~ 3.
├── step_0_split_barode.sh # calls split_barcode.pl and generate split_reads.*.fq.gz & barcode_freq.txt.
├── step_1_filter.sh # calls SOAPfilter_v2.2 and generate split_reads.*.fq.gz.clean.gz.
├── step_2_fake_10X_data.sh # calls merge_barcodes.pl & fake_10x.pl and generate merge.txt & read-R*_si-TTCACGCG_lane-001-chunk-001.fastq.gz.
└── step_3_run_supernova.sh # calls Supernova Assmebler.
- 1st, create a new work folder
> mkdir YourProjectRoot
- 2nd, copy the default profile into your work folder
> cd YourProjectRoot
> cp YOUR-INSTALL-DIR/profile ./your_own_profile
**Do not change the filename. Always use "profile".**
- 3rd, edit your_own_profile
> vi ./your_own_profile
-
4th, run the 1-step executable script with your profile or run it step by step
- 4.1 run the 1-step executable script
> YOUR-INSTALL-DIR/run.sh # make sure your run this command in YourProjectRoot
-
- 4.2 run the pipeline step by step or just run what you want
> YOUR-INSTALL-DIR/step_x_xxxx.sh # make sure your run this command in YourProjectRoot
Do not change the filename. Always use "profile".
Make sure the profile file is in YourProjectRoot directory.
#
# basic settings below
# MODIFY HERE FOR YOU PROJECT !!!
#
r1="L1.1.fq.gz L2.1.fq.gz" # stLFR raw read1. use " " to seperate differnt lanes.
r2="L1.2.fq.gz L2.2.fq.gz" # stLFR raw read2. use " " to seperate differnt lanes.
USE_FILTER="yes" # yes or no. # use SOAPFilter or not
BARCODE_FREQ_THRESHOLD=2 # if the number of read pairs sharing the same barcode is smaller then BARCODE_FREQ_THRESHOLD, then discard the barcode.
MAP_RATIO=8 # merge each MAP_RATIO stLFR barcodes to 1 10X barcode
# below are baisc parameters for supernova assembler
PROJECT_NAME="Human" # supernova's --id
THREADS=80 # supernova's --localcores
MEMORY=100 #GB # supernova's --localmem
MAX_READS=1200000000 # supernova's --maxreads # for supernova version <= 2.10
#MAX_READS=2140000000 #for supernova 2.11
MINSIZE=1000 # supernova's --minsize
#
# exec path below
#
# MODIFY HERE FOR YOU ENVIROMENT !!!
# "SCRIPT_PATH" is the YOUR-INSTALL-DIR directory .
SCRIPT_PATH="~/software/stlfr2supernova/"
#"SUPERNOVA" is the executable path of supernova. Various versions are acceptable.
SUPERNOVA="~/software/supernova-2.0.0/"
#"SOAP_FILTER is the executable path of SOAPFilter
SOAP_FILTER="/hwfssz1/ST_OCEAN/USER/guolidong/stLFR/data_pipeline/SOAPfilter_v2.2.1/SOAPfilter_v2.2"
#
# intermediate files that will be generated/needed below .
# DO NOT MODIFY BELOW
# UNLESS YOU KNOW WHAT YOU ARE DOING.
#
R1="tmp_r1.fq.gz" # the symbol-link concatenating different lanes.
R2="tmp_r2.fq.gz"
SPLIT="split_reads" # the prefix of splitted reads. split_reads.1.fq.gz & split_reads.2.fq.gz
BARCODE_FREQ="./barcode_freq.txt" # the barcode frequence information
CLEAN_BARCODE_FREQ="./clean_barcode_freq.txt" # the barcode frequence information after filter.
MERGE="merge.txt" # the stLFR->10X barcode mapping information
SUPERNOVA_R1="read-R1_si-TTCACGCG_lane-001-chunk-001.fastq.gz" # the output 10X reads
SUPERNOVA_I1="read-I1_si-TTCACGCG_lane-001-chunk-001.fastq.gz" # the output 10X reads
SUPERNOVA_R2="read-R2_si-TTCACGCG_lane-001-chunk-001.fastq.gz" # the output 10X reads
- If there is no file "barcode_freq.text" (or clean_barcode_freq.txt) in the directory "YourProjectRoot", then you can re-generate it with split_reads.1.fq.gz ( or split_reads.1.fq.gz.clean.gz) following the command below
gzip -dc split_reads.1.fq.gz | awk '!(NR%4-1)' | awk -F '[# |]' '{print$2}' | awk -F '/' '{print $1}' | sort | uniq -c | awk '{printf("%s\t%s\n",$2,$1);}' > barcode_freq.txt
- Why is the number of 10X reads smaller than that of the initial stLFR clean reads ?
- 1st. Reads sharing the "0_0_0" barcode are discarded since it refers to the barcode-decode-failure.
- 2nd. Very few reads sharing the same barcode are discarded since the barcode information is useless. Your can adjust it by changing BARCODE_FREQ_THRESHOLD.
- 3rd. The number of stLFR barcodes is much larger than that of 10X barcodes. Supernova can only accept reads up to # of 10 X barcodes * MAP_RATIO. Overflow part is discarded !
- stLFR has ~1.8 billion unique barcodes
- 10X has 4,792,320 unique barcodes
- Requirements
- Linux system && Bash
- Perl
- Supernova Assembler
- SOAPFilter ( optional )
- Limitations
- We can only do what Supernova can do.
- Resources
- There are two steps that might cause huge memory cost:
- step_0_split_barode.sh . this cost depends on your raw reads.
- step_3_run_supernova.sh. this cost can be limited by MEMORY ( in profile ).
- Only step_3_run_supernova.sh suports multi-threads ( by Supernova ).
- There are two steps that might cause huge memory cost:
[2] Direct determination of diploid genome sequences
- please contact guolidong@genomics.cn
- or please creat an issue in github.
- Read1 of stLFR raw reads
There is no barcode information within read1
@CL100117953L1C001R001_1/1
GGCCGGGGACTCAGGGAGAGATTCTAAGCTCTCTGTATGTGGAAGGCTCGCGAGACAGGAAACACACAAGACACGGGCGTTGTATACAGGTTCGGGCCGC
+
FGFFFFEF9FEFFFFFCF;FEA4EFE4FFFFGFGFCFFFEEFDDFE??@FCFCF<FGFEDFFF=F@)FFFF:D;9FCFF(:GEECG8F?F>E/FFCFFF0
@CL100117953L1C001R001_3/1
CCACCACGCCATTTGCTCCGAGAAGACAGCTACAACGCATGCGTATATAATAGAAACGCTCGTGCAAAACAAACTATATATAAAAAAATGATGACCAATG
+
<BE3BBAECB@CEBFBBD2F/EA>ABABDECFEEFCCAEB@BE%>8AAEAECE/;B=E6D<F6EC@FD93DB7E@;=??E/FFDDEB;EF;EF%C4E?EE
@CL100117953L1C001R001_4/1
CACCGATCCACAAGAGGCCTTACAGAATGGGAGCCAGCGAGTTGGCGGAAGTCAAGAAGCAAGTCGATGAACAGTTATAGAAGGGATACATCCGTCCGAG
+
FFFFFFFFFFFFFFEF>FFFFFFFFFFEFFFFFFFFFFFFFFEFFFFFCEF>FF>GCAFBFFFFFF;7FEFFCFE9FAFFEFFFF<=;'8EFFFAF=FFF
- Read2 of stLFR raw reads
The last 42-bp sequence provides the stLFR barcode information within read2
*There are 3 acceptable types of stLFR raw read2: 126bp or 142bp or 154bp. If your read2 does not belong to one of them, then it will stop. *
@CL100117953L1C001R001_1/2
GGGTATTACGCTTCTCAGCGGCCCGAACCTGTATACATCGCCCGTGTCTTGTGTGTTTCCTGTCTCGCGAGCCTTCCACATACAGAGAGCTTAGAATCTCATTACTAACGTCTTNTCTACAATACGAGGTTTTTGAGGAGAC
+
FFF*BF5FFFFGFFEFFFFEFFFFG@AFFFFFFFFFFFFFFFEFFFEFFFFEFDFBFDEFFGAFFFFFF(FEFFFFFEFD<AFFF<FB7FFFFF>;3EEEFFFFCFFFFFCE@6!0FFE;FFFFFF6.*D,CAFFFFFEFFF
@CL100117953L1C001R001_3/2
CAACAGCACTAGCTGGTGTCACGTGATCATCTCTGNAGATTTTCCTCTCTCTTCGCCTGGCGATGGGTTAGCACATTGNCAGAGGTAGTATCTATCANCGCAGGTATAAGCACTNCTCGAACGCCAATATTAAAGTTCGACC
+
.14,.<84,9B9(<?8>B,381B'?+841'9/69<!46;8'+?;3.,81693)48/1@/+)819&,6+344,69;(48!98%?6/+7/%))1.,(>.!))3<>91@@DBC%C&+!.;4C.44@4;E4'%'>9)1@AE3B:3?
@CL100117953L1C001R001_4/2
AGTCAACGCACATCCTCTTGGTTTTGTCTTTCTTCTCCACAAAGATAACCGGAGCACCCCAAGGCGACGTGCTCGGACGGATGTATCCCTTCTATAACGGCGACCACTGATCTGAGGCGTTAACGCGATGATTTGACATCTC
+
EFD8DBFFEFFFFEFFFFFFFFDCDFEFEFFFFFFFFFFFCEFFEEFFFFFFDFFFFFEFFEFFFFDFFDFFFFFF>EFFD>F<6FFFD'FF9FA;,D+9FFBFFFFFFD&>5(&>FFFFDFEFFFFE(F5FDFFEFFFFFF
- Read1 of stLFR reads
The barcode information is appended at the header line. #xxx_xxx_xxx part is the barcode.
@CL100117953L1C001R001_1#844_1383_927/1 1 1
GGCCGGGGACTCAGGGAGAGATTCTAAGCTCTCTGTATGTGGAAGGCTCGCGAGACAGGAAACACACAAGACACGGGCGTTGTATACAGGTTCGGGCCGC
+
FGFFFFEF9FEFFFFFCF;FEA4EFE4FFFFGFGFCFFFEEFDDFE??@FCFCF<FGFEDFFF=F@)FFFF:D;9FCFF(:GEECG8F?F>E/FFCFFF0
@CL100117953L1C001R001_3#1469_851_342/1 2 1
CCACCACGCCATTTGCTCCGAGAAGACAGCTACAACGCATGCGTATATAATAGAAACGCTCGTGCAAAACAAACTATATATAAAAAAATGATGACCAATG
+
<BE3BBAECB@CEBFBBD2F/EA>ABABDECFEEFCCAEB@BE%>8AAEAECE/;B=E6D<F6EC@FD93DB7E@;=??E/FFDDEB;EF;EF%C4E?EE
@CL100117953L1C001R001_4#643_1473_309/1 3 1
CACCGATCCACAAGAGGCCTTACAGAATGGGAGCCAGCGAGTTGGCGGAAGTCAAGAAGCAAGTCGATGAACAGTTATAGAAGGGATACATCCGTCCGAG
+
FFFFFFFFFFFFFFEF>FFFFFFFFFFEFFFFFFFFFFFFFFEFFFFFCEF>FF>GCAFBFFFFFF;7FEFFCFE9FAFFEFFFF<=;'8EFFFAF=FFF
- Read2 of stLFR reads
The barcode information is appended at the header line. #xxx_xxx_xxx part is the barcode.
@CL100117953L1C001R001_1#844_1383_927/2 1 1
GGGTATTACGCTTCTCAGCGGCCCGAACCTGTATACATCGCCCGTGTCTTGTGTGTTTCCTGTCTCGCGAGCCTTCCACATACAGAGAGCTTAGAATCTC
+
FFF*BF5FFFFGFFEFFFFEFFFFG@AFFFFFFFFFFFFFFFEFFFEFFFFEFDFBFDEFFGAFFFFFF(FEFFFFFEFD<AFFF<FB7FFFFF>;3EEE
@CL100117953L1C001R001_3#1469_851_342/2 2 1
CAACAGCACTAGCTGGTGTCACGTGATCATCTCTGNAGATTTTCCTCTCTCTTCGCCTGGCGATGGGTTAGCACATTGNCAGAGGTAGTATCTATCANCG
+
.14,.<84,9B9(<?8>B,381B'?+841'9/69<!46;8'+?;3.,81693)48/1@/+)819&,6+344,69;(48!98%?6/+7/%))1.,(>.!))
@CL100117953L1C001R001_4#643_1473_309/2 3 1
AGTCAACGCACATCCTCTTGGTTTTGTCTTTCTTCTCCACAAAGATAACCGGAGCACCCCAAGGCGACGTGCTCGGACGGATGTATCCCTTCTATAACGG
+
EFD8DBFFEFFFFEFFFFFFFFDCDFEFEFFFFFFFFFFFCEFFEEFFFFFFDFFFFFEFFEFFFFDFFDFFFFFF>EFFD>F<6FFFD'FF9FA;,D+9
- Read1 of 10X reads
The first 23-bp sequence is the 16-bp-10X-barcode + 7-bp-adaptor within read1
@ST-E0:0:SIMULATE:8:0:0:1 1:N:0:NAAGTGCT
CGGGTGTCACGCACCAATCGAGNGGCCGGGGACTCAGGGAGAGATTCTAAGCTCTCTGTATGTGGAAGGCTCGCGAGACAGGAAACACACAAGACACGGGCGTTGTATACAGGTTCGGGCCGC
+
FFFFFFFFFFFFFFFFFFFFFF#FGFFFFEF9FEFFFFFCF;FEA4EFE4FFFFGFGFCFFFEEFDDFE??@FCFCF<FGFEDFFF=F@)FFFF:D;9FCFF(:GEECG8F?F>E/FFCFFF0
@ST-E0:0:SIMULATE:8:0:0:2 1:N:0:NAAGTGCT
TCAACGACACACATACATCGAGNCACCGATCCACAAGAGGCCTTACAGAATGGGAGCCAGCGAGTTGGCGGAAGTCAAGAAGCAAGTCGATGAACAGTTATAGAAGGGATACATCCGTCCGAG
+
FFFFFFFFFFFFFFFFFFFFFF#FFFFFFFFFFFFFFEF>FFFFFFFFFFEFFFFFFFFFFFFFFEFFFFFCEF>FF>GCAFBFFFFFF;7FEFFCFE9FAFFEFFFF<=;'8EFFFAF=FFF
@ST-E0:0:SIMULATE:8:0:0:3 1:N:0:NAAGTGCT
TAGCTCCAGTTCGCGCATCGAGNGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCTCAGCCTGCTAACTAGCTATGCGGAGCCATCCCCCCGCGGCAAG
+
FFFFFFFFFFFFFFFFFFFFFF#E>EEFFAGEFBGAEG7EDFEEGFFFFDCEEFGFBE3DEFDF3FEF>GDF;EFEF?AFG>GFFFDDGAFEEF3FE@D?-EF64EF8CEFA*FD<E+GG'D,
@ST-E0:0:SIMULATE:8:0:0:4 1:N:0:NAAGTGCT
GGAACTTAGCCCAGCTATCGAGNCGACGCGAGGCGATCCTAGGGAAAGCTCCGAAGGTAGGACCGCCTCGGCGCCATACACCAGGAAAAAGGGCGTTTCCCCCGTGGCGCGGCTTGGGGTGGT
+
FFFFFFFFFFFFFFFFFFFFFF#FFCFGFGEFFFFFFFFF>FFFCEFGFEFFGCFFFFFFC@FDDFFFFF>FGFFFFFAFFEFFEFFEEFFF9FFFCFFFFFFGFE*FFEFF<FAEF3C'FF*
- Read2 of 10X reads
There is no barcode information within read2
@ST-E0:0:SIMULATE:8:0:0:1 2:N:0:NAAGTGCT
GGGTATTACGCTTCTCAGCGGCCCGAACCTGTATACATCGCCCGTGTCTTGTGTGTTTCCTGTCTCGCGAGCCTTCCACATACAGAGAGCTTAGAATCTC
+
FFF*BF5FFFFGFFEFFFFEFFFFG@AFFFFFFFFFFFFFFFEFFFEFFFFEFDFBFDEFFGAFFFFFF(FEFFFFFEFD<AFFF<FB7FFFFF>;3EEE
@ST-E0:0:SIMULATE:8:0:0:2 2:N:0:NAAGTGCT
AGTCAACGCACATCCTCTTGGTTTTGTCTTTCTTCTCCACAAAGATAACCGGAGCACCCCAAGGCGACGTGCTCGGACGGATGTATCCCTTCTATAACGG
+
EFD8DBFFEFFFFEFFFFFFFFDCDFEFEFFFFFFFFFFFCEFFEEFFFFFFDFFFFFEFFEFFFFDFFDFFFFFF>EFFD>F<6FFFD'FF9FA;,D+9
@ST-E0:0:SIMULATE:8:0:0:3 2:N:0:NAAGTGCT
TCCGTGGCCTAAACGGCCATAGTCCCTCTAAGAAGCTAGCTGCGGAGGGATGGCTCCGCATAGCTAGTTAGCAGGCTGAGGTCTCGTTCGTTAACGGAAT
+
FFFG@FFFDFFFEFGFFFF@FGEFFFFFFF@F<FFF6FGFEFFFE;FFFAEFFFE;FFFEDFF:>FFF)DFFFFFFBB?CF@FFFF(FEF@5/D@EF91F