A toolset for handling sequencing data with unique molecular identifiers (UMIs)
This toolset requires Python 3.
To install umitools, run
pip3 install umitools # add --user if you want to install it to your own directory
wget -O clipped.fq.gz "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.sRNA-seq.fq.gz"
umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq
wget -O "r1.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r1.fq.gz"
wget -O "r2.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r2.fq.gz"
umitools reformat_fastq -l r1.fq.gz -r r2.fq.gz -L r1.fmt.fq.gz -R r2.fmt.fq.gz
This command also prints some statistics for your UMI RNA-seq data.
2. Then use your favorite RNA-seq aligner (e.g., STAR) to map these reads to the genome and produce a BAM/SAM file (e.g., fmt.bam).
To download an example, run
wget -O fmt.bam https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.sorted.bam
To mark reads that are PCR duplicates (using 8 threads in this example), run
umitools mark_duplicates -f fmt.bam -p 8
This produces fmt.deumi.sorted.bam, in which reads identified as PCR duplicates carry the flag 0x400. If your downstream analysis (e.g., Picard) takes this flag into account, you are good to go! Otherwise, you can simply remove the PCR duplicates:
samtools view -b -h -F 0x400 fmt.deumi.sorted.bam > fmt.deumi.F400.sorted.bam
You can then feed the BAM file without PCR duplicates to your downstream analysis.
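The -F 0x400 filter is a bitwise test on the SAM FLAG field of each read. As a rough illustration only (this is not umitools code; the function name and the example FLAG values are hypothetical), the same check can be sketched in Python:

```python
# Bit 0x400 (decimal 1024) of the SAM FLAG field marks a PCR or optical duplicate.
DUPLICATE_FLAG = 0x400


def is_marked_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return (flag & DUPLICATE_FLAG) != 0


# Hypothetical FLAG values, for illustration only.
# 1123 = 99 + 1024 and 1171 = 147 + 1024, so both have the duplicate bit set.
flags = [99, 147, 1123, 1171]

# Keep only the reads whose duplicate bit is NOT set (what -F 0x400 does).
kept = [f for f in flags if not is_marked_duplicate(f)]
print(kept)
```

Only the duplicate bit is tested, so all other FLAG bits (paired, reverse strand, and so on) are left untouched.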
For UMI RNA-seq, the UMI locator in each read is required to exactly match GGG, TCA, or ATC. You can customize the locator sequences by setting --umi-locator LOCATOR1,LOCATOR2,LOCATOR3,LOCATOR4 when you run umi_reformat_fastq.
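To make the exact-match rule concrete, here is a minimal Python sketch. It is not the umitools implementation: the function name is hypothetical, and the position of the locator (immediately after a UMI of umi_len bases at the start of the read, with umi_len=5) is an assumption made for illustration, since this README does not state it.

```python
# Default locators named in this README.
DEFAULT_LOCATORS = ("GGG", "TCA", "ATC")


def has_valid_locator(read_seq, umi_len=5, locators=DEFAULT_LOCATORS):
    """Check that the bases right after the UMI exactly match one locator.

    ASSUMPTION: the locator starts at offset umi_len in the read, and
    umi_len=5 is an illustrative default, not a value taken from this README.
    """
    # str.startswith(prefix, start) compares the prefix at the given offset.
    return any(read_seq.startswith(loc, umi_len) for loc in locators)


print(has_valid_locator("ACGTAGGGTTTTCCCC"))  # locator GGG at offset 5
print(has_valid_locator("ACGTACCCTTTTCCCC"))  # CCC is not a default locator
```

Passing a different tuple as locators mirrors what --umi-locator does on the command line.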
For UMI small RNA-seq, the default setting requires that the 5' UMI locator in each read match NNNCGANNNTACNNN or NNNATCNNNAGTNNN, AND that the 3' UMI locator match NNNGTCNNNTAGNNN, where N positions are not required to match and at most 1 error is allowed across all non-N positions. You can customize the locator sequences for small RNA-seq by setting --umi-pattern-5 and --umi-pattern-3. You can further tweak the number of errors allowed by changing N_MISMATCH_ALLOWED_IN_UMI_LOCATOR in the script.
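The matching rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code umitools runs: the function names are hypothetical, and an "error" is taken to mean a substitution (mismatch) at a non-N position.

```python
FIVE_PRIME_PATTERNS = ("NNNCGANNNTACNNN", "NNNATCNNNAGTNNN")
THREE_PRIME_PATTERNS = ("NNNGTCNNNTAGNNN",)


def count_mismatches(seq, pattern):
    """Count mismatches at non-N pattern positions; N accepts any base."""
    if len(seq) != len(pattern):
        raise ValueError("seq and pattern must have the same length")
    return sum(1 for s, p in zip(seq, pattern) if p != "N" and s != p)


def matches_locator(seq, patterns, max_mismatches=1):
    """Return True if seq matches any pattern within max_mismatches.

    ASSUMPTION: errors are counted as substitutions only (no indels).
    """
    return any(count_mismatches(seq, p) <= max_mismatches for p in patterns)


# Fixed bases CGA and TAC both match the first 5' pattern: 0 mismatches.
print(matches_locator("AAACGATTTTACGGG", FIVE_PRIME_PATTERNS))
# One fixed base differs (CGT instead of CGA): still within the limit.
print(matches_locator("AAACGTTTTTACGGG", FIVE_PRIME_PATTERNS))
# Two or more fixed bases differ from every pattern: rejected.
print(matches_locator("AAACTTTTTTACGGG", FIVE_PRIME_PATTERNS))
```

Here max_mismatches plays the role that N_MISMATCH_ALLOWED_IN_UMI_LOCATOR plays in the script.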
A simple in silico PCR simulator for UMI reads. Run it with -h to see options.
In addition to being available as subcommands of umitools (e.g., umitools mark_duplicates), these commands can also be called individually:
- umitools reformat_fastq is equivalent to umi_reformat_fastq.
- umitools mark_duplicates is equivalent to umi_mark_duplicates.
- umitools reformat_sra_fastq is equivalent to umi_reformat_sra_fastq.
There are many tools to remove adapters; this is just one example. To process a FASTQ file (raw.fq.gz) from your UMI small RNA-seq data, first remove the 3' end small RNA-seq adapter. For example, you can use fastx_clipper from the FASTX-Toolkit with the adapter sequence TGGAATTCTCGGGTGCCAAGG:
zcat raw.fq.gz | fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 48 -c -Q33 2> raw.clipped.log | gzip -c - > clipped.fq.gz
where -l 48 specifies the minimum read length after adapter removal. This ensures that the small RNA insert in every read is at least 18 nt long (18 nt + 15 nt in the 5' UMI + 15 nt in the 3' UMI = 48 nt).
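The length arithmetic can be checked with a short Python sketch. It assumes, for illustration, that a clipped read is laid out as 5' UMI region, insert, 3' UMI region; the constant and function names are hypothetical and not part of umitools.

```python
UMI5_LEN = 15        # nt in the 5' UMI region
UMI3_LEN = 15        # nt in the 3' UMI region
MIN_INSERT_LEN = 18  # shortest insert to keep

# Minimum clipped-read length, i.e. the value passed to fastx_clipper -l.
MIN_CLIPPED_LEN = UMI5_LEN + MIN_INSERT_LEN + UMI3_LEN


def split_clipped_read(seq):
    """Split an adapter-clipped read into (5' UMI region, insert, 3' UMI region).

    ASSUMPTION: the UMI regions sit at the two ends of the clipped read.
    """
    if len(seq) < MIN_CLIPPED_LEN:
        raise ValueError("read is shorter than %d nt" % MIN_CLIPPED_LEN)
    return seq[:UMI5_LEN], seq[UMI5_LEN:-UMI3_LEN], seq[-UMI3_LEN:]


# A synthetic read of exactly the minimum length.
read = "A" * 15 + "C" * 18 + "G" * 15
umi5, insert, umi3 = split_clipped_read(read)
print(MIN_CLIPPED_LEN, len(umi5), len(insert), len(umi3))
```

A read of exactly 48 nt therefore carries an 18-nt insert, and anything shorter would leave an insert under 18 nt.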
To see which reads have improper UMIs, run
umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq --reads-with-improper-umi sra.improper_umi.fq
where sra.umi.fq contains all the non-duplicate reads, sra.dup.fq contains all duplicates, and reads with improper UMIs are written to sra.improper_umi.fq.
- Grab the version on GitHub:
git clone https://github.com/weng-lab/umitools.git
- Install it in editable mode:
pip3 install -e /path/to/umitools
Fu, Y., Wu, P.-H., Beane, T., Zamore, P.D., and Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531.
Yu Fu (Yu.Fu {at} umassmed.edu)