Cleaning and preprocessing RNA-Seq data
Direct sequencing of complementary DNA (cDNA) with high-throughput sequencing technologies (RNA-Seq) gives a very accurate picture of the transcriptome and its isoforms. The process involves multiple stages, such as reverse transcription, amplification, fragmentation, purification, adapter ligation and sequencing [1].
In theory, RNA-Seq should let us accurately identify and quantify all RNA species, small or large, in low or high abundance. However, a mistake in any of the necessary steps can produce partial or even unusable data. In addition, the intrinsic biases of RNA-Seq (such as GC bias and nucleotide composition bias) and the complexity of the transcriptome can also make the data imperfect. Therefore, a comprehensive quality evaluation and preprocessing of the raw data are the first and most critical steps for all subsequent analyses and for the correct interpretation of the results [2].
There are many tools for quality control and preprocessing. For example, FastQC [3] is a Java-based tool that provides per-base and per-read quality profiles. Cutadapt [4] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequences. Trimmomatic [5] is another adapter-removal tool that can also filter reads based on their quality.
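To make the idea of a per-base quality profile concrete, here is a minimal sketch in Python. It is not how FastQC is implemented; it only illustrates the statistic, assuming uncompressed four-line FASTQ records with Phred+33 quality encoding. The function name `per_base_mean_quality` and the two toy reads are hypothetical.

```python
from collections import defaultdict


def per_base_mean_quality(fastq_lines, phred_offset=33):
    """Return the mean Phred quality at each read position.

    Assumes plain 4-line FASTQ records (header, sequence, '+', qualities)
    and Phred+33 encoding; both are assumptions of this sketch.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    # The quality string is the 4th line of every record (indices 3, 7, 11, ...).
    for i in range(3, len(fastq_lines), 4):
        quality = fastq_lines[i].strip()
        for pos, char in enumerate(quality):
            totals[pos] += ord(char) - phred_offset
            counts[pos] += 1
    return [totals[pos] / counts[pos] for pos in sorted(counts)]


# Two toy reads of different lengths (hypothetical data).
records = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACG", "+", "I5!",
]
print(per_base_mean_quality(records))  # [40.0, 30.0, 20.0, 40.0]
```

A drop in the mean quality towards the 3' end of the reads is the kind of pattern that usually motivates quality trimming.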
A typical combination is to use FastQC for quality control, Cutadapt to remove the adapters and Trimmomatic to filter the reads. However, each tool has to read and load the data again, which makes preprocessing slow and inefficient, with many disk reads and writes. AfterQC [6] is a complete tool that can perform all the necessary operations but, being implemented in Python, it is relatively slow.
For all these reasons we will use fastp [7], a tool developed in C++ with multi-threading support. fastp incorporates most of the features of FastQC + Cutadapt + Trimmomatic + AfterQC while running 2 to 5 times faster than any of them individually. It performs quality control, adapter removal, read filtering and base correction in a single pass, and it also supports paired-end reads.
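As an illustration of the single-pass approach, the following Python sketch assembles a fastp command line for one paired-end sample. The file names, the report prefix and the helper `build_fastp_command` are hypothetical, and the option set is only one plausible choice, not necessarily the one used in this project. The sketch only builds and prints the command; running it requires fastp to be installed.

```python
import shlex


def build_fastp_command(r1_in, r2_in, r1_out, r2_out, report_prefix, threads=4):
    """Build a fastp command for one paired-end sample (illustrative options)."""
    return [
        "fastp",
        "--in1", r1_in, "--in2", r2_in,      # raw paired-end reads
        "--out1", r1_out, "--out2", r2_out,  # cleaned reads
        "--detect_adapter_for_pe",           # adapter auto-detection for paired-end data
        "--correction",                      # base correction in overlapping regions
        "--thread", str(threads),
        "--html", f"{report_prefix}.html",   # per-sample HTML report
        "--json", f"{report_prefix}.json",   # per-sample JSON report
    ]


# Hypothetical file names, for illustration only.
cmd = build_fastp_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "clean/sample_R1.fastq.gz", "clean/sample_R2.fastq.gz",
    "reports/sample_fastp",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when sample names contain spaces or special characters.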
Although fastp produces complete quality reports, it shares a drawback with the rest of the quality control tools: it generates one report per sample. MultiQC [8] can summarize all the reports in a single report, giving a more global view of the quality of our data. It supports multiple tools but, at the time of writing, had not yet implemented integration with fastp. For this reason, even if it is redundant, we will also run FastQC before and after preprocessing, and summarize the reports for each group with MultiQC.
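The pre and post quality control step described above could be scripted roughly as follows. This is a sketch under assumptions: the directory layout (`qc/pre`, `qc/post`, `qc/summary_*`), the file names and the helper functions are hypothetical, and the commands are only built and printed, not executed.

```python
import shlex


def build_fastqc_command(fastq_files, out_dir, threads=2):
    """Build a FastQC command that writes its reports to out_dir."""
    return ["fastqc", "--outdir", out_dir, "--threads", str(threads), *fastq_files]


def build_multiqc_command(reports_dir, out_dir):
    """Build a MultiQC command that summarizes every report found in reports_dir."""
    return ["multiqc", reports_dir, "--outdir", out_dir]


# Hypothetical layout: raw reads are checked before cleaning, cleaned reads after.
raw_reads = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]
clean_reads = ["clean/sample_R1.fastq.gz", "clean/sample_R2.fastq.gz"]

commands = [
    build_fastqc_command(raw_reads, "qc/pre"),
    build_fastqc_command(clean_reads, "qc/post"),
    build_multiqc_command("qc/pre", "qc/summary_pre"),
    build_multiqc_command("qc/post", "qc/summary_post"),
]
for command in commands:
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Keeping the pre and post reports in separate directories yields one MultiQC summary per group, which makes the before/after comparison straightforward.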
[1] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: A revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[2] X. Li, A. Nair, S. Wang, and L. Wang, “Quality control of RNA-seq experiments,” in RNA Bioinformatics, 2015, pp. 137–146.
[3] S. Andrews, “FastQC: A quality control tool for high throughput sequence data,” http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, 2010.
[4] M. Martin, “Cutadapt removes adapter sequences from high-throughput sequencing reads,” EMBnet.journal, vol. 17, no. 1, p. 10, 2011.
[5] A. M. Bolger, M. Lohse, and B. Usadel, “Trimmomatic: A flexible trimmer for Illumina sequence data,” Bioinformatics, vol. 30, no. 15, pp. 2114–2120, 2014.
[6] S. Chen, T. Huang, Y. Zhou, Y. Han, M. Xu, and J. Gu, “AfterQC: Automatic filtering, trimming, error removing and quality control for fastq data,” BMC Bioinformatics, vol. 18, 2017.
[7] S. Chen, Y. Zhou, Y. Chen, and J. Gu, “fastp: an ultra-fast all-in-one FASTQ preprocessor,” bioRxiv, p. 274100, 2018.
[8] P. Ewels, M. Magnusson, S. Lundin, and M. Käller, “MultiQC: Summarize analysis results for multiple tools and samples in a single report,” Bioinformatics, vol. 32, no. 19, pp. 3047–3048, 2016.