Quality control of high throughput sequence data

Matt Pinder edited this page Nov 6, 2020 · 13 revisions

Preparing for the exercises

  • The raw, compressed FASTQ files you'll be using in today's exercises can be found in
    /db/Teaching/Assembly/Sulfitobacter/raw.
  • Make sure you have an account on the computer cluster Albiorix.
    Usernames and passwords will be passed out before we begin.
  • Run the exercises in two terminal tabs - one where you log into Albiorix and one for accessing the files on your local computer.
  • After each analysis has run, take a look at the output produced, to be sure that the process hasn't encountered any unexpected errors.
    • Files to check include log files in the analysis result subfolders, and any files ending in .sge.e##### and .sge.o#####.
  1. Start by logging in to Albiorix (note - replace studentX with the username you have been given).

ssh -Y studentX@albiorix.bioenv.gu.se

  2. When you first log in you will be located in your home directory. However, for this exercise you should move to a different part of the file system.

cd /nobackup/data18/Assembly_exercise/studentX

For this part of the exercise, you will be using the files runFastQC.sge and runCutadapt.sge. These are templates for scripts that can be sent to the queue system in order to run FastQC and Cutadapt, once you have filled in the relevant commands.
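
After a job has been sent to the queue, the files ending in .sge.e##### and .sge.o##### mentioned above are the first place to look for problems. The sketch below prints every such file that has content. The function name show_sge_logs is made up for illustration and is not part of the course material. Note that a non-empty .sge.e file does not always mean failure, since some tools write progress messages there.

```shell
# Hypothetical helper: print every queue log in a folder that has content.
# Usage: show_sge_logs [DIRECTORY]   (defaults to the current folder)
show_sge_logs() {
    dir=${1:-.}
    for f in "$dir"/*.sge.e* "$dir"/*.sge.o*; do
        # -s is false for empty files and for patterns that matched nothing.
        if [ -s "$f" ]; then
            echo "== $f"
            cat "$f"
        fi
    done
}
```

Running show_sge_logs with no argument checks the current folder.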

Assessing the raw data

  1. Inspect the first few lines of the FASTQ files with zless.

    • Does the format of the files look as you expect?
  2. Run FastQC on your FASTQ files, by adding the relevant command to the file runFastQC.sge.

    • Hint - the basic form of the command, as shown by fastqc --help, should be sufficient.
    • Make sure you include the option -t $NSLOTS, to ensure you use all of your allocated resources for the task.
    • Once you're happy with the command, submit it to the queue with qsub runFastQC.sge.
  3. Explore the FastQC report (firefox *.html) and note your observations. For example:
    a. Does the per base sequence quality look acceptable?
    b. Does the per base sequence content suggest any bias?
    c. Do any adapters need removing?

If Firefox is slow, rsync the files to your local computer and view them there.
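
A minimal sketch of such a copy, to be run in the terminal tab on your local computer, might look like this. The function name fetch_reports is hypothetical, and the remote folder is passed as an argument because it depends on where your FastQC reports were written.

```shell
# Hypothetical helper, to be run in the LOCAL terminal tab.
# Usage: fetch_reports USERNAME REMOTE_DIR [LOCAL_DIR]
fetch_reports() {
    user=$1
    remote_dir=$2
    local_dir=${3:-.}
    # -a keeps timestamps and permissions, -v lists each file as it is copied.
    # The quotes stop the local shell from expanding *.html itself.
    rsync -av "${user}@albiorix.bioenv.gu.se:${remote_dir}/*.html" "$local_dir"
}
```

For example, fetch_reports studentX /nobackup/data18/Assembly_exercise/studentX would copy the reports into the current local folder, assuming they were written to your exercise directory.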

  4. Discuss your observations with another student or an instructor.
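
As a rough sketch of step 2, the FastQC call could be wrapped in a small shell function like the one below. The function name run_fastqc and the idea of a separate output folder are illustrative assumptions, not part of the course template. The options -t and -o are standard FastQC options, and the fallback ${NSLOTS:-1} is only there so the function also works outside the queue.

```shell
# Hypothetical helper: run FastQC on one or more FASTQ files.
# Usage: run_fastqc OUTPUT_DIR FILE [FILE...]
run_fastqc() {
    outdir=$1
    shift
    # FastQC expects the output folder to exist already, so create it first.
    mkdir -p "$outdir"
    # -t: number of files processed in parallel; NSLOTS is set by the queue.
    # -o: folder in which the reports are written.
    fastqc -t "${NSLOTS:-1}" -o "$outdir" "$@"
}
```

Inside runFastQC.sge this might be called as run_fastqc fastqc_raw /db/Teaching/Assembly/Sulfitobacter/raw/*.fastq.gz, assuming the raw files end in .fastq.gz.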

Trimming the raw data

  1. Run Cutadapt based on your observations, by adding the relevant command to the file runCutadapt.sge.
    • Hint - run cutadapt --help to see how the command should be formatted.
    • Make sure you include the option -j $NSLOTS, to ensure you use all of your allocated resources for the task.
    • Once you're happy with the command, submit it to the queue with qsub runCutadapt.sge.

The following Cutadapt options may be helpful:

  • -q sets the quality threshold for trimming low-quality bases from the ends of reads (the 3' end by default).
  • The options below affect only the forward (R1) reads. Use the Cutadapt help page to identify the equivalent reverse (R2) options you need.
    • -u defines the number of bases to hard trim from the 5' end of the forward reads.
    • -a defines the sequence of the 3' adapter to remove from forward reads.
    • -o defines the output file names for trimmed forward reads.
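
To show how these options fit together, here is a minimal sketch that trims forward reads only. The function name trim_forward and the values given to -q and -u are placeholders, and the adapter sequence assumes a standard Illumina library, which may not match your data. The reverse (R2) options are deliberately left out, since finding them is part of the exercise.

```shell
# Hypothetical helper: trim forward (R1) reads only.
# Usage: trim_forward INPUT_R1 OUTPUT_R1
trim_forward() {
    in_r1=$1
    out_r1=$2
    # Placeholder values: adjust -q and -u to what your FastQC report shows.
    # AGATCGGAAGAGC is the start of the standard Illumina adapter; this
    # assumes an Illumina library and should be checked against your report.
    cutadapt -j "${NSLOTS:-1}" \
        -q 20 \
        -u 5 \
        -a AGATCGGAAGAGC \
        -o "$out_r1" \
        "$in_r1"
}
```

For paired-end data, the real command in runCutadapt.sge also needs the R2 input file and the matching R2 options.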

Assessing the trimmed data

  1. Re-run FastQC on the newly-trimmed files.

  2. Explore the FastQC report, and check whether the data looks better than in the first report.

    • Hint - how many reads have been lost during quality control?
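
The "Total Sequences" value in each FastQC report answers the hint directly. As a cross-check on the command line, the sketch below counts reads in compressed FASTQ files and reports the difference. The function names are made up for illustration, and the count assumes the usual four lines per read.

```shell
# Hypothetical helpers for comparing read counts before and after trimming.

# count_reads FILE.fastq.gz -> prints the number of reads (lines / 4)
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# reads_lost RAW.fastq.gz TRIMMED.fastq.gz -> prints how many reads were removed
reads_lost() {
    raw=$(count_reads "$1")
    trimmed=$(count_reads "$2")
    echo $((raw - trimmed))
}
```

A result of 0 from reads_lost means that trimming shortened reads without discarding any of them.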

Are you happy with how your data looks?

  • No - Try rerunning Cutadapt with some different parameters, generate some new FastQC files for the results, and see whether this looks any better.
  • Yes - Excellent! You can either continue directly to the Assembly section of the exercise, or try out the alternative QC approach below.

Optional - An alternative approach

As mentioned in the lecture, Trim Galore! is a wrapper script around Cutadapt and FastQC. While it does have some limitations in how it allows you to interact with these programs, you may find it easier to work with than running Cutadapt and FastQC separately - when multiple tools are available, it's worth trying them out to see which works best for you!

  1. Re-run the trimming using trim_galore, by adding the relevant command to the file runTrimGalore.sge.

  • Hint - run trim_galore --help to see how the command should be formatted.
  • Make sure you include the option -j $NSLOTS, to ensure you use all of your allocated resources for the task.
  • Once you're happy with the command, submit it to the queue with qsub runTrimGalore.sge.
  • How do the results compare to running Cutadapt and FastQC separately?
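
One possible shape for the Trim Galore! call on a read pair is sketched below. The function name run_trim_galore is hypothetical, and the sketch relies on Trim Galore!'s default quality cutoff and adapter detection rather than on settings from the course. The options --paired, --fastqc, -j and -o are standard Trim Galore! options.

```shell
# Hypothetical helper: trim a read pair and run FastQC on the result.
# Usage: run_trim_galore OUTPUT_DIR READS_R1 READS_R2
run_trim_galore() {
    outdir=$1
    r1=$2
    r2=$3
    mkdir -p "$outdir"
    # --paired: treat the two files as forward and reverse reads of one pair.
    # --fastqc: run FastQC on the trimmed files once trimming has finished.
    trim_galore --paired -j "${NSLOTS:-1}" --fastqc -o "$outdir" "$r1" "$r2"
}
```

Writing the results to their own folder makes them easier to compare with the separate Cutadapt and FastQC runs.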
