RNA-seq Quality Assessment Assignment - Bi 623 (Summer 2025 Assignment/PS 2)

Overall assignment:

In this assignment, you will process electric organ and/or skeletal muscle RNA-seq reads for a future differential gene expression analysis. We will be completing the differential gene expression analysis in our last bioinformatics assignment of this class. You will learn how to use existing tools for quality assessment and read trimming, compare quality assessments to those created by your own software, and how to align and count reads. Additionally, you will learn how to summarize important information in a high-level report. You should create a cohesive, well written report for your "PI" about what you've learned about/from your data.

Be sure to upload all relevant materials by the deadline and double check to be sure that your offline repository is up-to-date with your online repository. Answers to questions should be included in your final, high-level, report as a pdf. This pdf should be generated using Rmarkdown and submitted to Canvas as well as GitHub. Be sure to keep a well-organized, detailed lab notebook!

Dataset:

Each of you will be working with 2 RNA-seq files from two different electric fish studies (PRJNA1005245 and PRJNA1005244). The methods for the PRJNA1005244 dataset are published and the methods for the PRJNA1005245 dataset are written in the third chapter of a thesis. For all steps below, process the two libraries separately. SRR assignments are here: /projects/bgmp/shared/Bi623/PS2/QAA_data_Assignments.txt. If you have time, consider claiming and processing additional RNA-seq raw sequencing files via this google doc. Although this is not extra credit, it will make our downstream RNA-seq analysis more interesting and your classmates will appreciate your efforts.

You are responsible for downloading this data from NCBI SRA, dumping into FASTQ files, and zipping those files. We are processing this data for use in a future assignment, so please keep your files well organized. Finally, rename the files to reflect Species_sample_tissue_age/size_sample#_readnumber.fastq.gz.

Reminder: This template file IS not your final product; however, it gives you a space to record all of the necessary information for your final report.

Part 1 – Read quality score distributions

Create a new conda environment called QAA and install FastQC, cutadapt, and Trimmomatic. Google around if you need a refresher on how to create conda environments. Recommend doing this in an interactive session, not the login node! Record details of how you created this environment in your lab notebook! Make sure you check your installation with:
- fastqc --version (should be 0.12.1)
Using FastQC via the command line on Talapas, produce plots of the per-base quality score distributions for R1 and R2 reads. Also, produce plots of the per-base N content, and comment on whether or not they are consistent with the quality score plots.
Run your quality score plotting script from your Demultiplexing assignment in Bi622. (Make sure you're using the "running sum" strategy!!) Describe how the FastQC quality score distribution plots compare to your own. If different, propose an explanation. Also, does the runtime differ? Mem/CPU usage? If so, why?
Comment on the overall data quality of your two libraries. Go beyond per-base qscore distributions. Examine the FastQC documentation for guidance on interpreting results and planning next steps. Make and justify a recommendation on whether these data are of high enough quality to use for further analysis.

Part 2 – Adaptor trimming comparison

If you haven't already in your QAA environment, install Cutadapt and Trimmomatic. Check your installations with:
- cutadapt --version (should be 5.0)
- trimmomatic -version (should be 0.39)
Using Cutadapt, properly trim adapter sequences from your assigned files. Be sure to read how to use Cutadapt. Use default settings. What proportion of reads (both R1 and R2) were trimmed?

Try to determine what the adapters are on your own. If you cannot (or if you do, and want to confirm), click here to see the actual adapter sequences used.

R1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

R2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
- Sanity check: Use your Unix skills to search for the adapter sequences in your datasets and confirm the expected sequence orientations. Report the commands you used, the reasoning behind them, and how you confirmed the adapter sequences.
Use Trimmomatic to quality trim your reads. Specify the following, in this order:
- HEADCROP: 8 bases
- LEADING: quality of 3
- TRAILING: quality of 3
- SLIDING WINDOW: window size of 5 and required quality of 15
- MINLENGTH: 35 bases
Be sure to output compressed files and clear out all intermediate files.
Plot the trimmed read length distributions for both paired R1 and paired R2 reads (on the same plot - yes, you will have to use Python or R to plot this. See ICA4 from Bi621). You can produce 2 different plots for your 2 different RNA-seq samples. There are a number of ways you could possibly do this. One useful thing your plot should show, for example, is whether R1s are trimmed more extensively than R2s, or vice versa. Comment on whether you expect R1s and R2s to be adapter-trimmed at different rates and why.
Bonus - Run FastQC on your trimmed data. Comment on differences you observe between the trimmed and untrimmed data. Include any figures needed to support your conclusions.

Part 3 – Alignment and strand-specificity

Install additional software for alignment and counting of RNA-seq reads. In your QAA environment, use conda to install:
- Star
- Picard
- Samtools
- NumPy
- Matplotlib
- HTSeq
Download the publicly available Campylomormyrus compressirostris genome fasta and gff file from Dryad and generate an alignment database from it. If the download fails, the files are available /projects/bgmp/shared/Bi623/PS2/campylomormyrus.fasta, /projects/bgmp/shared/Bi623/PS2/campylomormyrus.gff. Align the reads to your C. compressirostris database using a splice-aware aligner. Use the settings specified in PS8 from Bi621.

Important

You will need to use gene models to perform splice-aware alignment, see PS8 from Bi621. You may need to convert the gff file into a gtf file for this to work successfully.

Remove PCR duplicates using Picard MarkDuplicates. You may need to sort your reads with samtools before running Picard.

Use the following for running picard: picard MarkDuplicates INPUT=[FILE] OUTPUT=[FILE] METRICS_FILE=[FILENAME].metrics REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT

Using your script from PS8 in Bi621, report the number of mapped and unmapped reads from each of your 2 sam files post deduplication with picard. Make sure that your script is looking at the bitwise flag to determine if reads are primary or secondary mapping (update/fix your script if necessary).
Count deduplicated reads that map to features using htseq-count. You should run htseq-count twice: once with --stranded=yes and again with --stranded=reverse. Use default parameters otherwise. You may need to use the -i parameter for this run.
Demonstrate convincingly whether or not the data are from "strand-specific" RNA-Seq libraries and which stranded= parameter should you use for counting your reads for a future differential gene expression analyses. Include any commands/scripts used. Briefly describe your evidence, using quantitative statements (e.g. "I propose that these data are/are not strand-specific, because X% of the reads are y, as opposed to z."). This kit was used during library preparation. This paper may provide helpful information.

Tip

Recall ICA4 from Bi621.

BONUS - Turn your commands from part 1 and 2 into a script with a loop going through your two SRA files

Bonus (optional!)

Review the publication from PRJNA1005244 or the third chapter of the thesis for the PRJNA1005245 dataset. See if this information leads to any additional insight of your analysis.

Upload your:

lab notebook
Talapas batch script/code
FastQC plots
counts files generated from htseq-count (in a folder would be nice; only include the counts files that would be used in a future differential RNA-seq analysis and follow the naming conventions from the RNAseq meta data file: YourSubmittedHTSeqCountsFileName column)
pdf report (see below; turn into both Github AND Canvas)
and any additional plots, code, or code output

to GitHub.

Pdf report details

You should create a pdf file (using Rmarkdown) with a high-level report including:

all requested plots
answers to questions
mapped/unmapped read counts from PS8 script (in a nicely formatted table)
It should be named QAA_report.pdf
Include at the top level of your repo
ALSO, submit it to Canvas.

Tip

You may need to install LaTeX to knit your rmarkdown into a pdf file. Run tinytex::install_tinytex() to install it on R.

The three parts of the assignment should be clearly labeled. Be sure to title and write a descriptive figure caption for each image/graph/table you present.

Tip

Think about figure captions you've read and discussed in Journal Club. Find some good examples to model your captions on.

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
metadata		metadata
.gitignore		.gitignore
Assignment2_template.Rmd		Assignment2_template.Rmd
README.md		README.md
RNAseq_meta_campys.csv		RNAseq_meta_campys.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RNA-seq Quality Assessment Assignment - Bi 623 (Summer 2025 Assignment/PS 2)

Overall assignment:

Dataset:

Part 1 – Read quality score distributions

Part 2 – Adaptor trimming comparison

Part 3 – Alignment and strand-specificity

Bonus (optional!)

Upload your:

Pdf report details

About

Uh oh!

Releases

Packages

hopehealey/QAA

Folders and files

Latest commit

History

Repository files navigation

RNA-seq Quality Assessment Assignment - Bi 623 (Summer 2025 Assignment/PS 2)

Overall assignment:

Dataset:

Part 1 – Read quality score distributions

Part 2 – Adaptor trimming comparison

Part 3 – Alignment and strand-specificity

Bonus (optional!)

Upload your:

Pdf report details

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages