How to calculate the coverage

Coverage    =  Total number of bases / genome size

            =  (read count * read length) / genome size

So let's see how can we calculate the total number of bases in our data.

What we have here is paired end data in the form of a fastq file. Each sequence of the FASTQ file record is associated with 4 lines:

@M00704:48:000000000-AH6H7:1:1101:15536:1333 1:N:0:2 
ACATATACTTCTTCATCAGATTCATTATTATCAGTGAATGATGCTTGATCTTGTTCAGACCCCT
+
1>>1>FFFDDFDG3FGB3F1EGH3FEGFFDBFFFHFHHHFGDGHFD1BGHFHH1FGH2GHHHGG

There are multiple ways to calculate the number of sequence in a FASTQ file and in here I am showing one of such methods to calcuate the number of sequences in the FASTQ file. So to calculate the number of sequences in a FASTQ file you need to calculate the total number of lines and then divide it by 4.

In here I am using a awk command to calculate the number of sequences in a FASTQ file:

awk '{s++}END{print s/4}' Sample_R1.fastq

The above command will give you the number of sequences in a single file. As we are using paired end data, we need to multipy that number by 2 to get the total number of reads in the library.

Total number of reads        = (number of reads in a single file) * 2

As the readlength of a read is 250bp, to find the total number of bases, we need to multipy the Total number of reads by read length of 250bp.

Total number of bases       = (Total number of reads) * 250

As now we now know the total number of bases, and the estimated genome sieze, we can now calculate the coverage using the above values.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

coverage.md

coverage.md

How to calculate the coverage

Files

coverage.md

Latest commit

History

coverage.md

File metadata and controls

How to calculate the coverage