This program analyzes a FASTQ file and counts the number of reads falling into predefined length bins.
It produces a simplified summary of the read length distribution in a FASTQ file, which can then be used to generate a FASTX file with a similar length distribution.
The program:
- Reads a gzip-compressed FASTQ file.
- Counts the number of reads falling into each of 16 predefined length bins.
- Outputs a CSV-formatted result showing the number of reads in each bin.
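The reading step above can be sketched in C as follows. This is a minimal illustration, not the program's actual source: it reads an uncompressed stream with `fgets`, whereas the tool itself reads gzip-compressed input through zlib (where `gzgets` would play the role of `fgets`). The function names `line_length` and `next_read_length` are hypothetical, and the sketch assumes the common four-line FASTQ layout with no wrapped sequence lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Consume one line from `in` and return its length without the trailing
 * newline, or -1 at end of file. The line is read in chunks, so reads
 * longer than the buffer are still measured correctly. */
static long line_length(FILE *in)
{
    char buf[4096];
    long total = 0;
    int seen_any = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t n = strlen(buf);
        seen_any = 1;
        if (n > 0 && buf[n - 1] == '\n') {
            return total + (long)(n - 1);   /* end of line reached */
        }
        total += (long)n;                   /* partial chunk, keep reading */
    }
    /* EOF: report a final unterminated line, or -1 if nothing was read. */
    return seen_any ? total : -1;
}

/* Return the sequence length of the next four-line FASTQ record,
 * or -1 when no complete record is left. */
long next_read_length(FILE *in)
{
    long seq_len;

    assert(in != NULL);
    if (line_length(in) < 0) return -1;               /* @header    */
    if ((seq_len = line_length(in)) < 0) return -1;   /* sequence   */
    if (line_length(in) < 0) return -1;               /* + line     */
    if (line_length(in) < 0) return -1;               /* qualities  */
    return seq_len;
}
```

A caller would invoke `next_read_length` in a loop until it returns -1, incrementing the counter of the matching length bin for each value returned.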
To compile the program, you need a C compiler (such as gcc) and the zlib library installed. Use the following command:
gcc -o fastq_length_counter fastq_length_counter.c -lz
This will create an executable named fastq_length_counter.
Run the program from the command line, providing the path to a gzip-compressed FASTQ file as an argument:
./fastq_length_counter path/to/your/file.fastq.gz
The program will process the file and output the results to stdout in CSV format.
- Running the program:
./fastq_length_counter sample.fastq.gz
- Saving the output to a file:
./fastq_length_counter sample.fastq.gz > length_distribution.csv
- Processing multiple files:
for file in *.fastq.gz; do
  echo "Processing $file"
  ./fastq_length_counter "$file" > "${file%.fastq.gz}_length_distribution.csv"
done
The output is in CSV format with two columns:
- Bin: The upper limit of the length bin
- Number of Reads: The count of reads falling into that bin
Example output:
Bin,Number of Reads
10,0
100,5
1000,1000
2500,5000
...
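For illustration, output in this format could be written by a small helper like the one below. This is a sketch under assumptions, not the program's own code: the function name `write_bins_csv` and its parameters are hypothetical, and the only details taken from this document are the header text and the two-column layout.

```c
#include <assert.h>
#include <stdio.h>

/* Write the CSV header followed by one "limit,count" row per bin.
 * Returns 0 on success, -1 if any write fails. */
int write_bins_csv(FILE *out, const long *limits,
                   const unsigned long *counts, size_t n_bins)
{
    assert(out != NULL);
    if (fprintf(out, "Bin,Number of Reads\n") < 0) return -1;
    for (size_t i = 0; i < n_bins; i++) {
        if (fprintf(out, "%ld,%lu\n", limits[i], counts[i]) < 0) return -1;
    }
    return 0;
}
```

Passing `stdout` as `out` matches the behaviour described above, where results go to standard output and can be redirected to a file.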
- The program uses 16 predefined bins: 10, 100, 1000, 2500, 5000, 10000, 20000, 35000, 50000, 75000, 100000, 200000, 300000, 500000, 750000, and 1000000.
- Reads longer than 1,000,000 bases will be counted in the last bin.
- The program assumes the input file is in the standard FASTQ format and is gzip-compressed.
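The binning rule described in these notes can be sketched as a simple lookup. The bin limits are the ones listed above; everything else is an assumption made for illustration: the names `BIN_LIMITS` and `bin_index` are hypothetical, and the sketch treats each upper limit as inclusive (a 10-base read goes to bin 10), which this document does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define N_BINS 16

/* Upper limits of the 16 length bins listed in the notes. */
static const long BIN_LIMITS[N_BINS] = {
    10, 100, 1000, 2500, 5000, 10000, 20000, 35000,
    50000, 75000, 100000, 200000, 300000, 500000, 750000, 1000000
};

/* Return the index of the first bin whose upper limit is >= len.
 * Lengths above the largest limit are counted in the last bin.
 * Assumption: limits are inclusive. */
size_t bin_index(long len)
{
    assert(len >= 0);
    for (size_t i = 0; i < N_BINS; i++) {
        if (len <= BIN_LIMITS[i]) return i;
    }
    return N_BINS - 1;   /* longer than 1,000,000 bases */
}
```

With only 16 bins a linear scan is enough; a counter array indexed by the returned value (for example `counts[bin_index(len)]++`) would accumulate the distribution.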
- zlib library for reading gzip-compressed files
Remember to install zlib development files before compiling. On Ubuntu or Debian, you can do this with:
sudo apt-get install zlib1g-dev
This program is provided under the MIT License. See the source code for full license text.
Andrea Telatin, 2023 Quadram Institute Bioscience