Skip to content

A Python script to filter and extract information from GTF files based on chromosome names, designed to be easily accessible for biologists without extensive programming experience.

License

Notifications You must be signed in to change notification settings

sivkri/GTF-Filter-for-Biologists

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GTF Filter for Biologists

This repository contains a simple Python script that filters a GTF (Gene Transfer Format) file based on the specified chromosome name. The script is designed to be easily accessible and understandable for biologists who may not have extensive programming experience.

Prerequisites

  • Python 3.x

Usage

  1. Clone the repository to your local machine or download the script file directly.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where the script is located.
  4. Run the script with the following command:
python gtf_filter.py <gtf_file> <chromosome_name> <output_file>

Replace <gtf_file> with the path to your GTF file, <chromosome_name> with the desired chromosome name, and <output_file> with the name of the output file.

  • The <gtf_file> should be a valid GTF file containing gene annotations.
  • The <chromosome_name> parameter should be provided in the appropriate format for your GTF file. For ENSEMBL, use chromosome numbers (e.g., 1, 2, 3, etc.). For UCSC, use the "chr" prefix followed by the chromosome number (e.g., chr1, chr2, chr3, etc.).
  • The <output_file> parameter specifies the name of the file where the filtered results will be saved. This file will be created in the same directory as the script.

Example

Suppose you have a GTF file named annotations.gtf and you want to extract gene annotations for chromosome 1 in ENSEMBL format. You can run the script as follows:

python gtf_filter.py annotations.gtf 1 filtered_annotations.gtf

This command will filter the annotations.gtf file and save the filtered results in a new file named filtered_annotations.gtf.

Output

The script will read the input GTF file line by line and write matching lines for the specified chromosome to the output file. The output file will contain the filtered gene annotations for the requested chromosome.

Please note that the script assumes the input GTF file is correctly formatted and follows the GTF specification.

Extract information using python script from .gtf file

To filter and extract information based on your input

The below procedure can be carrried on linux commands

Example Usage

python extractGTF.py homo_sapiens_short.gtf 11 11_result.txt

Explanation of input sys.argv[1] # a gtf file name i.e., homo_sapiens_short.gtf

sys.argv[2] # the requested chromosome name (seqname); for ENSEMBL use e.g. 11, for UCSC: chr11, etc.

sys.argv[3] # open the output file i.e., 11_result.txt

Alternative option

If you want to shortlist based on "feature" then replace line.split('\t')[0] with line.split('\t')[2]

Under sys.argv[2], provide input as "start_codon" or "CDS"

Traditional usage

Step0. Unzipping and extracting the contents of .gz file

gunzip 1.gtf.gz

ls # check if 1.gtf.gz has been unzipped to 1.gtf

Step1. View the basic information of the file

cat 1.gtf | head Display the first 10 lines of 1.gtf file

cat 1.gtf | tail Display the last 10 lines of the 1.gtf file

cat 1.gtf | head -15 Display the first 15 lines of the 1.gtf file (the input value 15 can be replaced by other integers)

ls -lh 1.gtf Display the size of the 1.gtf file

wc -l 1.gtf Statistics 1.gtf file line number

Use grep -v to exclude comment line (parts beginning with #) and blank lines with a length of 0

'^' matches the beginning of the line, '$' matches the end of the line

'^#' matches lines starting with '#'

If '^$' can match a certain line, it means that the line is empty (the beginning of the line is followed by the end of the line, and there are no other characters in between)

grep -v "^#" 1.gtf | grep -v '^$' | wc -l

Filter blank empty lines (in addition to line breaks, lines that may also include blank characters, such as spaces and tabs), display the first 10 lines of results '\s' matches a blank character, '*' means that such a character will appear 0 or more times, '^\s*$' indicates that there are only 0 or more blank characters between the beginning and end of a line

cat 1.gtf | awk '$0!~/^\s*$/{print}' | head -10

grep -v '^\s*$' 1.gtf | head -10

step2. Data extraction

step2.1 Filter specific columns

Select the data of 1-3 columns (the following two commands are available)

awk's default line separator is space " " and tab "\t"

After awk divides each row into columns by delimiter, the values of columns 1, 2, and 3 can be obtained through $1, $2, and $3 ($0 represents the content of the entire row)

cat 1.gtf | awk ' { print $1, $2, $3 } ' | head

The default delimiter for cut is "\t"

cat 1.gtf | cut -f 1,2,3 | head

Eg. For example, I only need columns 1, 3, 4, and 5 of the GTF file, which are chr, feature, start, and end.

cut -f 1,3,4,5 1.gtf | head

step2.2 Filter specific rows

Suppose we want to extract the row whose third column is gene, and only display the information in columns 1, 3, and 9.

awk separates each line into columns according to the default line separator, and prints out columns 1, 3, and 9 for the line whose third column is equal to "gene"

cat 1.gtf | awk '$3 == "gene" { print $1, $3, $9 } ' | head

step3. Extract and calculate specific features

step3.1 Extract and count featrue types

grep -v '^#' 1.gtf |awk '{print $3}'| sort | uniq -c #Extract and count how many types of features there are

step3.2 Calculate the feature length of a specific feature

The value in the 5th column minus the value in the 4th column + 1, that is, the length of the feature feature

The coordinates of the gff/gtf file start from 1, and the range is a closed interval (the coordinates of the bed file we will encounter later start from 0, and the range is left closed and right open)

cat 1.gtf | awk ' { print $3, $5-$4 + 1 } ' | head

Calculate the total length of all CDS

cat 1.gtf | awk 'BEGIN{size=0;}$3 =="CDS"{ len=$5-$4 + 1; size += len; print "Size:", size } ' | tail -n 1

Or use awk to only output the statistical results at the end:

cat 1.gtf | awk 'BEGIN{L=0;}$3 =="CDS"{L+=$5-$4 + 1;}END{print L;}'

Or use the feature of awk automatic initialization:

cat 1.gtf | awk '$3 =="CDS"{L+=$5-$4 + 1;}END{print L;}'

Calculate the average length of chromosome 1 cds

awk can read input from both pipe and file

awk 'BEGIN {s = 0;line = 0;}$3 =="CDS" && $1 =="I"{ s += $5-$4+1;line += 1}END {print "mean="s/ line}' 1.gtf

Step3.3 Separate and extract the gene name

Separate and extract the gene name from the gtf file and calculate its length

split is a built-in function of awk, because the string is split according to the specified delimiter. It takes three parameters as input string, output list and delimiter

Here x is the output list, no prior declaration is required in awk

The list subscript of awk starts from 1

gsub is also a built-in function of awk, used to replace

gsub("\"", "", name) is to remove the quotation marks in the name. " itself is a special character in awk. \ is the escape symbol, \" tells awk to treat the " here as a normal character

cat 1.gtf | awk '$3 == "gene"{split($10,x,";");name = x[1];gsub("\"", "", name);print name,$5- $4+1}' | head

Step4. Extract the data and store it in a new file

This stage is mainly to learn to extract data and save it into a new file, for example, find the 3 exons with the longest length, and report their length.

Two methods are introduced here.

The first is to directly extract and calculate the longest 3 exons, report their length, and save them in a .txt file;

The second method is to write an executable file run.sh, find the 3 longest exons, and report their lengths.

step4.1 Demonstration of extracting data and storing it in a txt file

We use the output redirection operator > to save what would be printed to the terminal screen by default into a disk file 1.txt.

grep exon 1.gtf | awk '{print $5-$4+1}' | sort -n | tail -3 > 1.txt

If 1.txt is an existing file, the content of the output redirection operator > will overwrite the original file.

If we want to append the output to the redirected file instead of overwriting the original file when the redirected file exists, we can use the >> operator.

Then enter the command less 1.txt or vi 1.txt to enter the vi general mode interface to display the output results.

At this time, press :q or :wq as input method state to return to the terminal shell window.

When typing less to view a file, you can also use q to exit the viewing mode.

Copy 1.txt to /home/, and you can view the 1.txt file in the shared directory of the host.

step4.2 Executable file editing demonstration

The first step is to enter the command to enter the vi editing interface.

vi run.sh

In the second step, after pressing the i key to switch to the insert mode, write down the contents of the rush.sh file as follows:

#!/bin/bash

grep exon *.gtf | awk '{print $5-$4+1}' | sort -n | tail -3

#!/bin/bash on the first line tells the operating system to use /bin/bash as the interpreter to run the script

The third step is to press Esc or ctrl+[ to switch back to the normal mode, enter :wq to exit the vi editor, and type after the command line:

Give the script executable permissions

chmod u+x run.sh

run the script

./run.sh

Since we added the #!/bin/bash line, the operating system will use /bin/bash to run the script. If run.sh is not given executable permission, running ./run.sh will prompt permission denied, However if you manually specify the interpreter and use it bash run.sh, can also run normally

The output is consistent with the content of 1.txt.

About

A Python script to filter and extract information from GTF files based on chromosome names, designed to be easily accessible for biologists without extensive programming experience.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages