Skip to content

Latest commit

 

History

History
228 lines (196 loc) · 5.89 KB

README.md

File metadata and controls

228 lines (196 loc) · 5.89 KB

bioinformatics-one-liners

Useful bioinformatics one-line commands.

Indexing files

Indexing gtf files with tabix can be tricky. (from Book Bioinformatics Data Skills)

(zgrep -v "^#" gencode.v32.annotation.gtf.gz | sort -k1,1 -k4,4n) | bgzip > gencode.v32.annotation.gtf.bgz \
&& tabix -p gff gencode.v32.annotation.gtf.bgz

Fetching annotations

get CpG Island annotation in BED format (ref, BEDOPS needed)

wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
  | gunzip -c \
  | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
  | sort-bed - \
  > cpgIslandExt.hg38.bed

Dealing with genomic annotations

extract genomic intervals of genes in BED format from GENCODE annotation

cat gencode.v32lift37.annotation.gtf \
  | grep -v '^#' \
  | grep '^chr' \
  | awk 'BEGIN {FS="\t"; OFS="\t"} {if ($3 == "gene") print $1,$4,$5,$9,$6,$7}' \
  | bedtools sort -i stdin \
  > gencode.v32lift37.genes.bed

annotate peaks (e.g. ChIP-seq) with closest 5 genes that are at most 3kb away

bedtools closest -a peaks.bed -b gencode.v32lift37.genes.bed -d -k 5 \
  | awk 'BEGIN {FS="\t"; OFS="\t"} {if (-3000 <= $17 && $17 <= 3000) print $1,$2,$3,$14,$16,$17} \
  > peaks.annotated.bed

E-utilities

Return a list of all Entrez database names

curl https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

GDC API

Map GDC file uuid to case barcode

curl 'https://api.gdc.cancer.gov/files/{file_uuid}?fields=cases.submitter_id&pretty=true'

samtools (doc)

sort sam/bam files

samtools sort SRR1234567/SRR1234567.bam -o SRR1234567/SRR1234567.sorted.bam --threads 4

index sam/bam files (this creates SRR1234567.sorted.bam.bai file)

samtools index SRR1234567/SRR1234567.sorted.bam

get alignments on chromosome 20 only

samtools view SRR1234567/SRR1234567.bam chr20 -b > SRR1234567/SRR1234567_chr20.bam

count the number of reads in sam/bam files

samtools view -c SRR1234567/SRR1234567.bam

count the number of mapped reads in sam/bam files

samtools view -F 4 -c SRR1234567/SRR1234567.bam

count the number of unmapped reads in sam/bam files

samtools view -f 4 -c SRR1234567/SRR1234567.bam

show statistics of sam/bam files

samtools stats SRR1234567/SRR1234567.bam

check whether my alignment file is sorted lexicographically (list all the chromosomes)

samtools view SRR1234567/SRR1234567.bam | cut -f3 | uniq

subsample aligned reads (e.g. use -s 42.1 to use random seed 42 to subsample 0.1 (10%) of reads)

samtools view -s [SEED].[PROPORTION] SRR1234567/SRR1234567.bam

Dealing with fastq files

count the number of reads in fastq files

cat SRR1234567/SRR1234567.fastq | echo $((`wc -l` / 4))

count the number of reads in gzipped fastq files

zcat SRR1234567/SRR1234567.fastq.gz | echo $((`wc -l` / 4))

concatenate multiple fastq.gz files (just use cat!)

cat sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz > concatenated.fastq.gz

fastq-dump best practice

fastq-dump --split-3 --skip-technical --gzip --readids --clip --read-filter pass <SRR->

use ascp to prefetch sra files (aspera-connect required, and use absolute path for aspera)

prefetch --ascp-path \
'/path/to/home/.aspera/connect/bin/ascp|/path/to/home/.aspera/connect/etc/asperaweb_id_dsa.openssh' \
--ascp-options -l150M SRR1234567

grep

print files in current directory which contain a pattern

grep -l <pattern> *

discard alternate contigs from bed files (i.e. drop entries at chromosome named chr~_alt or chr~_random, ...)

cat my_intervals.bed | grep -E '^chr[0-9XY]+\s' > my_intervals.filtered.bed

sed

extract desired part of each line using substitution

cat file | sed -e 's/.*\([0-9]*\).*/\1/'

join lines of two files on a common field in the first column

join file1.txt file2.txt > out.txt

comm (mnemonics: common)

print only lines common to both files

comm -12 <(sort file1) <(sort file2)

print only lines unique to first file

comm -23 <(sort file1) <(sort file2)

print only lines unique to second file

comm -13 <(sort file1) <(sort file2)

print only lines common to three files

comm -12 <(sort file1) <(sort file2) | comm -12 <(sort file3) -

tr (mnemonics: translate)

get complementary DNA sequence

echo ATGCTGTAGTC | rev | tr 'ACGT' 'TGCA'

cut

print only the second column of the tab-delimited file

cat file.txt | cut -f 2

print only the first column of the comma-delimited file

cat file.txt | cut -d',' -f 1

print the first and second columns of the comma-delimited file

cat file.txt | cut -d',' -f 1,2

convert tsv to csv

cat file.tsv | cut -f - --output-delimiter , > file.csv

print unique values in second column

cat file.tsv | cut -f2 | uniq

Linux basics

show the size of subdirectories

du -h

only show the size of current directory

du -sh

print a file except first line

cat file.txt | tail -n +2

prepend a line to a file

sed '1i header' file.txt > prepended_file.txt

iterate over files in a directory

for f in *.sra; do fastq-dump $f; done

show from the a-th line to the b-th line of a file

head file -n $b | tail -n $(($b-$a+1))

Statistics

Multiple testing correction

from statsmodels.stats import multitest
reject, pvals_corrected, alphacSidak, alphacBonf = multitest.multipletests(p_values, method='fdr_bh')