This repository contains commands and scripts that haven been generated and used by the Prof. Peter Dedon’s group at MIT and the Singapore-MIT Alliance for Research and Technology (SMART) for codon usage analysis in the manuscript:
“Pyridoxal phosphate-dependent biosynthesis of aminovaleramide by AvaS in tRNA”
Authors: Jingjing Sun, Junzhou Wu, Yifeng Yuan, Seetharamsing Balamkundu, Agnieszka Dziergowska, Hazel Chay Suen Suen, Dwijapriya, Liang Cui, Léo Hardy, Megan En Lee, Grazyna Leszczynska, Chuan-Fa Liu, Zeynep Baharoglu, Laurence Drouard, Steven D. Bruner, Thomas J. Begley, Valérie de Crécy-Lagard, Peter C. Dedon
Contact Yifeng Yuan at yuanyifeng@ufl.edu
Yifeng Yuan, Ph.D. and Peter C. Dedon (Principal Investigator)
Version History
v0.9.0 --2026/03/02 --create the repository and migrate data.
mafft v7.520 https://mafft.cbrc.jp/alignment/software/
BMGE v1.12 http://ftp.pasteur.fr/pub/gensoft/projects/BMGE/
MASH v2.3 https://github.com/marbl/mash
seqkit v2.8.0 https://bioinf.shenwei.me/seqkit/
python v3.12 https://www.python.org/
raxml-ng v1.1.0 https://github.com/amkozlov/raxml-ng
ncbi_blast v2.15.0 https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.15.0/
java v11 and higher https://www.oracle.com/java/technologies/downloads/#java11
1.1 Dowload genome IDs in a list. For example, Select representative genomes using filters: Organism: bacteria, Reference: representative, Quality: Good, Status: Complete.
bac_rep_4245.txt (4245 representative bacterial genomes)
Repeat the step for other genus and species.
Aeromonas_gid.txt (Aeromonas)
Flavobacterium_gid.txt (Flavobacterium)
Escherichia_gid.txt (Escherichia)
Campylobacter_gid.txt (Campylobacter)
Acinetobacter_gid.txt (Acinetobacter)
other_gid.txt (other genus including Chitinophaga, Hymenobacter, Pedobacter, Pseudoalteromonas)
vibrio_gid.txt (Vibrio)
Streptomyces_gid.txt (Streptomyces)
Shewanella_gid.txt (Shewanella)
pseudo_gid.txt (Pseudomonas)
Arcobacter_gid.txt (Arcobacter)
Chryseobacterium_gid.txt (Chryseobacterium)
Abmn_gid.txt (Acinetobacter baumannii)
Ahyd_gid.txt (Aeromonas hydrophila)
Sone_gid.txt (Shewanella oneidensis)
Vcho_gid.txt (Vibrio cholerae)
Vpara_gid.txt (Vibrio parahaemolyticus)
1.2 Retrieve data from BV-BRC (https://www.bv-brc.org/), including CDS sequences (ffn files), genome sequences (fna files) and protein sequences (faa files)
gid_list=${list of genome IDs.txt}
mkdir fnafiles || true
mkdir ffnfiles || true
mkdir faafiles || true
for id in $(cat ${gid_list}); do
wget -qN "ftp://ftp.patricbrc.org/genomes/${id}/${id}.fna" -P fnafiles
wget -qN "ftp://ftp.patricbrc.org/genomes/${id}/${id}.PATRIC.ffn" -P ffnfiles
wget -qN "ftp://ftp.patricbrc.org/genomes/${id}/${id}.PATRIC.faa" -P faafiles
done
Retrieve selected protein sequences and do MSA.
aln=${fasta file of protein sequences}
# MSA
mafft --thread 12 --maxiterate 1000 --localpair "${aln}".fasta > "${aln}"_mafft.aln
# trim alignment
java -jar /apps/bmge/1.12/bin/BMGE.jar -m BLOSUM30 -i "${aln}"_mafft.aln -t AA -of "${aln}"_BMGE.fasta
# convert fasta format to phy
python3 /blue/lagard/yuanyifeng/scripts/fasta2phy.py -i "${aln}"_BMGE.fasta -o "${aln}"_BMGE.phy
# tree building
raxml-ng --msa "${aln}"_BMGE.fasta --model LG+G+F --prefix result --seed 123 \
--search replicates=100 --threads 12 --tree pars{100},rand{100}
# bootstrap
raxml-ng --bootstrap --msa "${aln}"_BMGE.fasta --bs-trees 1000 --seed 123 \
--prefix boot --model LG+G+F
# bootstrap comparison
raxml-ng --support --tree result.raxml.bestTree --threads 12 \
--bs-trees boot.raxml.bootstraps --prefix support
# build blastDB
dir_db=${path to faa files of protein sequences}
ls ${dir_db}/*faa | xargs -i makeblastdb -in {} -dbtype prot -parse_seqids -out {}_blastdb
dir_o=outfile_cog
qfaa=${ name query sequences, e.g. AvaS}
mkdir ${dir_o} || true
for db in $(ls ${dir_db}/*faa); do
dbfaaname=$(basename $db)
dbname=${dbfaaname%%.PATRIC.faa}
out=${dir_o}/${dbname}.out
blastp -query "$qfaa" -db "$db"_blastdb -out "$out" \
-outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids salltitles" \
-num_threads 3 -max_target_seqs 10000
done
# tBlastn
# prepare a list of genome id for genomes of each genus selected. Name them "${org}"_gid.txt
dir_o=outfiles_tblastn_"${org}"
mkdir ${dir_o} || true
qfaa= ${ name query sequences }
org= ${ name the genus for searching }
dir_db=${ path to fna files }
## build blastDB
ls ${dir_db}/*fna | xargs -i makeblastdb -in {} -dbtype nucl -out {}_blastdb
for gid in $(cat "${org}"_gid.txt); do
genome="${dir_db}"/"${gid}".fna
makeblastdb -in "${genome}" -dbtype nucl -out "${genome}"_blastdb
out=${dir_o}/${gid}.aln
tblastn -query "$qfaa" -db "${genome}"_blastdb -out "$out" \
-num_threads 6 -max_target_seqs 10 \
-outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids salltitles"
#-outfmt 0
done
mash sketch -p 6 -o abs_sketch -l avaS_abs_gid.txt
mash sketch -p 6 -o pre_sketch -l avaS_pre_gid.txt
mash dist -p 6 abs_sketch.msh pre_sketch.msh > mash_result.txt
# extract mash result
for g in $(cat gid.txt | sed 's/^g//') ; do
grep "${g}.fna" mash_result.txt | sed -E 's#${ path to fna files }#g#g' | sort -k3,3n | sed 's/\.fna//g' | head -1
done
#Clean the MASH result. Then, calculate the distance between genomes so the sum of distance is minimal.
python select_minimal_value_per_gid1.py mash_result_clean.txt mash_res_minimum_by_gidSpne.txt
python select_pair_with_global_minimal_distance.py mash_result_clean.txt mash_res_global_minimum.txt
cut -d$'\t' -f2 mash_result_global_minimum.txt > other_gid.txt
cut -d$'\t' -f2 mash_res_minimum_by_gidSpne.txt >> other_gid.txt
dir_w=${ working directory to a selected group of species/genus such as Aeromonas or Pseudomonas, e.g. working_dir}
# Re-format CDS sequences (ffn files)
mkdir ${dir_w}/ffn_prep || true
# remove CDSs less than 10 amino acids.
# -l, print sequences in lower case.
# -m, print sequences >= 10 aa (33 nts).
gid_list=${list of genome IDs.txt}
for gid in $(cat gid.txt) ; do
seqkit seq -g -l -m 33 ${dir_seq}/${gid%.fna}.ffn > ${dir_w}/ffn_prep/${gid%.fna}.ffn
done
# re-format ffn files.
for gid in $(cat gid.txt) ; do
ffn=${dir_w}/ffn_prep/${gid}.ffn
sed -i '/^>/ s/ .*$/#/g' $ffn # replace space after fig number with #.
sed -i ':a;N;$!ba;s/\n//g' $ffn # delet all line breaks.
sed -i 's/>/\n>/g' $ffn # make each > a new line.
sed -i 's/#/\n/g' $ffn # make # a new line.
sed -i '/^>/! s/^...//' $ffn # remove start codon.
sed -i '/^>/! s/...$//' $ffn # remove stop codon.
sed -i '/^>/! s/.\{3\}/& /g' $ffn # insert space every triplet.
sed -i '/^$/d' $ffn # remove empty lines.
done
# count codons.
for gid in $(cat gid.txt) ; do
ffn=${dir_w}/ffn_prep/${gid}.ffn
python 1_pycodon_count.py ${ffn} ${dir_w}
done
for gid in $(cat gid.txt) ; do
file=${dir_w}/py_count/${gid}.ffn_count.txt
python3 2_count2CDScodon.py ${file}
done
# extract AUA and AUN condon number.
for file in $(ls ${dir_w}/ffn_prep/*\.ffn) ; do
python3 3_count_CDS_AUA_N.py "${file}" "${dir_w}"
done
# summerize AUA and AUN number.
output=output=${ output file, e.g. AUA_number.txt}
echo -e "gid\ttotal_nnn_a\ttotal_nnn_c\ttotal_nnn_g\ttotal_nnn_t\taua_a\taua_c\taua_g\taua_t\taua_a/aua_h" > "${output}"
for file in $(ls ${working_dir}/aua_n_count/*.PATRIC.ffn_aua_n_count.txt); do
gid=g"$(basename "$file" | sed 's/.PATRIC.ffn_aua_n_count.txt//')"
total_aua_a=$(awk -F '\t' '{print $2}' "${file}" | paste -sd+ | bc)
total_aua_c=$(awk -F '\t' '{print $3}' "${file}" | paste -sd+ | bc)
total_aua_g=$(awk -F '\t' '{print $4}' "${file}" | paste -sd+ | bc)
total_aua_t=$(awk -F '\t' '{print $5}' "${file}" | paste -sd+ | bc)
total_nnn_a=$(awk -F '\t' '{print $6}' "${file}" | paste -sd+ | bc)
total_nnn_c=$(awk -F '\t' '{print $7}' "${file}" | paste -sd+ | bc)
total_nnn_g=$(awk -F '\t' '{print $8}' "${file}" | paste -sd+ | bc)
total_nnn_t=$(awk -F '\t' '{print $9}' "${file}" | paste -sd+ | bc)
total_aua_h=$(echo "${total_aua_c} + ${total_aua_g} + ${total_aua_t}" | bc)
ratio_aua_a_over_aua_h=$(echo "scale=6; ${total_aua_a} / ${total_aua_h} " | bc)
echo -e "${gid}\t${total_aua_a}\t${total_aua_c}\t${total_aua_g}\t${total_aua_t}\t${total_nnn_a}\t${total_nnn_c}\t${total_nnn_g}\t${total_nnn_t}\t${ratio_aua_a_over_aua_h}" >> "${output}"
done
# summarize of AUA AUN ratio.
output=${ output file, e.g. AUA_ratio.txt}
echo -e "gid\ttotal_aua\ttotal_auh\ttotal_nnn\tratio_aua_syn_fold\taua_syn_gt_ratio\tratio_aua_fold\taua_gt_ratio\tnum_cds\tnum_cds_gt_aua_syn_ratio_per1000\tnum_cds_gt_aua_ratio_per100" > "${output}"
for file in $(ls ${working_dir}/CDS_codon/*_CDScodon.tsv); do
gid=g"$(basename "$file" | sed 's/.PATRIC.ffn_CDScodon.tsv//')"
total_aua=$(awk -F '\t' 'FNR>1{print $14}' "${file}" | paste -sd+ | bc)
total_auc=$(awk -F '\t' 'FNR>1{print $15}' "${file}" | paste -sd+ | bc)
total_aut=$(awk -F '\t' 'FNR>1{print $17}' "${file}" | paste -sd+ | bc)
total_auh=$(echo "${total_aua} + ${total_auc} + ${total_aut}" | bc)
ratio_aua_syn_fold=$(echo "scale=6; ${total_aua} / ${total_auh}" | bc)
total_nnn=$(awk -F '\t' 'FNR==2{print $73}' "${file}")
ratio_aua_fold=$(echo "scale=6; ${total_aua} / ${total_nnn}" | bc)
aua_syn_gt_ratio=$(awk -F '\t' -v r=${ratio_aua_syn_fold} 'FNR>1&&($14+$15+$17)>0&&$14/($14+$15+$17)>r{print $1}' "${file}" | wc -l)
aua_gt_ratio=$(awk -F '\t' -v r=${ratio_aua_fold} 'FNR>1&&($14/$68)>r{print $1}' "${file}" | wc -l)
num_cds=$(awk -F '\t' 'FNR>1{print $1}' "${file}" | wc -l)
num_cds_gt_aua_syn_ratio_per1000=$(echo "scale=6; ${aua_syn_gt_ratio} / ${num_cds} *1000" | bc)
num_cds_gt_aua_ratio_per100=$(echo "scale=6; ${aua_gt_ratio} / ${num_cds} *100" | bc)
echo -e "${gid}\t${total_aua}\t${total_auh}\t${total_nnn}\t${ratio_aua_syn_fold}\t${aua_syn_gt_ratio}\t${ratio_aua_fold}\t${aua_gt_ratio}\t${num_cds}\t${num_cds_gt_aua_syn_ratio_per1000}\t${num_cds_gt_aua_ratio_per100}" >> "${output}"
done
# 1. prepare GO annotation for each organism in EXCEL and save as xxx_go_bvbrc.txt
# convert to .gmt format
python convert_to_gmt.py Abmn_go_bvbrc.txt Abmn_go.gmt
python convert_to_gmt.py Aero_go_bvbrc.txt Aero_go.gmt
python convert_to_gmt.py PAO1_go_bvbrc.txt PAO1_go.gmt
python convert_to_gmt.py Sone_go_bvbrc.txt Sone_go.gmt
python convert_to_gmt.py Vcho_go_bvbrc.txt Vcho_go.gmt
# 2. prepare Gene list sorted with aua synonmus ratio.
for file in $(ls *.PATRIC.ffn_CDScodon.tsv); do
output=${file%%.PATRIC.ffn_CDScodon.tsv}
awk -F '\t' 'FNR>1{if ($14+$15+$17>0) {print $1"\t"$14/($14+$15+$17)} else {print $1"\t"0}}' "${file}" | sort -k2,2nr > "${output}"_CDS_aua_syn_sort.txt
done
for file in $(ls *_CDS_aua_syn_sort.txt); do
# remove some |foo bar following fig ID
sed -i 's/^>fig|/#/' "${file}"
sed -i 's/|.*\t/\t/' "${file}"
sed -i 's/^#/fig|/' "${file}"
# replace | and . with _ in fig ID.
sed -i 's/|/_/' "${file}"
sed -i 's/\./_/' "${file}"
sed -i 's/\.peg\./_peg_/' "${file}"
done
# 3. prepare Gene list sorted with aua number
for file in $(ls *.PATRIC.ffn_CDScodon.tsv); do
output=${file%%.PATRIC.ffn_CDScodon.tsv}
awk -F '\t' 'FNR>1{print $1"\t"$14}' "${file}" | sort -k2,2nr > "${output}"_aua_num_sort.txt
done
for file in $(ls *_aua_num_sort.txt); do
# remove some |foo bar following fig ID
sed -i 's/^>fig|/#/' "${file}"
sed -i 's/|.*\t/\t/' "${file}"
sed -i 's/^#/fig|/' "${file}"
# replace | and . with _ in fig ID.
sed -i 's/|/_/' "${file}"
sed -i 's/\./_/' "${file}"
sed -i 's/\.peg\./_peg_/' "${file}"
done
#set qt path to the result above
# for example
export QT_QPA_PLATFORM_PLUGIN_PATH=/usr/lib/x86_64-linux-gnu/qt5/plugins/platforms
# 3 Run gseapy preranked analysis
# run the python3 script below with sort.txt file and corresponding bvbrc go.gmt file.
# ranked gene list. column 1 gene id , column 2 value such as Foldchange, aua number
# for gsea analysis of CDSs based on the AUA codon number in Vibrio cholerae
output = ${ path to output file, e.g. aua_num_Vcho }
python3 gseapy.py 666.4624_aua_num_sort.txt Vcho_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA frequence among isoleucine codons in Vibrio cholerae
output = ${ path to output file, e.g. aua_syn_ratio }
python3 gseapy.py 666.4624_CDS_aua_syn_sort.txt Vcho_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA codon number in Acinetobacter baumannii
python3 gseapy.py 470.11567_aua_number_sort.txt Abmn_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA frequence among isoleucine codons in Acinetobacter baumannii
python3 gseapy.py 470.11567_CDS_aua_syn_sort.txt Abmn_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA codon number in Aeromonas hydrophila strain OnP3.1
python3 gseapy.py 644.638_aua_number_sort.txt Aero_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA frequence among isoleucine codons in Aeromonas hydrophila strain OnP3.1
python3 gseapy.py 644.638_CDS_aua_syn_sort.txt Aero_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA codon number in Shewanella oneidensis MR-1
python3 gseapy.py 211586.12_aua_number_sort.txt Sone_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA frequence among isoleucine codons in Shewanella oneidensis MR-1
python3 gseapy.py 211586.12_CDS_aua_syn_sort.txt Sone_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA codon number in Pseudomonas aeruginosa PAO1
python3 gseapy.py 208964.12_aua_number_sort.txt PAO1_go.gmt ${output}
# for gsea analysis of CDSs based on the AUA frequence among isoleucine codons in Pseudomonas aeruginosa PAO1
python3 gseapy.py 208964.12_CDS_aua_syn_sort.txt PAO1_go.gmt ${output}