This project analyzes the evolutionary history of the T2R38 gene (taste 2 receptor 38, also written TAS2R38), which is involved in the perception of bitter taste.
The receptor responds to bitter compounds such as glucosinolates, which are found in certain plants.
The evolution of this gene across different species is linked to their dietary preferences.
Thus, we know why the gene evolved; however, we do not yet know how. For this reason, it is necessary to build an automated process, based on the algorithms studied, capable of performing a phylogenetic analysis of T2R38.
First, two inputs are considered: (
The initial phase of the analysis reads a sequence in FASTA format. The first function, read_sequence_file, checks that the input file sequence.fasta exists and is correctly formatted, and handles any resulting exceptions.
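The reading step can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper parse_fasta, the specific checks, and the choice to read only the first record are assumptions made for the example.

```python
from pathlib import Path


def parse_fasta(text):
    """Return (header, sequence) for the first record of FASTA-formatted text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA file: first non-empty line must start with '>'")
    header = lines[0][1:]
    seq_lines = []
    for line in lines[1:]:
        if line.startswith(">"):
            break  # assumption: only the first record is needed
        seq_lines.append(line)
    sequence = "".join(seq_lines).upper()
    if not sequence:
        raise ValueError("FASTA record contains no sequence data")
    return header, sequence


def read_sequence_file(path="sequence.fasta"):
    """Check that the file exists, then parse it (hypothetical signature)."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"input file not found: {file_path}")
    return parse_fasta(file_path.read_text())
```

Keeping the parsing separate from the file access makes the format checks easy to exercise on plain strings.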
Once the input file is read, the program must determine whether it contains a DNA or a protein sequence. For this, the function determine_sequence_type is called, which returns the string "DNA" if more than 90% of the characters read correspond to nucleotides.
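A minimal sketch of this check is shown below. The 90% threshold comes from the description above; the set of letters counted as nucleotides (A, C, G, T, N) and the fallback return value "protein" are assumptions for illustration.

```python
def determine_sequence_type(sequence, threshold=0.9):
    """Classify a sequence as "DNA" when more than `threshold` of it is nucleotides."""
    nucleotides = set("ACGTN")  # assumption: N (unknown base) counts as a nucleotide
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    fraction = sum(ch in nucleotides for ch in seq) / len(seq)
    return "DNA" if fraction > threshold else "protein"
```

Note that A, C, G, and T are also valid amino acid letters, which is why a frequency threshold is used rather than a simple membership test.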
The sequence in our input file is DNA. Since the phylogenetic analysis requires a protein sequence, the DNA must be translated. This is done by calling the function find_best_reading_frame, which examines all six possible reading frames to identify the longest ORF (open reading frame).
DNA can be read in six different frames: three in the forward direction (starting at positions 0, 1, or 2) and three on the reverse-complement strand (also starting at positions 0, 1, or 2).
For each frame, the function translates the DNA sequence into amino acids using the standard genetic code: protein = seq_obj[frame:].translate() for the forward strand and rev_protein = seq_obj.reverse_complement()[frame:].translate() for the reverse strand. Each translated sequence is then split at the stop codons (represented by *), and the longest resulting segment across all six frames is selected (longest = max(orfs, key=len)).
Within the longest ORF, the first methionine (M, encoded by the start codon ATG) is then identified, on the assumption that it marks the true beginning of the protein. The resulting sequence is taken as the most likely candidate for the protein actually encoded.
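The six-frame search can be sketched in plain Python as follows. This is an illustrative stand-in, not the project's code: it replaces the Biopython calls with a hand-built standard codon table so that it runs without dependencies, and the helper names reverse_complement and translate are assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X, a trailing partial codon is dropped."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_best_reading_frame(dna):
    """Return the longest stop-free segment over six frames, trimmed to its first M."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            orfs.extend(translate(strand[frame:]).split("*"))
    longest = max(orfs, key=len)
    start = longest.find("M")
    # If the segment contains no methionine, it is returned untrimmed (an assumption).
    return longest[start:] if start != -1 else longest


print(find_best_reading_frame("GCCATGCTAGCTAGCTAG"))  # MLAS
```

In the toy input, forward frame 0 translates to AMLAS followed by a stop, which is the longest stop-free segment of the six frames; trimming to the first methionine gives MLAS.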
The translation from DNA to protein is crucial for phylogenetic analysis for both biological and technical reasons. Protein sequences tend to be more evolutionarily conserved than DNA sequences. This is because the genetic code is degenerate (multiple codons can encode the same amino acid), so many mutations at the DNA level are "silent" and do not change the resulting amino acid. Comparing proteins therefore allows for the detection of more distant evolutionary relationships.
In other words, evolutionary "noise" in DNA sequences is higher (the nucleotide substitution rate is higher than the amino acid substitution rate), which can obscure true phylogenetic relationships, especially between organisms that diverged a long time ago.
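The effect of degeneracy can be illustrated with a small example. The two DNA fragments below are invented for the illustration (they are not T2R38 data) and differ only at synonymous positions; the helper names are likewise assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons of an in-frame DNA string."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "ATGGCTAAACTG"  # toy fragment: ATG GCT AAA CTG
seq_b = "ATGGCCAAGTTG"  # toy fragment: ATG GCC AAG TTG (three silent changes)

print(identity(seq_a, seq_b))                        # 0.75
print(identity(translate(seq_a), translate(seq_b)))  # 1.0
```

Both fragments encode the same four residues (MAKL), so the three nucleotide differences disappear entirely at the protein level.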
Once the protein sequence is obtained, the program performs a BLAST search to identify homologous sequences in different species. The function run_blast sends the sequence to the NCBI server and retrieves the results. Subsequently, process_blast_hits filters the BLAST results to select sequences from different species.
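The filtering step might look like the sketch below. The text does not specify the selection criterion, so keeping the hit with the lowest E-value for each species is an assumption, as are the dictionary keys (species, evalue, accession) and the max_species parameter.

```python
def process_blast_hits(hits, max_species=10):
    """Keep one hit per species (lowest E-value), ordered from best to worst."""
    best = {}
    for hit in hits:
        current = best.get(hit["species"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["species"]] = hit
    return sorted(best.values(), key=lambda h: h["evalue"])[:max_species]


# Toy records for illustration only; these are not real BLAST results.
example_hits = [
    {"species": "Homo sapiens", "evalue": 1e-50, "accession": "hit_1"},
    {"species": "Pan troglodytes", "evalue": 1e-45, "accession": "hit_2"},
    {"species": "Homo sapiens", "evalue": 1e-80, "accession": "hit_3"},
    {"species": "Mus musculus", "evalue": 1e-20, "accession": "hit_4"},
]
selected = process_blast_hits(example_hits)
print([h["accession"] for h in selected])  # ['hit_3', 'hit_2', 'hit_4']
```

Restricting the set to one sequence per species keeps the later alignment and tree from being dominated by many near-identical hits from a single organism.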
At this point, a multiple sequence alignment of the selected sequences is performed using Clustal Omega via the function run_multiple_sequence_alignment.
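The text does not say how Clustal Omega is invoked. One possibility, sketched here under the assumption that the clustalo executable is installed locally and available on PATH, is to call it through subprocess; the helper build_clustalo_command and the file-based interface are hypothetical.

```python
import shutil
import subprocess


def build_clustalo_command(input_fasta, output_fasta):
    """Assemble the clustalo command line (FASTA output, overwrite existing file)."""
    return [
        "clustalo",
        "-i", input_fasta,    # unaligned input sequences
        "-o", output_fasta,   # aligned output
        "--outfmt=fasta",
        "--force",            # overwrite the output file if it already exists
    ]


def run_multiple_sequence_alignment(input_fasta, output_fasta):
    """Run Clustal Omega on a FASTA file and return the path of the alignment."""
    if shutil.which("clustalo") is None:
        raise RuntimeError("clustalo executable not found on PATH")
    subprocess.run(build_clustalo_command(input_fasta, output_fasta), check=True)
    return output_fasta
```

Building the command in a separate function makes it possible to inspect or log the exact invocation without running the aligner.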
Finally, the function visualize_msa creates a graphical representation of the multiple sequence alignment, facilitating the visual interpretation of conserved and variable regions among the sequences.
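The quantity that such a visualization typically highlights is per-column conservation. The sketch below computes it and renders a simple text annotation; it is an illustrative assumption about what visualize_msa could display, not a description of its actual output, and the function names and the 0.5 cut-off are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]  # gaps do not count as residues
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conservation_line(aligned):
    """'*' for fully conserved columns, '.' for majority columns, ' ' otherwise."""
    return "".join(
        "*" if score == 1.0 else "." if score >= 0.5 else " "
        for score in column_conservation(aligned)
    )


toy_alignment = ["MLAS", "MLGS", "M-AS"]  # invented example, not real data
print(conservation_line(toy_alignment))  # *..*
```

The same scores could be mapped to colors in a graphical plot, with fully conserved columns marking candidate functionally important positions.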