This repository houses my bioinformatics projects including steps and notes about each one.
In addition, this README has short explanations or introductory sections about each topic and project.
The projects include:
- Pairwise Sequence Alignment
- Multiple Sequence Alignment
- Phylogenetic Tree Reconstruction
- BLAST Search Engine
- Quality Control of Next Generation Sequencing Experiments
- NGS: Computational Analysis
- Mapping of Reads to Genome
- Variant Calling
- Unsupervised Machine Learning
- Using K-means in Weka
- KNN and Decision Tree Algorithms
- Clustering with a Genetic Algorithm
- Artificial Neural Networks
Pairwise Sequence Alignment 🛸 is a method used in bioinformatics to compare two sequences of DNA, RNA, or proteins. It identifies similar regions that may indicate a relationship between the two sequences. This relationship could be functional, structural, evolutionary, or a combination of these. The alignment can be global, comparing the sequences from beginning to end, or local, focusing on regions of high similarity within longer sequences. By introducing gaps in the sequences, pairwise alignment algorithms, such as Needleman-Wunsch for global alignment and Smith-Waterman for local alignment, maximize the alignment score based on predefined scoring matrices and gap penalties. The matrices capture how likely each substitution is, while gap penalties account for insertions and deletions, so that the resulting score reflects the biological significance of the alignment.
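
To make the idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment in plain Python. It is an illustration only, not the code used in the projects: the function name `needleman_wunsch` and the scores (`match=1`, `mismatch=-1`, `gap=-2`) are assumptions chosen for the example, and a real analysis would normally use a substitution matrix such as BLOSUM or PAM instead of a flat match/mismatch score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not tuned parameters.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based table indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Smith-Waterman follows the same table-filling idea, but clamps every cell at zero and starts the traceback from the highest-scoring cell, which is what makes the alignment local.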

Multiple Sequence Alignment (MSA) is a method used in bioinformatics to align three or more biological sequences (DNA, RNA, or proteins) simultaneously. The aim is to identify regions of similarity across all sequences, which can provide insights into their functional, structural, or evolutionary relationships. MSA is more complex than pairwise alignment due to the increased number of comparisons and the need to manage gaps across multiple sequences. Common tools for MSA include ClustalW, Clustal Omega, MUSCLE, and MAFFT, which use various strategies to balance computational efficiency with the biological relevance of the results. MSA is crucial for tasks such as phylogenetic analysis, identification of conserved motifs, and prediction of the secondary or tertiary structure of proteins.
Next Generation Sequencing (NGS) is a modern sequencing technology that allows for the massively parallel sequencing of thousands to millions of DNA fragments at once. This technology is high-throughput and enables the sequencing of entire genomes and exomes. NGS is a 'magical' upgrade when compared to Sanger sequencing, which can only sequence one DNA fragment at a time. Pretty lame, right? 😐 Well, maybe not. Anyway, NGS is crucial for genomics research, clinical diagnostics, evolutionary biology, and even personalized medicine due to its ability to generate large amounts of genetic information efficiently.
Image credit: iRepertoire, Inc. (n.d.).
Phew!!! That's a lot to read, I know 🥹. But you are not done. Keep reading, scholar 🤓. We have to learn as much as we can 🦾
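K-means clustering is a popular unsupervised learning method often applied in bioinformatics for the analysis and interpretation of complex biological data. It groups data points into k clusters by assigning each point to its nearest cluster centre. Its applications span different areas of bioinformatics and molecular biology, namely: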
- Gene Expression Analysis: K-means can be used to analyze gene expression data from microarrays. RNA-Seq data can also be clustered to find patterns in gene expression and this can help to identify genes involved in the same biological process or pathways.
- Metabolomics: K-means clustering can be applied to metabolomics data to group samples based on their metabolite profiles. This helps in the identification of metabolic signatures associated with different biological conditions or diseases.
- Genomics and Phylogenetics: K-means clustering can be applied to genomic data to identify patterns in genetic variants. This helps in understanding the genetic basis of diseases and traits. K-means can also cluster DNA, RNA, or protein sequences once they are encoded as numeric features. Grouping similar sequences in this way can support the study of evolutionary relationships, although building a phylogenetic tree itself requires dedicated tree reconstruction methods.
- Other applications: Proteomics, Medical Imaging, Single-cell RNA Sequencing
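
As a rough illustration of how K-means works on data like the gene expression example above, here is a minimal sketch in plain Python. It is not the Weka workflow used in the project: the function names `kmeans` and `sq_dist`, the `seed` parameter, and the toy expression values are all assumptions made up for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (a list of tuples) into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
    return labels, centroids


# Hypothetical expression levels of six genes under two conditions.
profiles = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8),
            (8.0, 8.2), (7.9, 8.1), (8.2, 7.8)]
labels, centroids = kmeans(profiles, k=2)
print(labels)     # the first three genes share one label, the last three the other
print(centroids)
```

Which cluster gets label 0 depends on the random start, so only the grouping is meaningful, not the label numbers. On real data the result can also vary between runs, so it is common to try several seeds and values of k.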

