- HIV-1 is classified into 4 groups: M, N, O, and P, and of these groups, group M is the most widespread and clinically relevant.
- Group M is subdivided into 9 distinct subtypes: A, B, C, D, F, G, H, J, K.
- Two or more subtypes can recombine to form a hybrid called a circulating recombinant form (CRF).
- Rates of disease progression vary among HIV-1 subtypes, with some subtypes showing increased drug resistance and virulence.
- Understanding the distribution of HIV-1 subtypes is crucial for vaccine development, and classifying the subtype of an individual's infection is a key step in its clinical management.
- HIV-1 subtyping = classifying HIV-1 into subtypes.
- Most HIV-1 subtype classification methods rely on aligning an input sequence against a set of pre-defined subtype reference sequences. These alignment-based methods are computationally expensive, especially for long sequences, and rely on ad hoc parameter settings, which can limit reproducibility.
- To address the limitations of alignment-based methods, alignment-free methods have been developed.
- Few studies (if any) have compared DNA vectorization methods for HIV-1 subtyping.
- Few studies (if any) have applied multi-task learning to HIV-1 subtype classification.
- Characterize the effect of DNA sequence vectorization methods on HIV-1 subtyping.
- Develop an improved HIV-1 subtype classification method.
- Explore different ways to vectorize DNA sequences:
  - Word2Vec
  - K-mers (of varying length)
  - Subsequence natural vector: includes number and distribution of nucleotides, position, and variance
  - Natural vector: number and distribution of nucleotides
  - Other language model encodings
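As an illustration of two of the vectorization options listed above, here is a minimal sketch of k-mer frequency vectors and a natural vector. It is not the repository's implementation. The function names `kmer_vector` and `natural_vector` are hypothetical, the alphabet is assumed to be A/C/G/T only, and the natural vector follows one common formulation (per-nucleotide count, mean position, and normalized second central moment), which may differ from the variant used in this project.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four letters


def kmer_vector(seq, k=3):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)  # avoid dividing by zero for short input
    kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[kmer] / total for kmer in kmers]


def natural_vector(seq):
    """Return [n_A..n_T, mu_A..mu_T, D2_A..D2_T] (12 values)."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in ALPHABET:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        mu_k = sum(positions) / n_k if n_k else 0.0
        d2_k = (
            sum((p - mu_k) ** 2 for p in positions) / (n_k * n) if n_k else 0.0
        )
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments
```

Both functions map a sequence of any length to a fixed-length vector (4^k values for k-mers, 12 for the natural vector), which is what makes them usable as classifier inputs without alignment.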
- Feature selection:
  - PCA
- Machine Learning & Deep Learning
  - Classical Machine Learning
    - SVM (one-vs-rest)
    - Multi-class logistic regression
    - XGBoost
    - LASSO
    - Naive Bayes
    - KNN
  - Deep Learning
    - 1D-CNN
- Evaluation Metrics
  - Confusion matrix
  - Accuracy, precision (macro-precision), recall, F1-score (macro F1-score)
  - AUROC, AUPRC
  - Cohen's kappa
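To make some of the metrics above concrete, the following is a small pure-Python sketch of a confusion matrix, macro F1-score, and Cohen's kappa. It is only illustrative: the function names are hypothetical, and the project's own code may well compute these with a library instead.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of per-class F1 (0 for a class with no TP, FP, or FN)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohens_kappa(matrix):
    """Agreement between truth and prediction, corrected for chance."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    if expected == 1:
        return float("nan")  # undefined when everything falls in one class
    return (observed - expected) / (1 - expected)


# Toy example with three subtype labels
y_true = ["B", "B", "C", "C", "A1", "A1"]
y_pred = ["B", "B", "C", "B", "A1", "C"]
cm = confusion_matrix(y_true, y_pred, labels=["A1", "B", "C"])
```

Macro averaging weights every subtype equally, which matters here because subtype frequencies in HIV-1 datasets are typically very unbalanced and plain accuracy can be dominated by the largest classes.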
- `hiv-db.tar.xz`: contains 20,386 unprocessed sequences from 289 subtypes (LANL HIV Sequence Database)
  - After processing there are 15,018 sequences and 28 subtypes.
  - Each sequence is approx. 10,000 nucleotides.
- `hiv.txt`: the sequences (`atgcgctagatcga`)
  - Each line corresponds to one sequence
- `labels.txt`: the subtype label
  - The first line in `labels.txt` corresponds to the first sequence (line) in `hiv.txt`.
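Because `hiv.txt` and `labels.txt` are paired purely by line number, it is worth checking that they stay aligned when loading them. A minimal sketch, assuming plain-text files with one entry per line (the helper name `load_dataset` is hypothetical and not part of this repository):

```python
from pathlib import Path


def load_dataset(seq_path="hiv.txt", label_path="labels.txt"):
    """Read sequences and labels, pairing them by line number."""
    sequences = [line.strip() for line in Path(seq_path).read_text().splitlines()]
    labels = [line.strip() for line in Path(label_path).read_text().splitlines()]
    if len(sequences) != len(labels):
        raise ValueError(
            f"{len(sequences)} sequences but {len(labels)} labels; "
            "the two files must have the same number of lines"
        )
    return sequences, labels
```

Calling `sequences, labels = load_dataset()` from the directory containing both files gives two lists where `labels[i]` is the subtype of `sequences[i]`.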
- `pre-process.py`: reads `hiv-db.tar.xz`
  - Removes sequences with unknown characters (e.g., N) and subtypes with too few examples
  - Creates `hiv.txt` and `labels.txt`
- `hiv.txt` can be used to vectorize the sequences.
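The two filtering steps described for `pre-process.py` could look roughly like the sketch below. This is not the script itself: `filter_dataset` and `min_count` are hypothetical names, and the threshold for "too few examples" is an assumption, since the value actually used is not stated here.

```python
from collections import Counter

VALID = set("ACGT")  # assumption: anything else (e.g. N) counts as unknown


def filter_dataset(sequences, labels, min_count=2):
    """Drop ambiguous sequences, then drop subtypes that are too rare."""
    # Step 1: keep only non-empty sequences made up of A, C, G, T.
    clean = [
        (s, y)
        for s, y in zip(sequences, labels)
        if s and set(s.upper()) <= VALID
    ]
    # Step 2: count what is left per subtype and drop the rare ones.
    counts = Counter(y for _, y in clean)
    kept = [(s, y) for s, y in clean if counts[y] >= min_count]
    return [s for s, _ in kept], [y for _, y in kept]
```

Counting subtypes after the character filter, rather than before, means a subtype is kept only if enough usable sequences remain for it.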
- `classify.py`: takes X and y as inputs, where X is a 2D feature matrix and y is a label vector
  - Before running the code, first set the configs, including the output directory and (optionally) hyperparameters.
  - Output: confusion matrices and numerical results, which are saved in your output directory.
- `feature_selection.ipynb`: notebook for dimensionality reduction (PCA, LR-Lasso, 1D-CNN). Specific instructions can be found in the notebook.