For this project you will use a masked language model (MLM) (Gankin et al., 2023) trained on mammalian 3'UTR regions.
3'UTR regions have important roles in mRNA regulation; some of these roles are concisely described in (Hong and Jeong, 2023).
In particular, you will assess to what extent the MLM is able to reconstruct human RNA binding protein (RBP) motifs (or their binding sites), similarly to Fig. 2 in (Gankin et al., 2023).
RBP binding sites can be selected based on (Dominguez et al., 2018). In particular, you may want to consider the top 5-mer motif for each protein listed in Table S3. You could start with proteins with significant overlap in the 5-mers comprising the RBNS and eCLIP logos (Fig. S3).
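Once a 5-mer has been chosen, its occurrences in the 3'UTR sequences need to be located before any per-motif statistic can be computed. The following is a minimal sketch of that step, not code from the repository: `read_fasta` and `find_motif` are hypothetical helper names, and the 5-mer in the demo is only a placeholder rather than a value taken from Table S3. It assumes the FASTA file uses the DNA alphabet, so any U in a motif is converted to T.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_motif(seq: str, motif: str) -> List[int]:
    """Return 0-based start positions of all matches, overlapping ones included."""
    # Assumption: sequences are stored as DNA, so RNA motifs are mapped U -> T.
    seq = seq.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    starts, pos = [], seq.find(motif)
    while pos != -1:
        starts.append(pos)
        pos = seq.find(motif, pos + 1)
    return starts


if __name__ == "__main__":
    # Toy input; in practice pass an open file handle for the 3'UTR FASTA file.
    demo = [">utr1", "AAGCATGTT", "GCATGA", ">utr2", "cccc"]
    for name, seq in read_fasta(demo):
        print(name, find_motif(seq, "GCAUG"))  # placeholder 5-mer
```

Passing an open file object instead of the `demo` list should work the same way, since the reader only iterates over lines.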
For each selected motif, you will answer the following questions:
- Is the masked nucleotide probability for the motif higher than for a random 5-mer?
- Does the MLM give a higher masked nucleotide probability than simple 11-mer and dinucleotide models?
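The first question could be approached roughly as follows. This is a sketch under stated assumptions: it supposes that the probability the model assigns to the true nucleotide at each position is already available as a list `probs` (how to obtain it from the test output is not specified here), and that motif start positions are known. `window_means` and `random_starts` are hypothetical helper names.

```python
import random
from statistics import mean
from typing import List, Sequence


def window_means(probs: Sequence[float], starts: Sequence[int], k: int = 5) -> List[float]:
    """Mean probability of the true nucleotide over each k-long window."""
    return [mean(probs[s:s + k]) for s in starts if s + k <= len(probs)]


def random_starts(seq_len: int, n: int, k: int,
                  exclude: Sequence[int], rng: random.Random) -> List[int]:
    """Sample up to n window starts that do not overlap any motif window."""
    blocked = set()
    for s in exclude:
        # A window starting at t overlaps [s, s + k) iff s - k < t < s + k.
        blocked.update(range(s - k + 1, s + k))
    allowed = [t for t in range(seq_len - k + 1) if t not in blocked]
    return rng.sample(allowed, min(n, len(allowed)))


if __name__ == "__main__":
    # Toy data (assumption): one motif at positions 5..9 with high probability.
    probs = [0.25] * 20
    for i in range(5, 10):
        probs[i] = 0.9
    motif_starts = [5]
    background = random_starts(len(probs), 5, 5, motif_starts, random.Random(0))
    print("motif:", mean(window_means(probs, motif_starts)))
    print("random 5-mers:", mean(window_means(probs, background)))
```

On real data the two sets of window means would normally be compared with a statistical test rather than by their averages alone. The same `window_means` call can be applied to the probabilities from each baseline model to address the second question.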
Basically, you will need to reproduce Fig. 2 from (Gankin et al., 2023) but using human RBP motifs instead of yeast motifs.
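For the comparison against simple models, count-based baselines are needed. The sketch below shows one plausible reading of them and is not taken from either repository: a context model that predicts a nucleotide from its flanking nucleotides, where `left=5, right=5` gives an 11-mer model and `left=1, right=0` gives a dinucleotide (first-order Markov) model. The exact baseline definitions should be checked against (Gankin et al., 2023); `fit_context_model` and the pseudocount smoothing are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable

ALPHABET = "ACGT"


def fit_context_model(seqs: Iterable[str], left: int, right: int,
                      pseudocount: float = 1.0) -> Callable[[str, int], Dict[str, float]]:
    """Count the centre nucleotide for every (left flank, right flank) context."""
    counts = defaultdict(Counter)
    for seq in seqs:
        seq = seq.upper()
        for i in range(left, len(seq) - right):
            if seq[i] in ALPHABET:
                ctx = (seq[i - left:i], seq[i + 1:i + 1 + right])
                counts[ctx][seq[i]] += 1

    def predict(seq: str, i: int) -> Dict[str, float]:
        """Return P(nucleotide at i | flanks); unseen contexts give a uniform guess."""
        seq = seq.upper()
        ctx = (seq[max(i - left, 0):i], seq[i + 1:i + 1 + right])
        c = counts.get(ctx, Counter())
        total = sum(c.values()) + pseudocount * len(ALPHABET)
        return {n: (c[n] + pseudocount) / total for n in ALPHABET}

    return predict


if __name__ == "__main__":
    train = ["ACGTACGTACGT"]  # toy training data (assumption)
    dinucleotide = fit_context_model(train, left=1, right=0)
    elevenmer = fit_context_model(train, left=5, right=5)
    print(dinucleotide("ACGT", 1))
    print(elevenmer("ACGTACGTACGT", 5))
```

To keep the comparison fair, the baselines should be fitted on sequences that do not include the ones being evaluated.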
Additional research questions may include:
- Are high complexity motifs easier to reconstruct?
- Are stronger motifs (Stepwise_R-1>0.1) reconstructed better?
- Would MLM training on aligned sequences (i.e. Whole Genome Alignment) improve predictions?
- ...
These questions will be discussed at later stages of the project.
The code from the current repository can be used to run the mammalian MLM. This is a simplified version of the original repository adapted to the mammalian MLM.
The required dependencies can be installed via conda:
conda env create -f environment.yml
conda activate ML4RG-mlm
The following command can then be used to test the MLM's ability to reconstruct masked nucleotides in human 3'UTR regions:
python main.py --test --fasta Homo_sapiens_3prime_UTR.fa --species_list 240_species.txt \
--output_dir ./test --model_weight MLM_mammals_species_aware_5000_weights
The files Homo_sapiens_3prime_UTR.fa and MLM_mammals_species_aware_5000_weights will be provided to you via a separate link.
Note that inference is significantly faster when running on a GPU.
You may need to examine the original repository to implement the motif-related evaluation, which is necessary to complete the project.