
Intro to ML for neurogenomics

Alan Murphy edited this page Feb 6, 2024 · 14 revisions

General machine learning skills

We recommend that you get familiar with all of the skill sets below:

Coding:

  • While the primary language used within the lab is R, you will need to use Python for ML work
  • You will need to become familiar with deep learning frameworks (PyTorch and possibly TensorFlow/Keras)
    • Some important models in the field (e.g. Enformer) were developed in TensorFlow, so if you will be fine-tuning such a model you may need to use TensorFlow... although a student in the Gagneur lab has reportedly got it working in PyTorch, so that may become an option in the future.
    • However, it is best to focus on learning PyTorch. Most new ML methods now appear in PyTorch first, since that is where most of the field is working. PyTorch gives more control and is just a bit nicer to use (although it has a slightly steeper learning curve).
  • Know how to create and efficiently use data loaders:
    • The dataloader is the component that loads batches of training samples at a time.
    • There are some out-of-the-box dataloaders, but they rarely cover every use case. For example, the one Alan wrote for Enformer Celltyping was over 1,000 lines long.
    • Honestly, the advice here is framework-specific and often very sparse. Here is an okay tutorial for PyTorch - https://www.learnpytorch.io/04_pytorch_custom_datasets/. It is probably better to look at genomic applications to see how dataloaders are used in practice.
  • Know of different model architectures, when to use and how to use them:
    • From U-Nets to attention, new techniques are constantly being applied in genomics. You should know the most popular architectures and how and when to use them.
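
A framework-free sketch of what a custom genomic dataloader does may help make the idea concrete: encode raw sequences on the fly and yield fixed-size batches. In PyTorch you would put the encoding inside `torch.utils.data.Dataset.__getitem__` and let `DataLoader` handle batching and shuffling; the one-hot encoding, sequences and batch size here are illustrative assumptions.

```python
# Minimal, framework-free sketch of a genomic dataloader:
# encode DNA sequences lazily and yield fixed-size batches.

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """One-hot encode a DNA sequence as a list of 4-element rows."""
    out = []
    for base in seq:
        row = [0.0] * 4
        if base in BASES:          # unknown bases (e.g. 'N') stay all-zero
            row[BASES[base]] = 1.0
        out.append(row)
    return out

def batches(sequences, labels, batch_size):
    """Yield (inputs, targets) batches, encoding only when needed."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        yield ([one_hot(s) for s in chunk],
               labels[start:start + batch_size])

seqs = ["ACGT", "TTAN", "GGCC", "ATAT", "CGCG"]
ys = [1, 0, 1, 0, 1]
first_inputs, first_targets = next(batches(seqs, ys, batch_size=2))
```

The lazy encoding per batch is the key design point: for genome-scale data you cannot hold all encoded samples in memory, which is why custom dataloaders in this field grow so large.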

How to efficiently use GPUs for model training:

  • Linux programming skills
  • Distributed training with multiple GPUs & physical data location in relation to the GPUs
  • Learn to use the GPUs on the Imperial HPC
  • Learn to use the GPUs on our private cloud
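
As a rough starting point for the HPC, a GPU batch job on a PBS-style scheduler (which Imperial uses) looks something like the sketch below. The resource syntax, module name, environment name and walltime are all assumptions to check against the current Imperial HPC documentation:

```shell
#!/bin/bash
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1   # one GPU plus CPU/memory; check the current resource syntax
#PBS -l walltime=24:00:00                   # must fit within the HPC's job time limits

cd "$PBS_O_WORKDIR"                         # run from the directory the job was submitted from

module load anaconda3/personal              # module name is an assumption; check `module avail`
source activate my_ml_env                   # hypothetical conda environment name

python train.py                             # your training script
```

You would submit this with `qsub` and monitor it with `qstat`; the same script is what you would adapt when moving Enformer Celltyping code from an interactive notebook to a cluster job.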

Model training monitoring:

  • Weights & Biases is an important tool for monitoring how your training runs are progressing. It works with both PyTorch and TensorFlow, and you should get familiar with its usage.

Once you have progressed with this:

  • Get used to using Enformer Celltyping
    • Clone the repo into your personal folder on the HPC, then work through the install instructions, including creating a conda environment. Run the "using Enformer Celltyping" tutorial first in an interactive Jupyter notebook on the HPC, then take part of the code and submit it as a job to the cluster so you get used to the workflow.
  • Try running the BXD code

Machine learning for genomics

How to pose the problem and correctly create training, validation and test sets that show appropriate performance and avoid data leakage. Jacob Schreiber has a good paper on this: Navigating the pitfalls of applying machine learning in genomics.
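
One common guard against leakage in genomics is to split by chromosome rather than by random row, so that overlapping or correlated loci from the same chromosome never straddle the train/test boundary. A minimal sketch of this (the hold-out chromosome choices and example records are illustrative, not a lab convention):

```python
# Split genomic examples by chromosome so correlated loci from one
# chromosome cannot leak between training, validation and test sets.

VAL_CHROMS = {"chr8"}             # illustrative hold-out choices
TEST_CHROMS = {"chr9", "chrX"}

def split_by_chromosome(examples):
    """examples: list of (chrom, start, label) tuples."""
    train, val, test = [], [], []
    for ex in examples:
        chrom = ex[0]
        if chrom in TEST_CHROMS:
            test.append(ex)
        elif chrom in VAL_CHROMS:
            val.append(ex)
        else:
            train.append(ex)
    return train, val, test

data = [("chr1", 100, 1), ("chr8", 200, 0),
        ("chr9", 300, 1), ("chrX", 400, 0), ("chr2", 500, 1)]
train, val, test = split_by_chromosome(data)
```

A random row-level split over the same data could place near-duplicate windows from one locus in both train and test, inflating reported performance; splitting on a higher-level grouping variable is the general fix.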

Important papers in the field

The Kipoi repository collects trained models in genomics. They also have a seminar series of machine learning in regulatory genomics which is worth signing up for.

It’s worth reading this recent review paper, but note Anshul Kundaje’s comments on it here. An older but well-written review is here: https://www.nature.com/articles/s41576-019-0122-6

There have been a few DREAM challenges, where defined problems were set and people competed at them:

  • ENCODE-DREAM challenge. Only a small fraction of all possible TF–cell type pairs have been profiled. One solution is to build machine learning models trained on the currently available epigenomic data sets and apply them to the remaining missing pairs. The challenge focused on around 20 TFs, each with data from roughly three or four cell types, and teams competed to build the best model for predicting TF binding in cell types the models were not trained on.

  • ENCODE Imputation challenge. The goal was a systematic evaluation and benchmarking of computational methods for imputing biochemical data associated with functional genomic elements produced by various types of genomics assays. The challenge was carried out in parallel with ENCODE’s ongoing data generation efforts, allowing truly prospective validation of methods on newly acquired data sets.

Labs in the field

This is a brief set of notes, but some of those who work in the area include:

Access to GPUs

We have access to three main sets of GPUs:

  • Our private cloud with one A100 GPU and one NVIDIA RTX 2080 Ti
  • The Imperial HPC (there is a queue of A10s, and also a queue of A100s in beta testing for which access can be requested)
  • Payam's cluster of 20 A100 GPUs and over 3000 CPU cores

Our recommended approach is:

  • Test changes and small runs on the private cloud GPUs so that you get near-instant feedback
  • If multiple GPUs are needed, switch to submitting the 'full' job on the HPC.
  • The caveat is that the HPC has a fairly short time limit (72 hours), so if you need to train for longer, use the GPUs on the private cloud

While the 20-GPU cluster is powerful, it runs on a Kubernetes system and currently requires you to use Jupyter notebooks, which adds considerable overhead compared to running a Python file. Importantly, the system also resets every night, so the maximum run time is 24 hours. It may still be worth returning to it, and we encourage you to check whether this is still the case.
