We develop a new GAN-based model, EnHiC, to enhance the resolution of Hi-C contact frequency matrices. Specifically, we propose a novel convolutional layer, the Decomposition & Reconstruction Block, which accounts for the non-negative, symmetric property of Hi-C matrices. In our GAN framework, the generator extracts rank-1 features from low-resolution matrices at different scales and predicts the high-resolution matrix via subpixel CNN layers.
Hu, Yangyang, and Wenxiu Ma. "EnHiC: learning fine-resolution Hi-C contact maps using a generative adversarial framework." Bioinformatics 37.Supplement_1 (2021): i272-i279.
- Keep updating the document and cleaning code
- Fix the format in GitHub markdown
- Clean and optimize the model
Anaconda Python
We provide a Conda environment for running EnHiC. Create it from the environment file, which installs all required dependencies:
conda env create -f environment.yaml
Activate the new environment:
conda activate env_EnHiC
As described in the paper, our model requires the input samples to be symmetric.
EnHiC divides the Hi-C matrix into non-overlapping sub-matrices of size $(\frac{n}{2} \times \frac{n}{2})$, and then picks 3 sub-matrices (2 on the diagonal and 1 off the diagonal) to combine into one sub-matrix of size $(n \times n)$. For example, the 2 sub-matrices on the diagonal are $M_{i,i}$ and $M_{j,j}$, and the off-diagonal sub-matrix is $M_{i,j}$, so that all $(n \times n)$ sub-matrices are symmetric. In this section, we select n=400 for EnHiC.
We provide functions in utils/operations.py. For more details, please check the demo test.
operations.divide_pieces_hic(hic_matrix, block_size=400, max_distance=None, save_file=False, pathfile=None)
operations.merge_hic(hic_lists, index_1D_2D, max_distance=None)
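The block combination described above (two diagonal blocks plus one off-diagonal block) can be illustrated with a minimal NumPy sketch. This is not the code in utils/operations.py: the function name `combine_blocks`, its arguments, and the toy matrix are hypothetical, and the sketch assumes the off-diagonal block is mirrored by its transpose to keep the sample symmetric.

```python
import numpy as np

def combine_blocks(hic, i, j, half):
    """Assemble a symmetric (2*half x 2*half) sample from blocks i and j (i < j).

    Hypothetical helper for illustration only.
    """
    si = slice(i * half, (i + 1) * half)
    sj = slice(j * half, (j + 1) * half)
    m_ii = hic[si, si]   # diagonal block i
    m_jj = hic[sj, sj]   # diagonal block j
    m_ij = hic[si, sj]   # off-diagonal block between i and j
    # The lower-left block is the transpose of the upper-right one.
    return np.block([[m_ii, m_ij], [m_ij.T, m_jj]])

rng = np.random.default_rng(0)
a = rng.random((8, 8))
hic = a + a.T                      # symmetric, non-negative toy matrix
sample = combine_blocks(hic, 0, 3, half=2)
```

Because the input matrix is symmetric, both diagonal blocks are symmetric, so the assembled sample is symmetric as well.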
We provide models trained on Rao2014-GM12878-MboI-allreps-filtered.10kb.cool in ./pretrained_model/ for 3 sample sizes (80, 200, 400).
Example
import os
from EnHiC import model
gan_model_weights_path = os.path.join('.', 'pretrained_model', 'gen_model_400', 'gen_weights')
Generator = model.make_generator_model(len_high_size=400, scale=4)
Generator.load_weights(gan_model_weights_path)
Or you can train your own model.
Training
We provide the API function for training in EnHiC/fit.py
def train(train_data, valid_data, len_size, scale, EPOCHS, root_path='./', load_model_dir=None, saved_model_dir=None, log_dir=None, summary=False)
train_data: TensorFlow Dataset of tensors with shape (None, len_size, len_size, 1), e.g. (None, 400, 400, 1)
valid_data: TensorFlow Dataset of tensors with shape (None, len_size, len_size, 1), e.g. (None, 400, 400, 1)
len_size: default: 400. The sample size must be a multiple of 4, e.g. 100, 200, 400.
scale: the resolution enhancement factor, e.g. 40kb->10kb: 4, 100kb->10kb: 10
EPOCHS: number of training epochs, e.g. 300
root_path: the directory in which to save/load the model and logs
load_model_dir: the directory from which to load an existing model to continue training; the default is None, which means a new model is built
saved_model_dir: the directory in which to save the model. If it is None, the model is saved to root_path/saved_model/gen_model_[len_size]/gen_weights and root_path/saved_model/dis_model_[len_size]/dis_weights
log_dir: the directory in which to save logs. If it is None, the logs are saved to root_path/logs
summary: print the summary of model, default is False
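The `len_size` and `scale` constraints listed above can be checked before calling the training function. The sketch below is only an illustration; the helper names `resolution_scale` and `check_len_size` are hypothetical and not part of EnHiC.

```python
def resolution_scale(low_res_bp, high_res_bp):
    """Return the integer enhancement factor between two bin sizes in base pairs."""
    if low_res_bp % high_res_bp != 0:
        raise ValueError("low resolution must be an integer multiple of high resolution")
    return low_res_bp // high_res_bp

def check_len_size(len_size):
    """Raise if the sample size is not a multiple of 4."""
    if len_size % 4 != 0:
        raise ValueError(f"len_size={len_size} is not a multiple of 4")
    return len_size

scale = resolution_scale(40_000, 10_000)   # 40kb -> 10kb
len_size = check_len_size(400)
```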
from EnHiC import fit
fit.train(train_data=train_data, valid_data=valid_data,
len_size=400, scale=4, EPOCHS=300,
root_path='./', load_model_dir=None, saved_model_dir=None, log_dir=None,
summary=True)
Prediction
We provide the API function for prediction in EnHiC/fit.py
def predict(model_path, len_size, scale, ds)
model_path: the directory from which to load an existing generator model, e.g. saved_model/gen_model_[len_size]/gen_weights
len_size: default: 400. The sample size must be a multiple of 4, e.g. 100, 200, 400.
scale: the resolution enhancement factor, e.g. 40kb->10kb: 4, 100kb->10kb: 10
ds: TensorFlow Dataset of tensors with shape (None, len_size, len_size, 1), e.g. (None, 400, 400, 1)
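import os
from EnHiC import fit
len_size = 400
root_dir = './'  # assumed here: the directory that contains saved_model/
mpath = os.path.join(root_dir, 'saved_model', 'gen_model_{}'.format(len_size), 'gen_weights')
fit.predict(model_path=mpath, len_size=len_size, scale=4, ds=ds)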
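The expected input shape (None, len_size, len_size, 1) can be produced from a list of square 2-D samples as in the following NumPy sketch. The helper `to_model_input` and the toy samples are hypothetical and only illustrate the shape convention.

```python
import numpy as np

def to_model_input(samples):
    """Stack 2-D samples into a float32 array of shape (N, len_size, len_size, 1)."""
    batch = np.stack([np.asarray(s, dtype=np.float32) for s in samples], axis=0)
    if batch.ndim != 3 or batch.shape[1] != batch.shape[2]:
        raise ValueError("expected a list of square 2-D matrices of equal size")
    # Add the trailing channel axis expected by convolutional layers.
    return batch[..., np.newaxis]

samples = [np.ones((8, 8)), np.zeros((8, 8))]  # toy stand-ins for Hi-C samples
batch = to_model_input(samples)
```

An array like this could then be wrapped with `tf.data.Dataset.from_tensor_slices(batch).batch(batch_size)`. Whether the functions in fit.py expect an already batched dataset is an assumption to verify against the demo scripts.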
We log training data and visualize it with TensorBoard:
tensorboard --logdir=[path-to]/EnHiC/logs/model/
or
tensorboard --logdir=[path-to]/EnHiC/logs/model/ --port=${port} --host=${node} --samples_per_plugin images=50
We show a demo based on Rao2014-GM12878-MboI-allreps-filtered.10kb.cool (the same dataset as in our paper, around 1.5 GB).
Data preprocessing
The script test_preprocessing.py prepares the dataset for training. If the file doesn't exist, the script will download it from the MIT Hi-C database to [path-to]/EnHiC/data/raw/Rao2014-GM12878-MboI-allreps-filtered.10kb.cool
Then the script calls [path-to]/EnHiC/EnHiC/prepare_data.py to divide the Hi-C matrix into samples of size $(len\_size \times len\_size)$ within the genomic_distance. The samples are saved at
[path-to]/EnHiC/data/ .
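To give a feel for the numbers, the sketch below converts a genomic distance into a number of bins and lists which block pairs would be combined under one possible distance criterion. Both helpers are hypothetical, and the selection rule (block start offsets no further apart than the genomic distance) is an assumption, not necessarily what prepare_data.py does.

```python
def bins_within_distance(genomic_distance_bp, resolution_bp):
    """Number of bins spanned by a genomic distance at a given resolution."""
    return genomic_distance_bp // resolution_bp

def block_pairs(n_blocks, half, max_bins):
    """List block index pairs (i, j), i < j, whose start offsets are within max_bins.

    Assumed criterion for illustration only.
    """
    return [(i, j)
            for i in range(n_blocks)
            for j in range(i + 1, n_blocks)
            if (j - i) * half <= max_bins]

# 2Mb at 10kb resolution spans 200 bins.
max_bins = bins_within_distance(2_000_000, 10_000)
# With len_size=400, each block covers 200 bins.
pairs = block_pairs(n_blocks=4, half=200, max_bins=max_bins)
```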
Usage:
test_preprocessing.py [chromosome] [len_size] [genomic_distance]
chromosome: the index of the chromosome, e.g. 1, 2, 3, ... , 22, X
len_size: default: 400. The sample size must be a multiple of 4, e.g. 100, 200, 400.
genomic_distance: default 2000000 (2Mb)
Example:
(env_EnHiC)>> python test_preprocessing.py 1 400 2000000
(env_EnHiC)>> python test_preprocessing.py 22 400 2000000
Training
The script test_train.py trains the model on the dataset. EPOCHS, BATCH_SIZE and the chromosome lists for training and validation are all configured in the script. It calls fit.train after loading the training data. In the demo, EPOCHS=100, BATCH_SIZE=9, train_chr_list=['22'].
Usage:
test_train.py [len_size] [genomic_distance]
len_size: default: 400. The sample size must be a multiple of 4, e.g. 100, 200, 400.
genomic_distance: default 2000000 (2Mb)
Example:
>> conda activate env_EnHiC
(env_EnHiC)>> python test_preprocessing.py 22 400 2000000
(env_EnHiC)>> python test_train.py 400 2000000
Prediction
The script test_predict.py demonstrates how to enhance low-resolution Hi-C data with EnHiC:
- Load 10kb Hi-C from the cool file
- Downsample 10kb to 40kb
- Divide into samples of size $(len\_size \times len\_size)$ within the $genomic\_distance$
- Predict from the low-resolution Hi-C samples
- Combine the samples back into one matrix
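As an illustration of the downsampling step, one simple way to coarsen a 10kb matrix to 40kb is to sum contacts over non-overlapping 4 x 4 groups of bins. The sketch below assumes that approach; the function `coarsen` is hypothetical, and whether test_predict.py uses this exact operation is not confirmed here.

```python
import numpy as np

def coarsen(matrix, scale):
    """Sum contacts over non-overlapping scale x scale bins (e.g. 10kb -> 40kb for scale=4)."""
    # Drop trailing bins that do not fill a complete block.
    n = matrix.shape[0] - matrix.shape[0] % scale
    m = matrix[:n, :n]
    # Split each axis into (coarse bin, position within coarse bin) and sum within bins.
    return m.reshape(n // scale, scale, n // scale, scale).sum(axis=(1, 3))

hic_10kb = np.ones((10, 10))          # toy matrix standing in for real contacts
hic_40kb = coarsen(hic_10kb, scale=4)
```

Summing preserves the total contact count within the retained region and keeps a symmetric input symmetric.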
Usage:
test_predict.py [chromosome] [len_size] [genomic_distance]
chromosome: the index of chromosome. e.g. 1, 2, 3, ... , 22, X
len_size: default: 400. The sample size must be a multiple of 4, e.g. 100, 200, 400.
genomic_distance: default 2000000 (2Mb)
Example:
>> conda activate env_EnHiC
(env_EnHiC)>> python test_predict.py 22 400 2000000