Methylation mixupmapping
This is a focused cookbook for performing methylation mix-up mapping. It illustrates only part of the capabilities of our software; see the other parts of the wiki for other options.
##Manual contents
- Step 1 - Preparation of methylation data
- Step 2 - Preparation of genotype data
- File Checklist
- Step 3 - MixupMapper
##Downloading the software and reference data
You will need the following software and reference data to perform the QTL analysis.
- The software to process and convert the genotype data, please use version 1.4.9 available here: Genotype Harmonizer.
- The actual QTL mapping software. You can download version 1.2.4D of the software from the FTP, or you can find it here: eqtl-mapping-pipeline-1.2.4D-SNAPSHOT-dist.zip or eqtl-mapping-pipeline-1.2.4D-SNAPSHOT-dist.tar.gz
- The annotation file for the 450K probes, which can be found here.
##Before you start
In this manual we assume that you have basic knowledge of how to work with the Java Virtual Machine (VM) and the command line. If you are not sure about working with the command line, please have a look at our introduction page and/or check the information about the Java VM (including the Java memory options). If you run into any problems with heap space or out-of-memory errors, please supply the `-Xmx` and `-Xms` options, see Java VM information. With each java command we supply the recommended settings.
##Definitions
Throughout the manual, references to different files/folders will be made. Here is an overview of these items:
- The methylation file will be referred to as `traitfile`
- The methylation annotation file will be referred to as `annotationfile`
- The full path of your genotype data will be referred to as `genotypedir`
- The file linking methylation id to genotype id will be referred to as `genotypemethylationcoupling`
- The file containing covariates will be referred to as `covariatefile`

Descriptions of each of these files and their usage are given below, and their formats are described in the data formats section, which can be found in the general QTL mapping documentation.
##Step 1 - Preparation of methylation data
As our software uses a nonparametric test by default, you can use virtually any continuous data as trait values to map a variety of QTL effects. However, the normalization tools currently provided with this package are focused on array-based (Illumina) expression data, preprocessed RNA-seq data (e.g. transcript level quantified data) and (GC)-RMA processed Affymetrix data, and include several steps for the 450K methylation array. Therefore we first need to perform a part of the normalization in a separate program.
Initial processing of the idat files to a methylation matrix can be done using many different methods. The choice of method is yours; we recommend DASEN. Information on our default processing can be found here.
Before mixupmapping we need to normalize the data. For the methylation data we will perform the following steps:
- Quantile normalization
- Probe centering and scaling (Z-transformation): $\text{Methylation}_{\text{Probe},\text{Sample}} = (\text{Methylation}_{\text{Probe},\text{Sample}} - \text{Mean}_{\text{Probe}}) / \text{Std.Dev.}_{\text{Probe}}$
Run these steps using the following command:
java -Xmx15g -Xms15g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile --out outdir --qqnorm --centerscale
After these steps a tab-separated gzipped plaintext file is created with an identical number of rows and columns as the input file. This means these files can subsequently be used during meQTL mapping.
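To make the two normalization steps concrete, here is a minimal sketch in plain Python of quantile normalization followed by probe centering and scaling on a toy matrix (rows are probes, columns are samples). It is only an illustration of the idea, not the pipeline's implementation: the function names are hypothetical, ties are broken by row order instead of being averaged, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a probes x samples matrix.

    Simplification: tied values are ranked by row order, not averaged.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_columns = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all samples, for each rank k.
    rank_means = [mean(col[k] for col in sorted_columns) for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


def center_scale_probes(matrix):
    """Z-transform each row (probe): (value - probe mean) / probe std. dev."""
    scaled = []
    for row in matrix:
        probe_mean = mean(row)
        probe_sd = stdev(row)  # sample std. dev.; an assumption of this sketch
        if probe_sd == 0:
            scaled.append([0.0 for _ in row])  # constant probe: no variation
        else:
            scaled.append([(v - probe_mean) / probe_sd for v in row])
    return scaled


# Toy data: 3 probes measured in 2 samples.
toy = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
normalized = quantile_normalize(toy)
z_scores = center_scale_probes(normalized)
```

After quantile normalization every sample has the same distribution of values, and after centering and scaling every probe has mean 0 across samples.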
###Check your data
Running the general normalization procedure yields a number of files in the directory of your `traitfile`, or in the `outdir` if you specified one. The default procedure will generate file suffixes listed below. Suffixes will be appended in the default order as described above. Selecting multiple normalization methods will add multiple suffixes. Please do not remove or replace ANY intermediate files as they may be used in subsequent steps.
Suffix | Description |
---|---|
QuantileNormalized | Quantile Normalized trait data. |
ProbesCentered | Probes were centered. |
SamplesZTransformed | Samples were Z-transformed. |
##Step 2 - Preparation of genotype data
Our software is able to use both unimputed called genotypes and imputed genotypes with their dosage values. The software, however, requires these files to be in TriTyper format. We provide Genotype Harmonizer to harmonize and convert your genotype files.
###Convert data to the TriTyper file format
For the QTL mapping the data needs to be in TriTyper format. Using GenotypeHarmonizer one can harmonize and convert genotype formats. You can either run this step for each chromosome separately or for all chromosomes at once.
❗ Important: when you are using imputed genotypes, make sure that your input file contains the genotype probabilities or dosages and not just genotype calls. In case of doubt, please contact bonder.m.j @ gmail.com
📌 Note: please make sure to use version 1.4.9 of the Genotype Harmonizer.
java -Xmx10g -Xms10g -jar ./GenotypeHarmonizer.jar -i {locationOfInputData} -I {InputType} -o {Outputlocation} -O TRITYPER -cf 0.95 -hf 0.0001 -mf 0.01 -mrf 0.5 -ip 0
Input arguments: `-i` location of the input data, `-I` input type, `-o` output location, `-O` output type TRITYPER. See the Genotype Harmonization manual for further details on the input flags.
✅ Please upload your log file to the FTP server.
If you ran the GenotypeHarmonizer per chromosome, you now need to merge the per-chromosome TriTyper folders into one TriTyper folder containing data from all chromosomes. The following command merges the individual TriTyper folders.
java -Xmx2g -Xms2g -jar ./eqtl-mapping-pipeline.jar --imputationtool --mode concat --in {folder1;folder2;folder3;ETC} --out {OutputFolder}
Input arguments: `--in` is a semicolon-separated list of input TriTyper folders with information per chromosome, `--out` is the output location of the merged TriTyper data. Note: in case of problems, try putting quotes (`"`) around your list of input folders.
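The quoting advice matters because an unquoted semicolon is treated by most shells as a command separator, so the list would be cut off after the first folder. The sketch below builds the semicolon-separated `--in` value and prints a shell-safe command line. The helper name and the folder names (`tt/chr1`, `tt/chr2`, `tt/merged`) are hypothetical; the flags are the ones from the command above.

```python
import shlex


def build_concat_command(folders, output_folder, jar="./eqtl-mapping-pipeline.jar"):
    """Return the concat command as an argument list (no shell involved)."""
    in_value = ";".join(folders)  # one semicolon-separated argument
    return [
        "java", "-Xmx2g", "-Xms2g", "-jar", jar,
        "--imputationtool", "--mode", "concat",
        "--in", in_value,
        "--out", output_folder,
    ]


# Hypothetical per-chromosome TriTyper folders.
args = build_concat_command(["tt/chr1", "tt/chr2"], "tt/merged")

# shlex.quote wraps the folder list in quotes because it contains ';'.
shell_line = " ".join(shlex.quote(a) for a in args)
print(shell_line)
```

If you start the program from Python instead of a shell, passing `args` as a list to `subprocess.run` avoids the quoting problem altogether, because no shell parses the semicolons.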
Great! You now have your genotype and methylation data ready to go!
##File Checklist
Before continuing, now is a good time to check whether your files are in the correct format and whether you have all the required files ready. Please check the following:
- All required TriTyper files are in your `genotypedir`.
- The methylation files are formatted properly.
- You have an `annotationfile`.
- The identifiers of samples are identical in your methylation and genotype files. If not, you should make a file to link the proper samples. This file is called a `genotypemethylationcoupling` file; the format is described here: Genotype - phenotype coupling. This file is not necessary if the names are already matched.
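The last check in the list can be sketched as follows: compare the sample identifiers from both datasets and, if they differ, write out the known pairs as a coupling file. This is an illustration under assumptions: the function names and sample IDs are hypothetical, and the two-column tab-separated layout (genotype ID, then methylation ID) should be verified against the Genotype - phenotype coupling page before use.

```python
def compare_sample_ids(genotype_ids, methylation_ids):
    """Report which sample IDs are shared and which occur on one side only."""
    genotypes, traits = set(genotype_ids), set(methylation_ids)
    return {
        "shared": sorted(genotypes & traits),
        "genotype_only": sorted(genotypes - traits),
        "methylation_only": sorted(traits - genotypes),
    }


def coupling_lines(pairs):
    """Format (genotype_id, methylation_id) pairs as tab-separated lines.

    Assumed layout: genotype ID first, methylation ID second.
    """
    return [f"{genotype_id}\t{methylation_id}" for genotype_id, methylation_id in pairs]


# Hypothetical IDs: one sample is named differently in the two datasets.
report = compare_sample_ids(["S1", "S2", "G-003"], ["S1", "S2", "M-003"])
lines = coupling_lines([("S1", "S1"), ("S2", "S2"), ("G-003", "M-003")])
```

If `genotype_only` and `methylation_only` are both empty, the names already match and no coupling file is needed.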
##Step 3 - MixupMapper
We have shown in a paper published in Bioinformatics (Westra et al.: MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects), that sample mix-ups often occur in genetical genomics datasets (i.e. datasets with both genotype and methylation data). Therefore, we developed a method called MixupMapper, which is implemented in the QTL Mapping Pipeline. This program performs the following steps:
- First, a cis-meQTL analysis is conducted on the dataset:
  - using a 250 kb window between the SNP and the mid-probe position
  - performing 10 permutations to control the false discovery rate (FDR) at 0.05
- Next, it calculates how well the methylation data matches the genotype data. For details on exactly how this works, please have a look at the paper.
###Preparations/variables
- Note down the full path to your TriTyper genotype data directory. We will refer to this directory as `genotypedir`.
- Determine the full path of the methylation data produced in Step 1, the file ending with "SamplesZTransformed.txt.gz". We will refer to this path as `traitfile`.
- Locate your `annotationfile`. Also note down the `platformidentifier`, which will be GPL13534 for the methylation 450K array.
- (Optional) Locate your `genotypemethylationcoupling` if you have such a file. You can also use this file to test specific combinations of genotype and phenotype individuals.
- Find a location on your hard drive to store the output. We will refer to this directory as `outdir`.
###Commands to be issued
The MixupMapper analysis can be run using the following command:
java -Xmx15g -Xms15g -jar eqtl-mapping-pipeline.jar --mode mixupmapper --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform GPL13534 --inexpannot {annotationfile} --testall (--gte {genotypephenotypecoupling}) 2>&1 | tee {outdir}/mixupmapping.log
If your genotype and/or methylation data include more samples than those matched by sample name, it is beneficial to test all possible combinations. This can be done using the command line switch `--testall`.
If you are running the software in a cluster environment, you can specify the number of threads to use (`nrthreads`) by appending the command line switch `--threads nrthreads` to the command above (nrthreads should be an integer).
Note that the `--gte`, `--threads` and `--snps` flags are optional; `--gte` only applies if you are using a `genotypephenotypecoupling` file (the `genotypemethylationcoupling` file described above).
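As an illustration of how the optional switches combine with the required ones, the sketch below assembles the MixupMapper argument list and only adds `--gte` and `--threads` when a value is given. The function name and the example paths are hypothetical; the flags and memory settings are taken from the command above, and `--snps` is left out because its argument is not described here.

```python
def build_mixupmapper_args(genotypedir, outdir, traitfile, annotationfile,
                           coupling=None, threads=None, test_all=True):
    """Assemble the MixupMapper command as an argument list."""
    args = [
        "java", "-Xmx15g", "-Xms15g", "-jar", "eqtl-mapping-pipeline.jar",
        "--mode", "mixupmapper",
        "--in", genotypedir,
        "--out", outdir,
        "--inexp", traitfile,
        "--inexpplatform", "GPL13534",
        "--inexpannot", annotationfile,
    ]
    if test_all:
        args.append("--testall")
    if coupling is not None:  # only when a coupling file is used
        args += ["--gte", coupling]
    if threads is not None:
        if not isinstance(threads, int) or threads < 1:
            raise ValueError("threads must be a positive integer")
        args += ["--threads", str(threads)]
    return args


# Hypothetical paths, for illustration only.
cmd = build_mixupmapper_args(
    "genotypes/", "mixup_out/",
    "meth.SamplesZTransformed.txt.gz", "annotation450k.txt",
    threads=4,
)
```

The resulting list can be passed to `subprocess.run`, with standard output and standard error redirected to a log file in `outdir` to mirror the `tee` in the command above.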
###Check your data
MixupMapper is a two-stage approach. As such, the default procedure creates two directories in the `outdir` you specified: cis-meQTLs and MixupMapping. The two folders contain different sets of output files, described below.
####cis-meQTLs directory
This directory contains output from a default cis-meQTL mapping approach. We are not interested in this folder itself; its files are only needed to create the actual mixupmapping output, which can be found in the MixupMapping directory.
####MixupMapping directory
File | Description |
---|---|
Heatmap.pdf | Visualization of overall Z-scores per assessed pair of samples. The genotyped samples are plotted on the X-axis, and the methylation samples are plotted on the Y-axis. The brightness of each box corresponds to the height of the overall Z-score, with lower values having brighter colours. Samples are sorted alphabetically on both axes. |
BestMatchPerGenotype.txt | This file shows the best matching trait samples per genotype: the result matrix (MixupMapperScores.txt) is not symmetrical. As such, scanning for the best sample per genotype may yield other results than scanning for the best sample per trait. |
BestMatchPerTrait.txt | This file shows the best matching genotype sample per trait sample. |
MixupMapperScores.txt | A matrix showing the scores per pair of samples (combinations of traits and genotypes). |
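Because the score matrix is not symmetrical, picking the best trait per genotype and the best genotype per trait can disagree. The toy sketch below shows this on a 2 x 2 matrix. It assumes that a lower (more negative) score means a better match, which is consistent with the example output further down, and it uses a nested dictionary rather than the real file layout, which is not described here; all names are hypothetical.

```python
def best_matches(scores):
    """scores[genotype][trait] -> score; a lower score is treated as better."""
    # Best trait sample for every genotype sample (scan along a row).
    per_genotype = {g: min(row, key=row.get) for g, row in scores.items()}

    # Best genotype sample for every trait sample (scan along a column).
    traits = sorted({t for row in scores.values() for t in row})
    per_trait = {}
    for t in traits:
        column = {g: row[t] for g, row in scores.items() if t in row}
        per_trait[t] = min(column, key=column.get)
    return per_genotype, per_trait


toy_scores = {
    "G1": {"T1": -10.0, "T2": -8.0},
    "G2": {"T1": -9.0, "T2": -2.0},
}
per_genotype, per_trait = best_matches(toy_scores)
```

In this toy example both genotypes match trait `T1` best, while both traits match genotype `G1` best, so the two views give different answers for `G2` and `T2`.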
In the BestMatchPerGenotype.txt and BestMatchPerTrait.txt files, you can find the best matching trait sample for each genotyped sample and vice versa:
- 1st column = genotyped sample ID, or trait sample ID dependent on file chosen (see above)
- 2nd column = trait sample originally linked to genotype sample ID in column 1, or genotype sample originally linked with trait sample in column 1
- 3rd column = the MixupMapper Z-score for the link between the samples in column 1 and 2
- 4th column = best matching trait (for BestMatchPerGenotype.txt) or best matching genotype (for BestMatchPerTrait.txt)
- 5th column = the MixupMapper Z-score for the link between the samples in column 1 and 4
- 6th column = whether a mix-up was detected: `true` if the best matching trait or genotype (column 4) differs from the sample in column 2, `false` if they are identical
Example of BestMatchPerGenotype.txt
Trait | OriginalLinkedGenotype | OriginalLinkedGenotypeScore | BestMatchingGenotype | BestMatchingGenotypeScore | Mixup |
---|---|---|---|---|---|
Sample1 | Sample1 | -11.357 | Sample1 | -11.357 | false |
Sample2 | Sample2 | -15.232 | Sample2 | -15.232 | false |
Sample3 | Sample3 | -3.774 | Sample4 | -6.341 | true |
Sample4 | Sample4 | -3.892 | Sample3 | -12.263 | true |
In the example, two mix-ups are identified. If you don't have any mix-ups you are done; otherwise you need to resolve or remove these mix-ups. Please check this separate document for instructions on fixing mix-ups.
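For larger datasets it is easier to filter the flagged rows than to read the file by eye. The following sketch lists every row whose sixth column is `true`, using the six-column layout described above. It assumes the file is tab-separated with one header line; the function name is hypothetical and the data are the example rows from this page.

```python
import csv
import io


def find_mixups(lines):
    """Return (sample, original, original_score, best, best_score) per flagged row."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # skip the header line
    mixups = []
    for row in reader:
        if len(row) < 6:
            continue  # skip empty or incomplete lines
        sample, original, original_score, best, best_score, flag = row[:6]
        if flag.strip().lower() == "true":
            mixups.append((sample, original, float(original_score),
                           best, float(best_score)))
    return mixups


example = io.StringIO(
    "Trait\tOriginalLinkedGenotype\tOriginalLinkedGenotypeScore\t"
    "BestMatchingGenotype\tBestMatchingGenotypeScore\tMixup\n"
    "Sample1\tSample1\t-11.357\tSample1\t-11.357\tfalse\n"
    "Sample2\tSample2\t-15.232\tSample2\t-15.232\tfalse\n"
    "Sample3\tSample3\t-3.774\tSample4\t-6.341\ttrue\n"
    "Sample4\tSample4\t-3.892\tSample3\t-12.263\ttrue\n"
)
mixups = find_mixups(example)
for sample, original, _, best, _ in mixups:
    print(f"{sample}: linked to {original}, best match is {best}")
```

To run it on real output, pass an open file handle instead of the `io.StringIO` object, for example `find_mixups(open(path, newline=""))`, where `path` points to one of the BestMatch files.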
Always rerun the Sample Mix-up Mapper after removing or changing sample IDs, and check that you actually fixed or removed mix-ups instead of creating more!