This project is currently at an experimental stage and has not been validated by an experienced geneticist. Please use it at your own risk.
With the advent of stricter data privacy laws in many jurisdictions, some researchers can no longer use the Michigan Imputation Server to phase and impute genotype data. This project allows you to run the open source code behind the server on your local workstation or high-performance compute cluster, together with the 1000 Genomes Phase 3 v5 reference and the rest of the ENIGMA Imputation Protocol.
This document assumes that you have either Singularity or Docker installed on your system.
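If you are unsure which of the two container platforms is available on your system, checking the version is a quick way to find out; whichever command prints a version number is the one you can use.

```
# Either of these will print a version number if the respective tool is installed
singularity --version
docker --version
```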
The container comes with all the software needed to run the imputation protocol. You can either run the protocol manually based on the official instructions, or use the step-by-step guide below.
The guide makes some suggestions for folder names (e.g. `mds`, `raw`, `qc`), but these can also be chosen freely. The only exceptions to this rule are the folders `cloudgene`, `hadoop`, `downloads`, `input` and `output` inside the working directory (at `/data` inside the container).
You need to download the container file using one of the following commands. This will use approximately one gigabyte of storage.
| Container platform | Version | Command |
|---|---|---|
| Singularity | 3.x | `wget http://download.gwas.science/singularity/imputation-protocol-latest.sif` |
| Docker | | `docker pull gwas.science/imputation-protocol:latest` |
You will now need to create a working directory that can be used for intermediate files and outputs. This directory should be empty and should have sufficient space available. We will store the path of the working directory in the variable `working_directory`, and then create the new directory and some subfolders. Usually, you will only need to do this once, as you can re-use the working directory for multiple datasets. Note that this variable will only exist for the duration of your terminal session, so you should re-define it if you exit and then resume later.
```
export working_directory=/mnt/scratch/imputation
mkdir -p -v ${working_directory}/{raw,mds,qc}
```
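Before you continue, it may be worth confirming that the file system behind the working directory has enough free space; the genome reference alone is around 15 GB, and the imputation outputs add more. A quick check on a typical Linux system:

```
# Show free space on the file system that holds the working directory
df -h ${working_directory}
```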
Copy your raw data to the `raw` subfolder of the working directory. If you have multiple `.bed` file sets that you want to process, copy them all.

```
cp -v my_sample.bed my_sample.bim my_sample.fam ${working_directory}/raw
```
Next, start an interactive shell inside the container using one of the following commands.
The `--bind` (Singularity) or `--volume` (Docker) parameters are used to make the working directory available inside the container at the path `/data`. This means that in all subsequent commands, we can use the path `/data` to refer to the working directory.

| Container platform | Command |
|---|---|
| Singularity | `singularity shell --hostname localhost --bind ${working_directory}:/data --bind /tmp imputation-protocol-latest.sif` |
| Docker | `docker run --interactive --tty --volume ${working_directory}:/data gwas.science/imputation-protocol:latest /bin/bash` |
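Once the shell has started, you can verify that the bind mount worked by listing `/data`; you should see the subfolders created earlier.

```
# Inside the container: the working directory is available at /data
ls /data
```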
Inside the container, we will first go to `/data/mds` and run the script `enigma-mds` for your `.bed` file set. The script creates the files `mdsplot.pdf` and `HM3_b37mds2R.mds.csv`, which are summary statistics that you will need to share with your working group as per the ENIGMA Imputation Protocol.

Note that this script will create all output files in the current folder, so you should use `cd` to change to the `/data/mds/sample` folder before running it.

If you have multiple `.bed` file sets, you should run the script in a separate folder for each one. Otherwise the script may overwrite previous results when you run it again.

If you have just one dataset:

```
mkdir /data/mds/sample
cd /data/mds/sample
enigma-mds --bfile /data/raw/my_sample
```

Alternatively, for multiple datasets:

```
mkdir /data/mds/{sample_a,sample_b}

cd /data/mds/sample_a
enigma-mds --bfile /data/raw/sample_a

cd /data/mds/sample_b
enigma-mds --bfile /data/raw/sample_b
```
Next, we will set up our local instance of the Michigan Imputation Server.
The `setup-hadoop` command will start a Hadoop instance on your computer, which consists of four background processes. When you are finished processing all your samples, you can stop them with the `stop-hadoop` command. If you are using Docker, these processes will be stopped automatically when you exit the container shell.

The `setup-imputationserver` script will then verify that the Hadoop instance works, and then install the 1000 Genomes Phase 3 v5 genome reference that will be used for imputation (around 15 GB of data, so it may take a while).

If you are resuming analyses in an existing working directory and the Hadoop background processes are no longer running, you should re-run the setup commands. If they are still running, you can skip this step.

```
setup-hadoop --n-cores 8
setup-imputationserver
```
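If you are not sure whether the background processes from a previous session are still running, a simple process listing can tell you. This is a minimal check; the exact process names depend on the Hadoop version bundled in the container.

```
# List any running Hadoop-related processes; a message is printed if none are found
pgrep -af hadoop || echo "no Hadoop processes found"
```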
If you encounter any warnings or messages while running these commands, you should consult with an expert to find out what they mean and if they may be important. However, processing will usually complete without issues, even if some warnings occur.
If something important goes wrong, you will usually see a clear error message that contains the word "error". Please note that if the setup commands take more than an hour to run, that may also indicate that an error has occurred.
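If you prefer to check the output systematically rather than watching it scroll by, you can capture it to a file and search that afterwards. The log file name below is just an example.

```
# Capture the setup output and search it for error messages afterwards
setup-imputationserver 2>&1 | tee setup-imputationserver.log
grep -i "error" setup-imputationserver.log
```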
Next, go to `/data/qc` and run `enigma-qc` for your `.bed` file sets. This will drop any strand-ambiguous SNPs, then screen for low minor allele frequency, missingness and *Hardy-Weinberg equilibrium*, then remove duplicate SNPs (if necessary), and finally convert the data to sorted `.vcf.gz` format for imputation.

The script places intermediate files in the current folder, and the final `.vcf.gz` files in `/data/input/my_sample`, where they can be accessed by the `imputationserver` script in the next step.

Note that this script will create some output files in the current folder, so you should use `cd` to change to the `/data/qc/sample` folder (or similar) before running it.

The input path is hard-coded, because the imputation server is quite strict and expects a directory with just the `.vcf.gz` files and nothing else, so to avoid any problems we create that directory automatically.

If you have just one dataset:

```
mkdir /data/qc/sample
cd /data/qc/sample
enigma-qc --bfile /data/raw/my_sample --study-name my_sample
```

Alternatively, for multiple datasets:

```
mkdir /data/qc/{sample_a,sample_b}

cd /data/qc/sample_a
enigma-qc --bfile /data/raw/sample_a --study-name sample_a

cd /data/qc/sample_b
enigma-qc --bfile /data/raw/sample_b --study-name sample_b
```
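Before moving on, you can confirm that the quality-controlled files have actually arrived in the input directory that the next step reads from. The per-study path below follows the description above; adjust it to your own study name.

```
# The imputation server expects only .vcf.gz files in this directory
ls -lh /data/input/my_sample/*.vcf.gz
```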
Finally, run the `imputationserver` command for the correct sample population. This parameter can be set to `mixed` if the sample has multiple populations or the population is unknown.

If a specific population (not `mixed`) is specified, the imputation server compares the minor allele frequency of each variant in the sample to the population reference using a chi-squared test, and excludes outliers. If the population is `mixed`, this step is skipped. Regardless of the population parameter, the imputation step always uses the entire 1000 Genomes Phase 3 v5 genome reference.

If you see any errors (such as `obvious strand flips detected`), you may need to follow one or more of the workarounds in the troubleshooting section.

```
imputationserver --study-name my_sample --population mixed
```
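For a sample that is known to be of a single ancestry, you would pass that ancestry instead of `mixed`, so that the allele frequency check described above is performed. The population code `eur` below is only an assumed example; consult the help output of `imputationserver` for the codes accepted by your container version.

```
# Example for a homogeneous sample (population code assumed, adjust as needed)
imputationserver --study-name my_sample --population eur
```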
This process will likely take a few hours, and once it has finished for all your `.bed` file sets, you can exit the container using the `exit` command.

All outputs can be found in the working directory created earlier. The quality control report can be found at `${working_directory}/output/my_sample/qcreport/qcreport.html` (only if the population is not `mixed`), and the imputation results at `${working_directory}/output/my_sample/local`. The `.zip` files are encrypted with the password `password`.
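If you want to inspect the imputed data directly, the encrypted result archives can be extracted with any zip tool, for example as sketched below. The loop simply extracts whatever archives the server produced, using the password mentioned above.

```
# Extract all encrypted result archives in the results folder
cd ${working_directory}/output/my_sample/local
for archive in *.zip; do
    unzip -P password "${archive}"
done
```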
To merge all output files into a compact and portable `.zip` archive, the container includes the `make-archive` command. It will create a single output file at `${working_directory}/my_sample.zip` with all output files.

```
make-archive --study-name my_sample
```
Add the `--zstd-compress` option to the command to use a more efficient compression algorithm. This will take longer to run, but will create a smaller output file (around 70% smaller).
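For example, combining the two:

```
# Create the archive with zstd compression (slower, but smaller output)
make-archive --study-name my_sample --zstd-compress
```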
Once you have finished processing all your datasets, you can stop all background processes of the imputation server with the `stop-hadoop` command. Then you can exit the container, move the output files to a safe location, and delete the working directory (in case you need the disk space).

```
stop-hadoop
exit
mv ${working_directory}/my_sample.zip /storage
rm -rI ${working_directory}
```
If your raw data is in PLINK `.ped`/`.map` format, you will need to convert it to a `.bed` file set before you start. Usually, this is very straightforward with the `plink` command.

```
plink --ped my_sample.ped --map my_sample.map --make-bed --out my_sample
```

Error: No chunks passed the QC step. Imputation cannot be started!
This error can happen when your input data uses a different genome reference than the one the imputation server expects (hg19). You can lift your data over to hg19 manually with LiftOver, or use the `check hg19` command that comes with the container, which is based on the RICOPILI command `buigue`.

```
check hg19 --bfile /data/raw/my_sample
```

The command will create a `.bed` file set at `/data/raw/my_sample.hg19` that uses the correct genome reference. You should now use this corrected file set to re-run the `enigma-qc` script, and then retry the `imputationserver` command.
Error: More than 100 obvious strand flips have been detected. Please check strand. Imputation cannot be started!
If the `imputationserver` command fails with this error, you will need to resolve strand flips in your data. To do that automatically, the container comes with the `check flip` command, which is based on the RICOPILI command `checkflip4` and on `check-bim`.

```
check flip --bfile /data/raw/my_sample
```

The command will create a `.bed` file set at `/data/raw/my_sample.check-flip` with all strand flips resolved. You should now use this corrected file set to re-run the `enigma-qc` script, and then retry the `imputationserver` command.
Job execution failed: Velocity could not be initialized!
You likely have bad permissions in your working directory. You can either try to fix them, or start over with a fresh working directory.
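One way to reset the permissions, assuming you own the working directory and nothing else relies on more restrictive settings, is sketched below; if this does not help, starting over with a fresh working directory is the more reliable option.

```
# Give the owner read, write and (for directories) traversal permissions throughout
chmod -R u+rwX ${working_directory}
```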
Warning: At least one VCF allele code violates the official specification; other tools may not accept the file. (Valid codes must either start with a '<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or represent a breakend.)
This message can occur, for example, when your raw data contains indels. You can ignore it, because the Michigan Imputation Server will remove these invalid variants in its quality control step.
If your Hadoop instance does not seem to be accepting new jobs, the fastest way to solve this is to stop and delete the instance, and then re-run the setup.
```
# stop and delete
stop-hadoop
rm -rf /data/hadoop

# re-run setup
setup-hadoop --n-cores 8
setup-imputationserver
```

Please delete the `/data/cloudgene` folder inside the container and run `setup-imputationserver` again.