Skip to content

eQTL mapping analysis cookbook for RNA seq data

urmovosa edited this page Sep 13, 2017 · 42 revisions

This is the cookbook for performing the eQTL analysis using RNA-seq data. This software allows for eQTL mapping using linear models and direct meta-analysis of such data through weighted Z-score analysis. This cookbook illustrates only a part of the capabilities of our software. See for other options the other parts of the wiki.

Questions or suggestions?

You can contact the authors of this manual and software at anniqueclaringbould @ gmail.com, urmo.vosa @ gmail.com, patrickdeelen @ gmail.com, natalia.tsernikova @ gmail.com, bonder.m.j @ gmail.com, westra.harmjan @ gmail.com, dashazhernakova @ gmail.com or lude @ ludesign.nl for questions or suggestions regarding the software or the manual. This manual was written in cooperation with Marjolein Peters, Joyce van Meurs, Annique Claringbould and Pieter Neerincx.

The section describing the raw RNA-seq expression processing steps is standardised to match with the pipeline used in BBMRI-BIOS project (http://www.bbmri.nl/on_offer/bios/).

Manual contents

  1. Downloading the software and reference data
  2. Getting an account to the sharing server
  3. Before you start
  4. Step-by-step eQTL analysis

Downloading the software and reference data

You will need three programs to perform the QTL analysis.

  • The PLINK toolcase (v 1.07), to calculate MDS components.
  • The software to process and convert the genotype data, please download Genotype Harmonizer version 1.4.9.
  • The actual QTL mapping software. The latest version for using is eQTLGen for both cis- and trans/PRS-eQTL phase is version 1.2.4F, available on the eQTLGen shared account. To get access to the shared account, follow the steps in Getting an account to the sharing server.

Getting an account to the sharing server

We will use UMCG Genomics Coordination Center .sftp server named: cher-ami.hpc.rug.nl to provide you the files and tools needed for analysis steps, and for you to deliver your result files.

To get an access to the upload server please follow the steps:

  1. Generate the private-public key pair as documented here: http://wiki.gcc.rug.nl/wiki/RequestAccount

  2. Send an e-mail to GCC Helpdesk (Helpdesk dot GCC dot Groningen at gmail dot com) and your FrankeSwertzLab collaborator (anniqueclaringbould @ gmail.com) asking for access to the shared account for downloading the tools as well as a guest account for uploading your results. In this e-mail, please specify the name of your cohort and include your public key. The format is as follows:

     Dear GCC Helpdesk,
    
     I am working in the collaboration with FrankeSwertzLab, for the eQTLGen project with Annique Claringbould as my contact (cc’d). The name of my cohort is [ALSPAC/LIFE/Rotterdam, etc].
    
     Please provide me access to:
     1. a guest account to the cher-ami SFTP server for three months.
     2. the umcg-eqtlgen shared account.
    
     My public key is attached.
    
     Please cc: urmo.vosa@gmail.com and anniqueclaringbould@gmail.com to all the correspondence and provide them access to my guest account.
    
     Best regards,
     [Name]
     [Institution]
    

After getting the account you will gain access to the shared eQTLGen account (umcg-eqtlgen) where we supply files needed for running the cis-eQTL, trans-eQTL and PRS-QTL analyses.

Below you can find a representation o the folder structure for the umcg-eqtlgen account. For these RNA-seq-based analyses, you will need the files within the RNA_seq folder, as well as the files within the PRS_QTL folder.

|--RNA_seq
   |--AnnotationFiles
      |--ProbeAnnotation_STARv2.3.0e_Ensembl71.txt
      |--README
   |--HelpFiles
      |--Preprocessing
         |--genome.fa
         |--Homo_sapiens.GRCh37.71.cut.sorted.gtf
         |--meta-exons_v71_cut_sorted_22-05-14.bed
         |--ucsc_gencode.nochr.refFlat
         |--STARindex
      |--ListOfGeneticRiskFactors_20161121.txt
      |--README
   |--pipeline
      |--eqtl-mapping-pipeline-1.2.4F-SNAPSHOT-dist.tar.gz
   |--scripts
      |--preprocessing
         |--merge_count_files.sh
         |--README
      |--normalization
	 |--normalizeTMM.R
         |--README
      |--conditional_analysis
         |--script_run_cond_analysis.sh
         |--generate_regress_out_scripts.sh
	 |--template.xml
         |--README
|--Affymetrix
|--pipeline
|--cis_eQTL_results
|--trans_eQTL
|--PRS_QTL
   |--SummaryStatisticsDatabase_freeze1b # folder with all the harmonised summary statistic files
   |--SummaryStatisticsDatabase_freeze1b.md5 # please use this to check the integrity of files after download
   |--GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT # folder with PRS calculating tool
   |--README

Download instructions are here: http://wiki.gcc.rug.nl/wiki/DataSharing

If the instructions below do not work for your cluster setup, please consult with the extended manual: https://github.com/molgenis/systemsgenetics/wiki/Using-sharing-server-for-eQTLGen-analyses

        ## one possibility is to use lftp, if this is installed in your environment
        ## use these commands to download data from cher-ami server
        
        # go to the local folder where you want to run analysis:

        cd local/analysis/folder/

        # start lftp

        lftp

        # connect with the shared eQTLGen account:

        lftp :~> open -u umcg-eqtlgen,none -p 22 sftp://cher-ami.hpc.rug.nl

        # download needed folder:
        
        lftp :~> mirror folder_needed_for_analysis

        # exit from the remote server

        lftp :~> exit

You will also get a separate guest account (named umcg-guest[0-9]) that will be used to upload your results files in Step 10. You will only need this account after you have finished your analyses.

Additionally you will need the following reference files:

  • Giant genotype reference file, which can be downloaded here.
  • The annotation files for the expression probes, which can be downloaded from the SFTP. Please contact us if the annotation file for your platform is missing.

Before you start

In this manual we assume that you have basic knowledge on how to work with the java Virtual Machine (VM) and the command line. If you are not sure about working with the command line, please have a look at our introduction page and/or check the information about the Java VM (including the Java memory options). If you run into any problems with heapspace or out of memory errors please supply the -Xmx and -Xms command, see Java VM information. At each java command we supply the recommended settings.

Step-by-step eQTL analysis

This is a step-by-step guide which will guide you through the eQTL mapping process using our software. Our general method for RNA-seq data consists of ten steps described below:

Definitions

Throughout the manual, references to different full paths will be made. Here is an overview of these paths:

  • The expression file will be referred to as traitfile
  • The probe/trait/gene annotation file will be referred to as annotationfile
  • The full path of your genotype data will be referred to as genotypedir
  • The file linking expression sample identifiers to genotype sample identifiers will be referred to as genotypeexpressioncoupling
  • The file containing covariates will be defined as covariatefile

Descriptions of each of these files and their usage is detailed below, and their formats are described in the data formats section which can be found in the general QTL mapping documentation.

Step 1 - Preparation of expression data

Because our software uses a nonparametric test by default, you can use virtually any continuous data as trait values to map a variety of QTL effects. Depending on whether your RNA-seq data has been processed before or not, please continue with Already processed RNA-seq data or Unprocessed RNA-seq data.

Already processed RNA-seq data

Given that your expression data is already quantified, please check if your data adheres to the following criteria:

  • Paired-end sequenced
  • Sample quality control performed (e.g. using fastQC)
  • Adapters removed and quality trimming steps done (e.g. using CLIP and TRIM)
  • Aligned to a masked genome (1000G or 1000G+GoNL) on gene level using STAR with < 8 mismatches
  • Quantified using HTSeq count using the default option that removes reads with multiple mapping regions

If your data is quantified differently, please send an e-mail to anniqueclaringbould @ gmail.com, detailing the differences with the steps above. Together, we will then discuss whether you can perform the analyses using the data as it is.

If your data has gone through these processing steps before, please format your expression data in a simple tab separated text file. The format of this file is described here: [Phenotype file, covariate file] (https://github.com/molgenis/systemsgenetics/wiki/File%20descriptions#phenotype-file-covariate-file).

eQTL mapping will require an annotation of the probes/traits/genes in your traitfile. We store this annotation in a separate file. The format of this file is described here: Probe annotation file.

Unprocessed RNA-seq data

If your data has not been quantified yet, the next steps describe how to process the RNA-seq reads. Assuming the data is paired-end, you should make sure the reads are aligned and quantified on a per-gene basis for successful eQTL mapping.

The next programs will take care of quality control, alignment, and gene-level quantification. All steps are also available together in a pipeline, as described here: https://git.lumc.nl/sasc/bios-ansible/tree/master. Should you need support with using these programs or implementing the pipeline, please send an e-mail to anniqueclaringbould @ gmail.com. We will be happy to assist you in these steps.

In consecutive order, the following programs are used: FastQC, CLIP, TRIM, STAR, SAMtools, HTSeq not complete

QC

Using FASTQC v0.10.1

fastqc -q -t  <FQCCORES> <R1> <R2>

Use common practices for evaluating the quality of your RNAseq samples to remove any bad quality samples.

Adapter removal and quality trimming

Removing adapters by using cutadapt v1.1

python clip 
    -ra <R1>
        -qa <R1.replace('.gz',"_fastqc")>
        -sa <R1_clip>
        -ca <R1+".clipstat">
        -rb <R2>
        -qb <R2.replace('.gz',"_fastqc")>
        -sb <R2_clip>
        -cb <R2+".clipstat">

Quality-based trimming of sequencing reads by using sickle v1.2

sickle pe
    -f <R1_clip>
        -r <R2_clip>
        -o <R1_trim>
        -p <R2_trim>
        -l 25
        -t <QMODE>
        -s singles.fq
        -q 20

Read alignment

Using STAR v2.3.0e

You will need the following files to run STAR using a GoNL-masked genome to align to. These files are provided on the sharing server under RNA_seq/HelpFiles/Preprocessing/.

H.sapiens:
    hg19_gonl_masked:
      reference_fasta: "genome.fa"
      annotation_gtf: "Homo_sapiens.GRCh37.71.cut.sorted.gtf"
      annotation_bed: "meta-exons_v71_cut_sorted_22-05-14.bed"
      annotation_refflat: "ucsc_gencode.nochr.refFlat"
      star:
        genomeDir: "STARindex"

Actual command:

 star
         --outFileNamePrefix <outname>
         --readFilesIn <R1> <R2>
         --readFilesCommand zcat
         --genomeDir <STAR_REF>
         --genomeLoad NoSharedMemory
         --runThreadN <STARCORES>
         --outFilterMultimapNmax 5
         --outFilterMismatchNmax 8
         --outSAMunmapped Within
         --outSAMstrandField intronMotif

SAMtools settings Using v1.2

samtools view -bS -o <bam> <sam>

samtools sort <bam> <sorted_bam>

samtools index <sorted_bam>

samtools flagstat <bam>

BEDtools and PICARDtools settings

Using bedtools-2.17.0/bin/intersectBed:

intersect -abam <bam> -b <bed>

Using picard-tools-1.86/MarkDuplicates.jar:

java -jar <mark>
     I=<+bam>
     O=<+mdup_bam>
     M=<+bam+>.md.txt
     AS=TRUE

Using picard-tools-1.86/CollectRnaSeqMetrics.jar:

java -jar prime_bias
     REF_FLAT= <+REF_FLAT>
     STRAND=NONE
     I=<+bam>
     O=<+bam+>prime_bias.txt
     R=<+GENOME_FA>

Using picard-tools-1.86/CollectInsertSizeMetrics.jar:

java -jar insert_sizes
     I=<+bam>
     O=<+bam+>.isizes.txt
     H=<+bam+>.isizes.pdf

SAMtools settings cont'd Using v1.2

samtools sort -n <bam> <sorted_bam>

samtools index <sorted_bam>

samtools view -h <bam>

Quantify the expression of gene or exon

Using HTSeq v0.6.1p1

python htseq_count -s no -m union -t exon -i gene_id <sam> <gtf>

Merge the quantified expression files into an expression table

You can merge the expression files that have been quantified by HTSeq using merge_count_files.sh, a script you can find in the folder RNA_seq/scripts/preprocessing/.

Step 2 - Preparation of genotype data

Our software is able to use both unimputed called genotypes, as well as imputed genotypes and their dosage values. The software, however, requires these files to be in TriTyper format. We provide Genotype Harmonizer to harmonize and convert your genotype files. However, we first need to calculate MDS components to correct the expression data for population stratification.

Step 2A - Correction for population stratification

You will need to correct the expression data for 4 MDS components during the gene expression normalization step. First we need to get our data into the PLINK format. It is recommend to use the binary PLINK format and you should therefore have these files: mydata.bed, mydata.fam and mydata.bim.

📌 Note: this step should be performed using the nonimputed autosomal SNPs. If you run into problems feel free to contact patrickdeelen @ gmail.com for advice. It is also recommended to use pruned results. Please have a look at the PLINK manual (http://zzz.bwh.harvard.edu/plink/summary.shtml#prune) on how to obtain this.

Because PLINK uses complete linkage agglomerative clustering based on pairwise identity-by-state (IBS) distance, we first need to execute this command:

plink --bfile mydata --genome which creates the file plink.genome.

This file will have as many rows as there are unique pairs of individuals in the sample. For large samples with thousands of individuals, this file will take a considerable time to generate.

To obtain these MDS components, you can use a simple command using the PLINK tool to create a multidimensional scaling matrix. The command line will look like this:

plink --bfile mydata --read-genome plink.genome --cluster --mds-plot 4

This command creates the file plink.mds which contains one row per individual with the following fields:

  • FID - Family ID;
  • IID - Individual ID;
  • SOL - Assigned solution code;
  • C1 - Position on first dimension;
  • C2 - Position on second dimension;
  • C3 - Position on third dimension;
  • C4 - Position on fourth dimension;

You can find more information on PLINK website

✅ Please upload your log file to the sharing server.

Step 2B - Convert data to the TriTyper fileformat.

For the eQTL mapping the data needs to be in TriTyper format. Using GenotypeHarmonizer one can harmonize and convert genotype formats. In this step we will directly harmonize, filter and convert the genotype data. The data is harmonized to be matched to the GIANT release of 1000G. You can either run this step per each chromosome separately or for all chromosomes at once.

Important: If you are using imputed genotypes, make sure to use as an input format a file that contains the probabilities or dosages and not just genotype calls. In case of doubt please contact patrickdeelen @ gmail.com

📌 Note: please make sure to use verion 1.4.6, 1.4.7 or 1.4.9 of the Genotype Harmoinzer.

java -Xmx10g -Xms10g -jar ./GenotypeHarmonizer.jar -i {locationOfInputData} -I {InputType} -o {Outputlocation} -O TRITYPER -r {locationOf1000G-GiantVcfs} --refType VCF_FOLDER --update-id -cf 0.95 -hf 0.0001 -mf 0.01 -mrf 0.5 -ip 0

Input arguments: -i location of the input data (without extension), -I input type, -o output location, -O output type TRITYPER, -r location of the reference data, --refType type of reference data. See the Genotype Harmonization manual for further details on the input flags.

Note: it is not recommended to exclude samples at this step. This can easily be done at a later stage using the genotype to expression coupling file. Having all samples will help resolve possible sample mixups in step 4 more accurately.

✅ Please upload your log files to the sharing server.

If you ran the GenotypeHarmonizer per chromosome, you now need to merge all TriTyper folders per chromosome to one TriTyper folder containig data from all chromosomes. By using the following command you merge the individual TriTyper folders.

java -Xmx2g -Xms2g -jar ./eqtl-mapping-pipeline.jar --imputationtool --mode concat --in {folder1;folder2;folder3;ETC} --out {OutputFolder}

Input arguments:--in is a semicolon-separated list of input TriTyper folders with information per chromosome, --out is the output location of the merged TriTyper data. Note: in case of problems try putting quotes (") around your list of input folders

Great! You now have your genotype and expression data ready to go!

File Checklist

Before continuing, now is a good time to check whether your files are in the correct format and whether you have all the required files ready. Please check the following:

  • All required TriTyper files are in your genotypedir.
  • The expression files are formatted properly.
  • You have an annotationfile (you should have downloaded this from the sharing server at the start).
  • The identifiers of samples are identicial in your expression and genotype files. If not, you should make a file to link the proper samples. This file is called a genotypeexpressioncoupling file, the format is described here: Genotype - phenotype coupling. This file is not necessary if the names are already matched.

Step 3 - Formatting expression data

Step 3A - Removing outliers from expression data

Before performing the actual normalization it is essential that we remove outlier samples. These can easily be detected by performing a PCA on the data without prior centring of the expression values.

java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile --out {tmpoutdir} --qqnorm --logtransform --adjustPCA --maxnrpcaremoved 0 --stepsizepcaremoval 0 2>&1 | tee {tmpoutdir}/outlierDetection.log

Use the command below to extract the first 2 principle components. Please plot these components. You might observe a few clear outliers: please remove these samples from the analysis (see instruction below). In case of doubt feel free to contact patrickdeelen @ gmail.com for support, please provide a plot of the first component.

zcat ***.QuantileNormalized.Log2Transformed.PCAOverSamplesEigenvectors.txt.gz | awk 'BEGIN {OFS="\t"} NR == 1 {print "Sample","Comp1","Comp2"} NR > 1 {print $1,$2,$3}' > pc1_2.txt

You can use the R code below to create a plot of these components and create a file with samples to include.

pc1_2 <- read.delim("pc1_2.txt")

plot(pc1_2$Comp1, pc1_2$Comp2, xlab = "Component 1", ylab = "Component 2")

#
## Please fill in the outlier cut-off below
#
outlierCutoff <- #

pdf("OutlierDetection.pdf")
palette(c("dodgerblue2", "red2"))
plot(pc1_2$Comp1, pc1_2$Comp2, xlab = "Component 1", ylab = "Component 2", col = as.numeric(pc1_2$Comp1 < outlierCutoff)+1, main = paste("Removed", sum(pc1_2$Comp1 < outlierCutoff), "expression outlier samples"))
abline(v=outlierCutoff, col = "red2")
dev.off()

## NB!: please check on which side of the outlierCutoff the outlier samples are:
# If outlier samples are on the left of the cutoff line run:
write.table(pc1_2$Sample[pc1_2$Comp1 >= outlierCutoff], file = "includedExpressionSamples.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

# If outlier samples are on the right of the cutoff line run:
# write.table(pc1_2$Sample[pc1_2$Comp1 <= outlierCutoff], file = "includedExpressionSamples.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

✅ Please upload your log file and the plot of the first two components to the sharing server.

After you uploaded the plot and the log to the cher-ami server with your guest account, it is wise to remove most of the data in the tmpoutdir. The data, apart from the log and the pc1_2.txt, is no longer needed.

Step 3B - Expression data normalization

After removal of low quality samples we go for the actual normalization. Generally, continuous trait data need to be normalized prior to applying statistical testing. Our software package in combination with RNA-seq specific scripts provide different ways of normalizing your data contained in the traitfile.

Our general normalization strategy for RNAseq-based expression data consists of the following steps:

  1. TMM normalization
  2. Removing probes with no variance
  3. Log2 transformation
  4. Probe centering and scaling (Z-transform): (ExpressionProbe,Sample = ExpressionProbe,Sample - MeanProbe) / Std.Dev.Probe
  5. Removal of covariates - Correct gene expression data for the first 4 MDS of the GWAS data - to remove possible population stratification
    • We run a Generalized Linear Model with the probes as dependent variables, and the GWAS PCs as orthogonal covariates. For the remainder of the analysis, we use the residuals of this model.

Did not do this in BIOS --> all Dutch samples. We did remove samples with >3SD heterogeneity score

  1. Principal component adjustment
    • In the Principal Component Analysis, the software tries to convert the normalized and standardized expression results into a set of values of uncorrelated variables called Principal Components (PCs). The number of PCs is equal to the number of samples, and the first PCs explain a higher variance than the last PCs. By adjusting the normalized and standardized expression data for a set of PCs (like we did with the removal of the covariates - use of a Generalized Linear Model), we try to remove batch effects in the data. Note: the PCs will be used in Step 5.

The normalization should be performed only on samples that are not marked as outliers in the previous step. To do this we need a file with the identifiers of the samples to include. Each sample identifier should be on a separate line and no header should be used. If you used the example R script you can use includedExpressionSamples.txt or you can make it based on the first column in the pc1_2.txt file.

  1. Include only samples that are not outliers: java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile --out {outdir} --sampleInclude {includeFile} 2>&1 | tee {outdir}/remove_outliers.log

  2. Use the expression matrix with included samples to run the TMM normalization. You can find the normalization script on the sharing server, in the folder RNA_seq/scripts/normalization/. Before running this script, please ensure that the R package edgeR is installed edgeR documentation. Rscript normalizeTMM.R traitfile.SampleSelection.ProbesWithZeroVarianceRemoved.txt.gz {outdir}/traitfile.SampleSelection.ProbesWithZeroVarianceRemoved.TMM.txt 0.01 2>&1 | tee {outdir}/TMM_normalization.log

  3. Use the newly generated TMM-normalized expression matrix as traitfile in the following command: java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile.SampleSelection.ProbesWithZeroVarianceRemoved.TMM.txt --out {outdir} --logtransform --centerscale --adjustcovariates --cov {covariatefile} --adjustPCA --maxnrpcaremoved 20 --stepsizepcaremoval 2 2>&1 | tee {outdir}/normalization.log

For details on the covariate file see this link. We want to use the PLINK MDS results as confounders for our analysis. Before you can use the PLINK MDS results you need to run this command: awk 'BEGIN { OFS = "\t" } {print $2,$4,$5,$6,$7}' < plink.mds > plinkMds.txt.

❗ Please make sure that the sample identifiers in this file correspond to the sample identifiers in the expression data.

✅ Please upload your log file to the FTP server.

Check your output

Running the general normalization procedure yields a number of files in the outdir that you specified. The default procedure will generate file suffixes listed below. Suffixes will be appended in the default order as described above. Selecting multiple normalization methods will add multiple suffixes. Please do not remove or replace ANY intermediate files from this normalization step as they may be used in subsequent steps.

Suffix Description
TMM TMM Normalized trait data.
ProbesWithZeroVarianceRemoved All probes without any variance are removed.
SampleSelection Data after selection of the samples.
Log2Transformed Log2 transformed data.
ProbesCentered Centered probes.
SamplesZTransformed Z-transformed samples.
CovariatesRemoved Expression data adjusted for covariates.
PCAOverSamplesEigenvalues Eigenvalues created during eigenvalue decomposition of the expression sample correlation matrix (created from the Quantile Normalized, Log2 Transformed, Z-transformed data).
PCAOverSamplesEigenvectors Eigenvectors created during eigenvalue decomposition of the expression sample correlation matrix (created from the Quantile Normalized, Log2 Transformed, Z-transformed data).
PCAOverSamplesEigenvectorsTransposed Eigenvectors transposed.
PCAOverSamplesPrincipalComponents Principal Components describing the sample correlation matrix (created from the Quantile Normalized, Log2 Transformed, Z-transformed data).
[number]PCAsOverSamplesRemoved Expression matrix where this number of PC components have been removed.
25PCAsOverSamplesRemoved Expression matrix where the first 25 PC components have been removed.

Step 4 - MixupMapper

In our paper published in Bioinformatics (Westra et al.: MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects) we have shown that sample mix-ups often occur in genetical genomics datasets (i.e. datasets with both genotype and gene expression data). Therefore, we developed a method called MixupMapper, which is implemented in the QTL Mapping Pipeline. This program performs the following steps:

  1. A cis-eQTL analysis on the dataset:
    • using a 250 kb window between the SNP and the mid-probe position
    • performing 10 permutations to control the false discovery rate (FDR) at 0.05
  2. Calculation of how well the expression data matches the genotype data. For details on this process please consult the paper.

Data checklist

  1. Note down the full path to your TriTyper genotype data directory. We will refer to this directory as genotypedir.
  2. Determine the full path of the expression data produced in step 3, the file ending with CovariatesRemoved.txt.gz. We will refer to this path as traitfile.
  3. Locate your annotationfile. Also note down the platformidentifier, which is written in the first column of the annotationfile. Please contact us if the annotation file for your platform is missing.
  4. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  5. Find a location on your hard drive to store the output. We will refer to this directory as outputdir.

Commands to be issued

The MixupMapper analysis can be run using the following command: java -Xmx10g -Xms10g -XX:StringTableSize=319973 -jar eqtl-mapping-pipeline.jar --mode mixupmapper --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --snps {snpfile} (--testall) (--gte {genotypeexpressioncoupling}) 2>&1 | tee {outdir}/mixupmapping.log

It is more efficient to test a subset of SNPs as testing a full 1000G imputed dataset will take a lot of time. To do this you can use this snpfile and adding this flag: --snps {snpfile} (remember to use the full path).

If your genotypes and/or expression data includes more samples than those which are matched based on sample names, it is beneficial to test all possible combinations. This can be performed using the following command line switch --testall.

If you are running the software in a cluster environment, you can specify the number of threads to use (nrthreads) by appending the command above with the following command line switch --threads nrthreads (nrthreads should be an integer).

Note that the --gte, --threads and --snps flags are optional, --gte only applies if you are using a genotypeexpressioncoupling file.

Check your output

MixupMapper is a two stage approach. Therefore the default procedure creates two directories in the outdir you specified: cis-eQTLs and MixupMapping. Both directories contain a different set of output files described below.

cis-eQTLs directory

This directory contains output from a default cis-eQTL mapping approach. However, we are not interested in this folder, these files are only necessary to create the actual mixupmapping output which can be found in the MixupMapping directory.

MixupMapping directory

File Description
Heatmap.pdf This file contains a visualization of overall Z-scores per assessed pair of samples. The genotyped samples are plotted on the X-axis, and the expression samples are plotted on the Y-axis. The brightness of each box corresponds to the height of the overall Z-score, with lower values having brighter colours. Samples are sorted alphabetically on both axes.
BestMatchPerGenotype.txt This file shows the best matching trait (i.e. expression) samples per genotype: the result matrix (MixupMapperScores.txt) is not symmetrical. Therefore scanning for the best sample per genotype may yield other results than scanning for the best sample per trait.
BestMatchPerTrait.txt This file shows the best matching genotype sample per trait sample.
MixupMapperScores.txt This file contains a matrix showing the scores per pair of samples (combinations of traits and genotypes).

In the BestMatchPerGenotype.txt and BestMatchPerTrait.txt files, you can find the best matching trait sample for each genotyped sample and vice versa:

  • 1st column = genotyped sample ID, or trait sample ID dependent on file chosen (see above)
  • 2nd column = trait sample originally linked to genotype sample ID in column 1, or genotype sample originally linked with trait sample in column 1
  • 3rd column = the MixupMapper Z-score for the link between the samples in column 1 and 2
  • 4th column = best matching trait (for BestMatchPerGenotype.txt) or best matching genotype (for BestMatchPerTrait.txt)
  • 5th column = the MixupMapper Z-score for the link between the samples in column 1 and 4
  • 6th column = this column determines whether the best matching trait or genotype is identical to the sample found in column 2

Example of BestMatchPerGenotype.txt

Trait   OriginalLinkedGenotype  OriginalLinkedGenotypeScore     BestMatchingGenotype    BestMatchingGenotypeScore       Mixup
Sample1		Sample1		-11.357			Sample1		-11.357	false
Sample2		Sample2		-15.232			Sample2		-15.232	false
Sample3		Sample3		-3.774			Sample4		-6.341	true
Sample4		Sample4		-3.892			Sample3		-12.263	true

In this example there are two mix-ups indentified. If you don't have any mix-ups you are done, otherwise you need to resolve or remove these mix-ups. Please check this separate document for instructions on fixing mix-ups.

Always rerun the sample Mix-up Mapper after removing or changing sample IDs and check if you actually fixed/removed mix-ups instead of creating more!

✅ Please upload your logs file to the sharing server.

Step 5 - Removing principal components

Prior to QTL mapping, you will need to remove the first 20 expression components to increase the power to detect cis- and trans-QTLs. As we have seen that some PCs explain genetic variation, we perform our analysis on data only adjusted for PCs which explain no genetic variation. To identify the PCs that have no genetic association, we perform a QTL mapping on the principal component eigenvector matrix (PCAOverSamplesEigenvectorsTransposed.txt.gz). We call PCs genetically associated when they have a FDR of 0 (thus selecting significantly affected components only).

Data checklist

  1. Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
  2. Determine the full path of the expression data you want to use, the file created in step 3 before PC correction. The file name ends with "*.CovariatesRemoved.txt.gz".
  3. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  4. Find a location on your hard drive to store the output. We will refer to this directory as outputdir.

As with MixupMapper, SNPs will only be tested with a minor allele frequency of > 0.05, a Hardy-Weinberg P-value > 0.001 and a call-rate > 0.95. Examples of these files can be found here: resources

Commands to be issued

Use the following command to run the mapping on the eigenvectors and to subsequently remove the PCs which are not under genetic control:

java -Xmx20g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --metamode nongeneticPcaCorrection --in {TriTyper location} --inexp {traitfile} --gte {gtefile} --out {outdir} --adjustPCA --stepsizepcaremoval 20 --maxnrpcaremoved 20 --nreqtls 500000 2>&1 | tee {outdir}/removeExpressionComponents.log

Note that the --gte switch is required for this step so you do need genotypeexpressioncoupling file. Please make sure any sample mixups are resolved in this file. If the sample identifiers in your expression and genotype data are identical you can simple create one using the command below.

awk 'BEGIN {OFS="\t"}{print $1,$1}' < Individuals.txt > gte.txt

✅ Please upload your log file to the sharing server.

Check your output

After running the nongeneticPcaCorrection command, the outdir will contain the output of a full QTL mapping on the eigenvectors for your dataset. At the location of your input expression data the program has written a new PC corrected expression file, which was corrected for all components but the ones under genetic control (FDR 0). So if the process has finished succesfully there should be a new file which ends with 20PCAsOverSamplesRemoved-GeneticVectorsNotRemoved.txt.gz. This is the traitfile which will be used in the subsequent cis-QTL mapping.

Step 6 - Performing cis-eQTL analysis

Everything should now be set up to perform the cis-eQTL analysis using your RNA-seq gene-level quantified and normalized data.

The primary cis-QTL mapping can be performed using a single command. Standard settings for the cis-eQTL mapping are: HWEP > 0.0001, MAF > 0.01, and Call Rate > 0.95. For cis-eQTL mapping, the maximum distance between the SNP and the middle of the probe is 1,000,000 bp (=1 Mbp). To control for multiple testing, we perform 10 permutations, thereby shuffling sample labels to calculate the false discovery rate (FDR) controlled at 0.05.

Data checklist

  1. Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
  2. Determine the full path of the trait data you want to use, created in step 5. We will refer to this path as traitfile.
  3. Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier. Please contact us if the annotation file for your platform is missing.
  4. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  5. Find a location on your hard drive to store the output. We will refer to this directory as outputdir.

Commands to be issued

The command needed to be issued for this cis-QTL analysis:

java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --cis --binary --perm 10 --maf 0.01 2>&1 | tee cisEQtlMapping.log

If you are running the software in a cluster or a multithreaded environment, you can specify the number of threads to use (nrthreads) by appending the command above with the following command line switch --threads nrthreads (nrthreads should be an integer).

Note that you should apply the --gte switch if you are using a genotypeexpressioncoupling file.

Important: When this step is running successfully, you can already continue with step 7: you do not need the output from this step right away.

QTL Mapping output - Binary mode

This section describes the binary output files generated using the --binary command line switch.

File Description
Dataset.dat Zscore matrix for the data analysis.
Dataset-ColNames.txt.gz Column names for the Zscore matrix.
Dataset-RowNames.txt.gz Row names for the Zscore matrix.
Dataset-PermutationRound-n.dat Zscore matrix for the permuted data analysis
Dataset-PermutationRound-n-ColNames.txt Column names for the permuted Zscore matrix.
Dataset-PermutationRound-n-RowNames.txt Row names for the permuted Zscore matrix.
excludedSNPsBySNPFilter.txt.gz SNPs excluded by the user.
excludedSNPsBySNPProbeCombinationFilter.txt.gz The file with no gene expression probes located within 250.000bp of this SNP position.
ProbeQCLog.txt.gz List of probes excluded from analysis.
SNPQCLog.txt.gz QC information for all tested SNPs, including MAF, HWE-pvalue and call-rate (which is always 1.0 for imputed SNPs).
UsedSettings.txt The settings used for the eQTL mapping.
cisEQtlMapping.log Log file of the mapping.

✅ Please upload all the above mentioned result files to the sharing server

Cis-eQTL mapping completed

🎉 You are done with the cis-eQTL analysis. Please drop a mail to urmo.vosa @ gmail.com and anniqueclaringbould @ gmail.com when your uploads are completed.

Step 7 - Performing conditional cis-eQTL analysis

To maximize our power to identify distal (trans-) effects, we want to regress out all the independent cis-eQTL effects. To do so, we will perform conditional cis-eQTL analysis which enables us to identify secondary/tertiary/quaternary/... cis-eQTLs. This is done by conducting a series of cis-eQTL analyses by regressing each analysis round out all the significant cis-eQTL effects from all the previous rounds. All independent cis-eQTL effects will be regressed out in trans-phase (Steps 8-9).

Please make a separate directory for the conditional analysis (conditional_output_dir). Three necessary files for conditional analysis are available on cher-ami, RNA_seq/scripts/conditional_analysis/. Please copy those to the conditional analysis directory.

File: RNA_seq/scripts/conditional_analysis/template.xml

This file is settings template for cis-eQTL analysis.

File: RNA_seq/scripts/conditional_analysis/generate_regress_out_scripts.sh

This script will automatically generate new scripts and config files if another round of cis-mapping is necessary.

File: RNA_seq/scripts/conditional_analysis/script_run_cond_analysis.sh

This script will check how many conditional cis-eQTL rounds are necessary and will start generate_regress_out_scripts.sh in each new round. For each round, separate output folder is made (output_round*_RegressedOut). The result file (eQTLProbesFDR0.05-ProbeLevel.txt) of last round should contain no significant results any more.

Use this command for running the analysis in conditional analysis directory:

sh script_run_cond_analysis.sh {folder with eqtl-mapping-pipeline.jar} {traitfile} {genotypedir} {conditional_output_dir} {annotationfile} {genotypeexpressioncoupling} 2>&1 | tee {conditional_output_dir}/conditional_cis_mapping.log

Important: Please use the same input files as in previous cis-eQTL mapping.

Important: Please make sure to use this exact order of arguments. The conditional_output_dir should be the directory in which you are running this command. The genotype to expression coupling file is optional: if you do not have one you can submit the job without that argument. Please also ensure that none of the folders have a final slash in the command above.

Important: If you are working with a job scheduler, you may have to adjust the way run.sh files are created or submitted. This can be done by adjusting both bash scripts above. Please contact anniqueclaringbould @ gmail.com if you need any assistance.

These analyses will result in one file with all independent cis-eQTLs that can be found in your dataset. You will later use this file to regress out cis-eQTLs before mapping trans-eQTLs.

When the job is finished, please go to the folder of the last round and check whether there are no results in eQTLsFDR0.05-ProbeLevel.txt.

✅ Please upload the toRegressOut.txt file from the last round of cis mapping ({conditional_output_dir}/output_round{last_round}_RegressedOut/toRegressOut.txt) to the sharing server. This file will be used in Step 8 for regressing out the cis-eQTLs before trans-eQTL mapping.

✅ Please upload the log files created to cond_output_dir: conditional_cis_mapping.log, cond_eQTLs_overview.log and toRegressOut.log

Step 8 - Performing the trans-eQTL analysis

In a single command, the trans-eQTL mapping can be performed. In the trans-eQTL mapping phase we will use the dataset processed by the previous steps of the pipeline.

Standard settings for the trans-eQTL mapping are: HWEP > 0.0001, MAF > 0.01, and Call Rate > 0.95. For trans-eQTL mapping, the minimum distance between the SNP and the middle of the probe equals 5,000,000bp. To control for multiple testing, we perform 10 permutations by shuffling sample labels, to calculate the false discovery rate (FDR) at 0.05.

We have found that removing cis effects greatly enhances the power to detect trans effects. Consequently, our software provides a way to correct your traitfile data for the cis effects that you created in Step 7 - Performing conditional cis-eQTL analysis. Note down the full path of the file containing the cis-eQTL effects to be removed: the toRegressOut.txt file you created and uploaded in the previous step. We will refer to this file as cisqtlfile. The format of this file is identical to the QTL file.

❗️ Important: Removing QTL effects only properly works on imputed genotype data. The program will not remove QTL effects when imputation dosages are not available.

Due to the limitations on the computational power and file sizes we test only preselected list of SNPs in this phase, consisting of EBI GWAS Catalogue and Immunobase (all GWAS P<=5e-8). We will refer to this file as snpsFile.

Data checklist

  1. Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
  2. Determine the full path of the trait data you want to use, created in step 5. We will refer to this path as traitfile.
  3. Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier. Please contact us if the annotation file for your platform is missing.
  4. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  5. Use the results of your conditional cis-eQTL analysis to regress out the cis-eQTL effects. It is located in {conditional_output_dir}/output_round{last_round}_RegressedOut/toRegressOut.txt. We will refer to this file as cisqtlfile.
  6. Use the list of preselected SNPs for trans-eQTL analysis. This file is referred as snpsFile and is available under the umcg-eqtlgen account in the cher-ami server (RNA_seq/HelpFiles/ListOfGeneticRiskFactors_20161012.txt).
  7. Find a location on your hard drive to store the output. We will refer to this directory as outdir.

Commands to be issued

Use the following command to run the trans-eQTL analysis:

java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --perm 10 --maf 0.01 --snps {snpsFile} --regressouteqtls {cisqtlfile} 2>&1 | tee {outdir}/transEQtlMapping.log

❗ While you are waiting for the trans-eQTL analysis to complete, please already proceed with calculating the genetic risk scores in Step 9A.

QTL Mapping output - Binary mode

This section describes the binary output files generated using the --binary command line switch.

File Description
Dataset.dat Zscore matrix for the data analysis.
Dataset-ColNames.txt.gz Column names for the Zscore matrix.
Dataset-RowNames.txt.gz Row names for the Zscore matrix.
Dataset-PermutationRound-n.dat Zscore matrix for the permuted data analysis.
Dataset-PermutationRound-n-ColNames.txt Column names for the permuted Zscore matrix.
Dataset-PermutationRound-n-RowNames.txt Row names for the permuted Zscore matrix.
excludedSNPsBySNPFilter.txt.gz SNPs excluded by the user.
excludedSNPsBySNPProbeCombinationFilter.txt.gz The file with no gene expression probes located within 250.000bp of this SNP position.
ProbeQCLog.txt.gz List of probes excluded from analysis.
SNPQCLog.txt.gz QC information for all tested SNPs, including MAF, HWE-pvalue and call-rate (which is always 1.0 for imputed SNPs).
UsedSettings.txt The settings used for the eQTL mapping.
transEQtlMapping.log Log file of the trans mapping.
check.md5 md5 file Zscore matrix.
check-PermutationRound-n.md5 md5 file for permuted Zscore matrix.

✅ Please upload all the above mentioned result files to the sharing server.

This step produces the file ExpressionData.SampleSelection.TMM.Log2Transformed.ProbesCentered.SamplesZTransformed.CovariatesRemoved.20PCAsOverSamplesRemoved-GeneticVectorsNotRemoved.txt.gz-EQTLEffectsRemoved.txt.gz into the same directory where your input expression data is. From the expression file with the 'EQTLEffectsRemoved'-suffix, the cis-eQTL effect are removed and this is the expression input for the next step.

❗ If this file was not written, please check whether you used the correct version 1.2.4F of QTL mapping pipeline.

Step 9 - Performing the Polygenic Risk Score eQTL analysis

In this part of the pipeline we will calculate polygenic risk scores for all the individuals in your eQTL dataset, using publicly available GWAS summary statistics for traits. Those risk scores we will correlate with expression levels of all genes to identify novel trait-associated "hub" genes and pathways.

Step 9A

The software for calculating genetic risk scores is available under umcg-eqtlgen account in cher-ami server (PRS_QTL/GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT). Please make sure to use v0.1.2c of the software. Usage instructions can be seen by the command: java -jar GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT/GeneticRiskScoreCalculator.jar

NB! This tool is tested and works with Java versions 7 and 8, please make sure that you are using one of those!

The command for PRS calculation:

java -Xmx40g -Xms40g -jar GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT/GeneticRiskScoreCalculator.jar -gi {InputTriTyperFolder} -gt TRITYPER -i {SummaryStatisticsFolder} -o {OutputFolder} -p 5e-8:1e-5:1e-4:1e-3:1e-2 -r 0.1 -w 250000:5000000 -er 6:25000000-35000000 2>&1 | tee {OutputFolder}/PRSCalculation.log

  • NB! If you rerun the job, please remove the output folder constructed by the last job before!

✅ Please upload PRSCalculation.log to the sharing server.

Arguments:

-gi - this is your genotype folder constructed in Step 2 (InputTriTyperFolder), meaning it has to be harmonized to match GIANT 1000G p1v3.

-i - this is folder consisting gzipped GWAS summary statistic files (SummaryStatisticsFolder) and is available under cher-ami eQTLGen account (PRS_QTL/SummaryStatisticsDatabase_freeze1b/). These files are preprocessed to match GIANT 1000G p1v3.

Please note the following! Most of the datasets included were publicly available. The Terms and Conditions, when available on the corresponding web page of respective GWAS consortum, were checked and adhered. However, these files are meant to use in this project only.

Therefore, if you use those summary statistics in any other project than eQTLGen, it is your responsibility to make sure that you:

1. Adhere to the Terms and Conditions indicated on the respective consortium web page.

2. Cite correctly the respective study in all the resulting publications.

-o - output folder of the genetic risk score calculation. This incorporates TT folder which consists scaled risk scores in TriTyper format. This folder will be used as input for Step 9B, analogously to trans-eQTL mapping.

-p 5e-8:1e-5:1e-4:1e-3:1e-2 - this flag specifies the GWAS significance thresholds used for constructing the risk score. For this project we construct five risk scores per trait (GWAS P: 5e-8, 1e-5, 1e-4, 1e-3, 1e-2).

-r 0.1- this flag specifies the R2 threshold for clumping.

-w 250000:5000000 - this flags specifies the clumping strategy. Top SNP is defined by the lowest P-value or the highest absolute effect size if there are several equal P-values. In the first round of clumping all the SNPs in the LD (R2>0.1) are removed from the distance of 250 Kb from the top SNP. Second round of clumping removes all the remaining SNPs in LD (R2>0.1) in the extended distance of 5 Mb.

-er 6:25000000-35000000 - this flag removes the SNPs from HLA region (chr6:25Mb-35Mb) from the calculation of genetic risk scores.

Step 9B

In this step we perform PRS-eQTL analysis on PC-corrected expression matrix.

Data checklist

  1. Note down the full path to your TriTyper PRS data directory, created in step 9A. We will refer to this directory as PRSdir.
  2. Determine the full path of the expression data you want to use, created in step 8. We will refer to this path as traitfile_cis_removed.
  3. Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier. Illumina HT12-V3: "GPL6947", Illumina HT12-V4: "GPL10558", Illumina HT12-V4 WG-DASL: "GPL13938". Please contact us if the annotation file for your platform is missing.
  4. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  5. Find a location on your hard drive to store the output. We will refer to this directory as outputdir.

Commands to be issued

Use the following command to run the PRS-eQTL mapping:

java -jar -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m eqtl-mapping-pipeline-1.2.4F-SNAPSHOT/eqtl-mapping-pipeline.jar --mode metaqtl --in {PRSdir} --out {outdir} --inexp {traitfile_cis_removed} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --hwe 0 --maf 0 --perm 10 2>&1 | tee {outdir}/PRSeQtlMappingAdjPC.log

The output files are named the same as indicated in Step 8 binary output file description.

✅ Please upload all the files in the output directory to the sharing server.

Step 9C

In this step we perform trans-eQTL and PRS-eQTL analyses on expression matrix not corrected for 20 first PCs. We are interested for those results to evaluate the principal component correction effect on the observed eQTLs.

trans-eQTL mapping on PC-uncorrected expression data

For that, we first have to rerun the trans-eQTL analysis using this time the output expression matrix from Step 3 ending with the *.CovariatesRemoved.txt.gz. This is by default in the same folder where PC-corrected expression matrix was written.

Data checklist

  1. Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
  2. Determine the full path of the trait data you want to use, created in step 3. We will refer to this path as traitfile.
  3. Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier, which is written in the first column of the annotationfile. Please contact us if the annotation file for your platform is missing.
  4. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  5. Use the results of your conditional cis-eQTL analysis to regress out the cis-eQTL effects. It is located in {conditional_output_dir}/Regress_cis_eQTLs.txt. We will refer to this file as cisqtlfile.
  6. Use the list of preselected SNPs for the second trans-eQTL analysis. This file is referred as snpsFile and is available under the umcg-eqtlgen account in the cher-ami server (RNA_seq/HelpFiles/ListOfGeneticRiskFactors_20161121.txt).
  7. Find a location on your hard drive to store the output. We will refer to this directory as outdir.

Use the following command to run the trans-eQTL analysis:

java -jar -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --perm 10 --maf 0.01 --snps {snpsFile} --regressouteqtls {cisqtlfile} 2>&1 | tee {outdir}/NonPcTransEQtlMapping.log

✅ Please upload the all the files in outdir, including NonPcTransEQtlMapping.log to the sharing server.

This step produces the file ExpressionData.SampleSelection.QuantileNormalized.Log2Transformed.ProbesCentered.SamplesZTransformed.CovariatesRemoved.txt.gz-EQTLEffectsRemoved.txt.gz into the same directory where your input expression data is. From this expression file, the cis-eQTL effect are removed and this is the expression input for the following command.

PRS-eQTL mapping on PC-uncorrected expression data

We then run again the PRS-eQTL analysis, using the expression matrix not corrected to PCs as an input.

Data checklist

  1. Note down the full path to your TriTyper PRS data directory, created in step 9A. We will refer to this directory as PRSdir.
  2. Determine the full path of the expression data you want to use, created in step 9C. We will refer to this path as traitfile_PCs_not_removed_cis_removed.
  3. Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier. Illumina HT12-V3: "GPL6947", Illumina HT12-V4: "GPL10558", Illumina HT12-V4 WG-DASL: "GPL13938". Please contact us if the annotation file for your platform is missing.
  4. (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
  5. Find a location on your hard drive to store the output. We will refer to this directory as outputdir.

Commands to be issued

Use the following command to run the second PRS-eQTL mapping:

java -jar -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m eqtl-mapping-pipeline-1.2.4F-SNAPSHOT/eqtl-mapping-pipeline.jar --mode metaqtl --in {PRSdir} --out {outdir} --inexp {traitfile_PCs_not_removed_cis_removed} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --hwe 0 --maf 0 --perm 10 2>&1 | tee {outdir}/PRSeQtlMappingNotAdjPC.log

✅ Please upload all the files in the output directory to the sharing server.

Step 10 - Uploading your result files

If you are ready with the analyses, please upload the the data to the server to your guest account acquired at the start of the analysis (named umcg-guest[0-9]). Please upload all the needed output files, .log files and .md5 files from Step 2A, Step 2B, Step 3A, Step 4, Step 5, Step 6, Step 7, Step 8, Step 9A, Step 9B and Step 9C.

If you analysed several datasets, please organise your analysis results into separate folders (e.g. LIFEa1 and LIFEb3, etc.). For PRS-eQTL analysis, please organise your results into subfolders (PRS_eQTL_results_PC_corrected and PRS_eQTL_results_PC_uncorrected).

The example of folder structure is following:

|--LIFEa1
   |--preparation_steps
      |--Step_2
      |--Step_3
      |--Step_4
      |--Step_5
      |--Step_6
   |--cis_eQTL_results
      |--cis_eQTL_results_primary
      |--cis_eQTL_results_conditional
   |--trans_eQTL_results
      |--trans_eQTL_results_PC_corrected
      |--trans_eQTL_results_PC_uncorrected
   |--PRS_eQTL_results
      |--PRS_eQTL_results_PC_corrected
      |--PRS_eQTL_results_PC_uncorrected
|--LIFEb3
   |--preparation_steps
      |--Step_2
      |--Step_3
      |--Step_4
      |--Step_5
      |--Step_6
   |--cis_eQTL_results
      |--cis_eQTL_results_primary
      |--cis_eQTL_results_conditional
   |--trans_eQTL_results
      |--trans_eQTL_results_PC_corrected
      |--trans_eQTL_results_PC_uncorrected
   |--PRS_eQTL_results
      |--PRS_eQTL_results_PC_corrected
      |--PRS_eQTL_results_PC_uncorrected

The upload instructions are here: http://wiki.gcc.rug.nl/wiki/DataSharing

If the instructions below do not work for your cluster setup, please consult with the extended manual: https://github.com/molgenis/systemsgenetics/wiki/Using-sharing-server-for-eQTLGen-analyses

        ## one possibility is to use lftp, if this is installed in your environment
        ## use these commands to upload/download data from cher-ami server
        
        # go to the local folder where data is stored:

        cd local/folder/with/your/data/

        # start lftp

        lftp

        # connect with your guest account:

        lftp :~> open -u [your_guest_accountname],none -p 22 sftp://cher-ami.hpc.rug.nl

        # make remote upload directories and subdirectories for trans-eQTL and PRS-QTL results. e.g:

        lftp :~> mkdir LIFEa1
        lftp :~> mkdir LIFEa1/trans_eQTL_results
        lftp :~> mkdir LIFEa1/PRS_eQTL_results
        lftp :~> mkdir LIFEa1/PRS_eQTL_results/PRS_eQTL_results_PC_corrected
        lftp :~> mkdir LIFEa1/PRS_eQTL_results/PRS_eQTL_results_PC_uncorrected
        
        # go to the folder where you upload the results. e.g:

        lftp :~> cd LIFEa1/trans_eQTL_results         
 
        # use put command for uploading the individual files and/or mirror command for uploading whole directories

        lftp :~> put individual_result_file_in_local_server
        lftp :~> mirror -R result_folder_in_local_server
        
        # exit from the remote server when uploaded all necessary files:

        lftp :~> exit
  1. Please send additional e-mail to anniqueclaringbould @ gmail.com when the data has finished uploading.
  2. We will do the sanity check and you will get the feedback whether the upload of the data succeeded. By default, your access to cher-ami will be active for three months, but it can be extended if needed.

NB! In case of issues with uploading, please contact to GCC Helpdesk with cc: anniqueclaringbould @ gmail.com.

🎉 You are done with the trans-eQTL and PRS-eQTL analysis.

Clone this wiki locally