ArchIE (ARCHaic Introgression Explorer) is a method for reference-free inference of archaic local ancestry. It uses coalescent simulations to train a model to distinguish between archaic and non-archaic windows. The process is as follows:
- run
bash create_training.sh
to simulate training data and calculate features (from the directory) - run
bash create_test.sh
to simulate test data (from the directory) - run
Rscript train.R
to train a logistic regression model (from the directory) - run
python data/calc_stats_window_data.py -s data.snp -i data.ind -a data.geno -r ref.geno -c 1 -b $start -e $end -w 50000 -z 50000 | gzip > output.gz
The last step will calculate features in 50KB windows with a 50KB step size (i.e. nonoverlapping) from $start
to $end
on chromosome 1.
You can find out more about the options for calc_stats_window_data.py
using the -h flag:
python data/calc_stats_window_data.py -h
usage: calc_stats_window_data.py [-h] -s SNP -i IND -a ADMIX [-r REFERENCE] -c
CHROM -b BEGIN -e END -w WINDOW -z STEP
Script to calculate stats for ArchIE. Prints stats to stdout.
optional arguments:
-h, --help show this help message and exit
-s SNP, --snp SNP .snp file with positions (eigenstrat format)
-i IND, --ind IND .ind file of IDS (eigenstrat format)
-a ADMIX, --admix ADMIX
.geno file for admixed/test population (eigenstrat
format)
-r REFERENCE, --reference REFERENCE
.geno file for reference population (eigenstrat
format).
-c CHROM, --chrom CHROM
Chromosome
-b BEGIN, --begin BEGIN
Beginning position
-e END, --end END End position
-w WINDOW, --window WINDOW
Window length (eg. 50000)
-z STEP, --step STEP Step size for the window (eg. 10000)
ArchIE requires the following dependencies:
- Python 3
- Numpy (1.13.0)
- Scipy (0.19.0)
- Scikit Learn (0.19.1)
- R (3.4.0)
Note that other versions may work -- these are the ones I used.
ArchIE uses the eigenstrat format, which contains .snp
(list of polymorphic positions), .geno
(genotypes), and .ind
(individual IDs) files. ArchIE expects phased genotypes. You can convert from VCF format to eigenstrat using Iain Mathieson's genetic data converter. A version of the script is included in the data
folder which preserves phase information during the conversion (which is required to run ArchIE). If you use your own script for this conversion, the output should be in the following order: Ind1-Hap1 Ind1-Hap2 Ind2-Hap1 Ind2-Hap2...
.
To calculate features on your data, use the calc_stats_window_data.py
script in the data folder.
When running ArchIE, you can specify the number of haploid genomes to simulate (n), the window size, or the demographic history (in ms.sh
). You must simulate the same number of haploid genomes as are in your data set. E.g if I have 15 diploid individuals with phased genomes, I need to simulate a dataset with 30 haploid genomes.
We use several features to predict local ancestry. When the number of haplotypes simulated is 100 (the default) this is what the columns describe:
- 1-100 The individual frequency spectrum (sample size dependent)
- 101-201 - the distribution of distances between haplotypes (sample size dependent)
- 202 - Mean of the distribution of distances between haplotypes
- 203 - Variance of the distribution of distances between haplotypes
- 204 - Skew of the distribution of distances between haplotypes
- 205 - Kurtosis of the distribution of distances between haplotypes
- 206 - Minimum distance to the reference population
- 207 - S*
- 208 - Number of SNPs private to the target population
- 209-211 one hot encoded label for [archaic, not archaic, in-between]. 'In-between' is chosen if haplotypes fall between 30-70% archaic. This can be modified in the
calc_stats_ms.py
script - 212 - Proportion of the haplotype that is archaic ranges [0-1]
The order of the haplotypes remains the same if the sample size is changed, but the column numbers will change.
We use a customized version of ms
that allows us to track the ancestry of each SNP. Check around line 218 in ms.c for the modifications. ms
was compiled on Mac OSX, if you run into problems you can recompile it with gcc -o ms ms.c streec.c rand1.c -lm