The 1000 Genomes public dataset is currently available on AWS S3. I used a command-line tool called s3cmd to list the bucket and download the chromosome 22 call set:
s3cmd ls s3://1000genomes/
s3cmd get s3://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz
We will also want the call samples list, which will tell us which population each sample ID belongs to:
s3cmd get s3://1000genomes/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
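To see how the samples are spread across populations, we can tally the panel file. This is only a sketch: count_by_population is a hypothetical helper, and it assumes the panel is tab-separated with a header row and the population code in the second column, which is worth confirming with head before relying on the counts.

```shell
# Hypothetical helper: count samples per population in a panel file.
# Assumption: tab-separated, one header row, population code in column 2.
count_by_population() {
    awk -F'\t' 'NR > 1 && $2 != "" { n[$2]++ } END { for (p in n) print p "\t" n[p] }' "$1" | sort
}

# Only run against the real file if it is present in the current directory.
panel=integrated_call_samples_v3.20130502.ALL.panel
if [ -f "$panel" ]; then
    count_by_population "$panel"
fi
```

Each output line is a population code followed by the number of samples assigned to it.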
We need to unzip our data and get it onto HDFS:
gunzip ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz
./bin/hadoop fs -put /vol0/data/ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf /user/ds/genomics/1kg/vcf/
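Before going further, it can be worth checking that the decompressed file really looks like a VCF. The sketch below uses a hypothetical helper, vcf_summary, and assumes the standard VCF layout: meta lines starting with ##, a single #CHROM column header with nine fixed columns followed by one column per sample, and one tab-separated record per line after that.

```shell
# Hypothetical helper: summarize a VCF by meta lines, sample count, and record count.
vcf_summary() {
    awk -F'\t' '
        /^##/     { meta++; next }                            # meta-information lines
        /^#CHROM/ { samples = (NF > 9) ? NF - 9 : 0; next }   # 9 fixed columns, then one per sample
        NF > 0    { records++ }                               # remaining non-empty lines are variant records
        END { printf "meta=%d samples=%d records=%d\n", meta, samples, records }
    ' "$1"
}

# Only run against the local copy if it exists (path taken from the put command above).
vcf=/vol0/data/ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf
if [ -f "$vcf" ]; then
    vcf_summary "$vcf"
fi
```

The sample count reported here should be consistent with the number of sample IDs we expect to look up in the panel file. The scan reads the whole file once, so it may take a little while on a full chromosome.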
Finally, we convert the VCF on HDFS to ADAM's Parquet format with vcf2adam:
./bin/adam-submit vcf2adam /user/ds/genomics/1kg/vcf/ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf /user/ds/genomics/1kg/parquet/chr22