Create single Zarr for all example chromosomes #1
This works until the very end, when the kernel crashes. I think I'm consuming too much memory? Who knows.

```python
# PyData
import numpy as np
import pandas as pd
import xarray as xr
import dask.dataframe as dd
import dask.array as da
xr.set_options(display_expand_attrs=False, display_expand_data_vars=True);
# Google Cloud
import gcsfs
# sgkit
import sgkit as sg
from sgkit.io import plink
def concat_chrs(chr1, chr2):
    new_ds_dict = {}
    # Concatenate contig_id and increment chr2 variant_contig indexes by chr1.contigs.size
    new_ds_dict['contig_id'] = xr.concat([chr1.contig_id, chr2.contig_id], dim='contigs')
    new_ds_dict['variant_contig'] = xr.concat(
        [chr1.variant_contig, chr2.variant_contig + chr1.contigs.size], dim='variants'
    )
    # Concatenate remaining variant data variables
    data_vars_variants = [
        'call_genotype',
        'call_genotype_mask',
        'variant_allele',
        'variant_id',
        'variant_position',
    ]
    for dv in data_vars_variants:
        new_dv = xr.concat([chr1[dv], chr2[dv]], dim='variants')
        new_dv.data = new_dv.data.rechunk()
        new_ds_dict[dv] = new_dv
    # Copy over sample data variables from chr1
    data_vars_samples = [
        'sample_family_id',
        'sample_id',
        'sample_maternal_id',
        'sample_member_id',
        'sample_paternal_id',
        'sample_phenotype',
        'sample_sex',
    ]
    for dv in data_vars_samples:
        new_ds_dict[dv] = chr1[dv]
    return xr.Dataset(data_vars=new_ds_dict)
# Get a connection to GCS
project = 'sgkit-on-coiled'
token_path = '/home/codespace/.config/gcloud/application_default_credentials.json'
gcs = gcsfs.GCSFileSystem(project=project, token=token_path)
for chr_num in range(1, 23):
    file_base = 'synthetic_small_v1_chr-' + str(chr_num)
    # Files to get from GCS
    files = [
        file_base + '.bed',
        file_base + '.bim',
        file_base + '.fam',
    ]
    # Local directory to put them in
    data_dir = '../data/'
    local_dir = data_dir + file_base
    local_plink_path = local_dir + '/' + file_base  # read_plink adds .bed, .bim, .fam
    # Move PLINK files from GCS to the local fs
    for f in files:
        print('Downloading', f)
        gcs.get('hapnest/example/' + f, local_dir + '/' + f)
    # Read local PLINK file
    new_chr_ds = plink.read_plink(path=local_plink_path)
    # Concatenate new chromosome to single dataset
    if chr_num == 1:
        ds = new_chr_ds
    else:
        ds = concat_chrs(ds, new_chr_ds)

# Write ds to GCS as Zarr
remote_zarr_path = 'gs://hapnest/example/synthetic_small_v1_chr-all.zarr'
ds.to_zarr(gcs.get_mapper(remote_zarr_path), mode='w')  # do I need get_mapper?
```
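On the `get_mapper` question: with reasonably recent xarray/fsspec versions, `to_zarr` can take a `gs://` URL directly and forward credentials via `storage_options`, so the explicit mapper may be unnecessary. A minimal sketch, reusing the `project` and `token_path` defined above (untested here):

```python
# Sketch: let xarray/fsspec build the GCS store from the URL.
# storage_options is passed through to gcsfs; needs a recent xarray.
ds.to_zarr(
    remote_zarr_path,
    mode='w',
    storage_options={'project': project, 'token': token_path},
)
```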
Update: running on a bigger box makes it work! Love that for me.
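If a bigger box ever stops being an option, a lower-memory approach might be to append each chromosome to the Zarr store as it is read, instead of building the full concatenation in memory first. A rough, untested sketch: it skips the `contig_id` bookkeeping that `concat_chrs` does, assumes one contig per input file for the `variant_contig` offset, and uses a hypothetical `read_chr_plink` helper standing in for the download-and-read steps above:

```python
# Sketch: append variants-dimension variables chromosome by chromosome.
variant_vars = [
    'call_genotype', 'call_genotype_mask', 'variant_allele',
    'variant_id', 'variant_position', 'variant_contig',
]
store = gcs.get_mapper(remote_zarr_path)
for chr_num in range(1, 23):
    chr_ds = read_chr_plink(chr_num)  # hypothetical: download + plink.read_plink
    if chr_num == 1:
        chr_ds.to_zarr(store, mode='w')
    else:
        # Offset variant_contig so contig indexes stay globally unique,
        # assuming each per-chromosome file holds exactly one contig.
        chr_ds['variant_contig'] = chr_ds.variant_contig + (chr_num - 1)
        chr_ds[variant_vars].to_zarr(store, mode='a', append_dim='variants')
```

The sample variables and the combined `contig_id` list would still need a final fix-up pass, but peak memory stays at roughly one chromosome.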
I should probably put the sample phenotypes into this dataset as well...
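If the phenotypes live in a separate table, something like this might do it — a minimal sketch, assuming a CSV keyed by sample id (the file name and column names here are made up):

```python
# Hypothetical phenotype table; align rows to the dataset's sample order.
pheno = pd.read_csv('../data/synthetic_small_v1.pheno.csv').set_index('sample_id')
pheno = pheno.reindex(ds.sample_id.values)  # may need decoding if sample_id is bytes
ds['sample_trait'] = ('samples', pheno['trait'].to_numpy())
```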
As a sanity check I looked at the Zarr file size. As a reminder, we have 600 subjects with 6,874,394 variants.

```
$ gsutil du -h -s gs://hapnest/example/synthetic_small_v1_chr-all.zarr
1.9 GiB
```

When I add up the file sizes for HAPNEST, I get 1.19 GiB (I think, I always mess up GB vs GiB). Is this expansion in file size expected?
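Some expansion seems plausible. A back-of-envelope comparison, assuming `call_genotype` is stored as int8 with ploidy 2 (sgkit's usual layout), while PLINK `.bed` packs four genotypes per byte:

```python
n_samples, n_variants, ploidy = 600, 6_874_394, 2

# PLINK .bed: 2 bits per (variant, sample) genotype, padded to whole bytes.
bed_bytes = n_variants * ((n_samples + 3) // 4)

# Zarr call_genotype: one int8 per (variant, sample, ploidy) entry before
# compression, plus a same-shaped boolean call_genotype_mask.
raw_bytes = n_variants * n_samples * ploidy

print(f'{bed_bytes / 2**30:.2f} GiB')  # ~0.96 GiB, near the 1.19 GiB of raw files
print(f'{raw_bytes / 2**30:.2f} GiB')  # ~7.68 GiB uncompressed, per array
```

So ending up larger than the `.bed` files even after compression would not be surprising: `.bed` already packs genotypes into 2 bits, whereas the Zarr arrays start from full bytes (times two, counting the mask) and rely on the compressor to claw that back.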
This code works until the last line when I get: