
Create single Zarr for all example chromosomes #1

Open
hammer opened this issue Jan 6, 2024 · 4 comments
hammer commented Jan 6, 2024

This code works until the last line, where I get:

ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'call_genotype_mask' has incompatible dask chunks: ((111848, 111848, 111848, 111848, 86140, 111848, 111848, 111848, 111848, 111848, 1740, 111848, 111848, 111848, 111848, 26955, 111848, 111848, 111848, 111848, 47147, 111848, 111848, 111848, 82350, 111848, 111848, 111848, 111848, 20278, 111848, 111848, 111848, 64400, 111848, 111848, 111848, 26961, 111848, 111848, 85885, 111848, 111848, 111848, 26017, 111848, 111848, 110092, 111848, 111848, 100354, 111848, 111848, 30582, 111848, 111848, 5840, 111848, 93641, 111848, 100039, 111848, 79972, 111848, 85546, 111848, 51566, 111848, 42140, 109673, 111848, 4676), (600,), (2,)). Consider rechunking using `chunk()`.
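For context on the error: each chromosome's dask arrays end in a short final chunk, so concatenating them along `variants` leaves short chunks in the *interior* of the combined array, which Zarr rejects. A minimal sketch with plain dask arrays (chunk sizes borrowed from the error message):

```python
import dask.array as da

# Two "chromosomes" whose last chunk is short, as dask produces by default
chr1 = da.zeros(111848 + 86140, chunks=111848)  # chunks: (111848, 86140)
chr2 = da.zeros(111848, chunks=111848)          # chunks: (111848,)

combined = da.concatenate([chr1, chr2])
print(combined.chunks)  # ((111848, 86140, 111848),) -- short chunk mid-array

# rechunk() restores uniform chunks (only the final one may be short)
uniform = combined.rechunk(111848)
print(uniform.chunks)   # ((111848, 111848, 86140),)
```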

# Imports assumed from the session (the concat_chrs helper is defined
# in the full listing in the next comment)
import gcsfs
from sgkit.io import plink

# Get a connection to GCS
project = 'sgkit-on-coiled'
token_path = '/home/codespace/.config/gcloud/application_default_credentials.json'
gcs = gcsfs.GCSFileSystem(project=project, token=token_path)

for chr_num in range(1,23):
    file_base = 'synthetic_small_v1_chr-'+str(chr_num)

    # Files to get from GCS
    files = [
        file_base + '.bed',
        file_base + '.bim',
        file_base + '.fam',
    ]

    # Local directory to put them in
    data_dir = '../data/'
    local_dir = data_dir + file_base
    local_plink_path = local_dir+'/'+file_base # read_plink adds .bed, .bim, .fam

    # Move PLINK files from GCS to the local fs 
    for f in files:
        print('Downloading', f)
        gcs.get('hapnest/example/'+f, local_dir+'/'+f)

    # Read local PLINK file
    new_chr_ds = plink.read_plink(path=local_plink_path)

    # Concatenate new chromosome to single dataset
    if chr_num == 1:
        ds = new_chr_ds
    else:
        ds = concat_chrs(ds, new_chr_ds)

# Write ds to GCS as Zarr
remote_zarr_path = 'gs://hapnest/example/synthetic_small_v1_chr-all.zarr'
ds.to_zarr(gcs.get_mapper(remote_zarr_path), mode='w') # do I need get_mapper?

hammer commented Jan 6, 2024

This works until the very end, when the kernel crashes. I suspect I'm consuming too much memory, but who knows.

# PyData
import numpy as np
import pandas as pd
import xarray as xr
import dask.dataframe as dd
import dask.array as da
xr.set_options(display_expand_attrs=False, display_expand_data_vars=True);

# Google Cloud
import gcsfs

# sgkit
import sgkit as sg
from sgkit.io import plink

def concat_chrs(chr1, chr2):
    new_ds_dict = {}
    
    # Concatenate contig_id and increment chr2 variant_contig indexes by chr1.contigs.size
    new_ds_dict['contig_id'] = xr.concat([chr1.contig_id, chr2.contig_id], dim='contigs')
    new_ds_dict['variant_contig'] = xr.concat([chr1.variant_contig, chr2.variant_contig + chr1.contigs.size], dim='variants')

    # Concatenate remaining variant data variables
    data_vars_variants = [
        'call_genotype',
        'call_genotype_mask',
        'variant_allele',
        'variant_id',
        'variant_position',
        ]
    for dv in data_vars_variants:
        new_dv = xr.concat([chr1[dv], chr2[dv]], dim='variants')
        new_dv.data = new_dv.data.rechunk()
        new_ds_dict[dv] = new_dv

    # Copy over sample data variables from chr1
    data_vars_samples = [
        'sample_family_id',
        'sample_id',
        'sample_maternal_id',
        'sample_member_id',
        'sample_paternal_id',
        'sample_phenotype',
        'sample_sex',
    ]
    for dv in data_vars_samples:
        new_ds_dict[dv] = chr1[dv]

    return xr.Dataset(data_vars=new_ds_dict)  

# Get a connection to GCS
project = 'sgkit-on-coiled'
token_path = '/home/codespace/.config/gcloud/application_default_credentials.json'
gcs = gcsfs.GCSFileSystem(project=project, token=token_path)

for chr_num in range(1,23):
    file_base = 'synthetic_small_v1_chr-'+str(chr_num)

    # Files to get from GCS
    files = [
        file_base + '.bed',
        file_base + '.bim',
        file_base + '.fam',
    ]

    # Local directory to put them in
    data_dir = '../data/'
    local_dir = data_dir + file_base
    local_plink_path = local_dir+'/'+file_base # read_plink adds .bed, .bim, .fam

    # Move PLINK files from GCS to the local fs 
    for f in files:
        print('Downloading', f)
        gcs.get('hapnest/example/'+f, local_dir+'/'+f)

    # Read local PLINK file
    new_chr_ds = plink.read_plink(path=local_plink_path)

    # Concatenate new chromosome to single dataset
    if chr_num == 1:
        ds = new_chr_ds
    else:
        ds = concat_chrs(ds, new_chr_ds)

# Write ds to GCS as Zarr
remote_zarr_path = 'gs://hapnest/example/synthetic_small_v1_chr-all.zarr'
ds.to_zarr(gcs.get_mapper(remote_zarr_path), mode='w') # do I need get_mapper?
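The contig re-indexing in `concat_chrs` can be sanity-checked on toy arrays. This sketch uses `sizes["contigs"]` for the offset, which is the dimension-length lookup the helper relies on:

```python
import numpy as np
import xarray as xr

chr1 = xr.Dataset({
    "contig_id": ("contigs", np.array(["1"])),
    "variant_contig": ("variants", np.array([0, 0, 0])),
})
chr2 = xr.Dataset({
    "contig_id": ("contigs", np.array(["2"])),
    "variant_contig": ("variants", np.array([0, 0])),
})

contig_id = xr.concat([chr1.contig_id, chr2.contig_id], dim="contigs")
# Offset chr2's contig indexes so they point past chr1's contigs
variant_contig = xr.concat(
    [chr1.variant_contig, chr2.variant_contig + chr1.sizes["contigs"]],
    dim="variants",
)
print(variant_contig.values)  # [0 0 0 1 1]
```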


hammer commented Jan 6, 2024

Update: running on a bigger box makes it work! Love that for me.


hammer commented Jan 6, 2024

I should probably put the sample phenotypes into this dataset as well...


hammer commented Jan 6, 2024

As a sanity check I looked at the Zarr file size. As a reminder, we have 600 subjects with 6,874,394 variants.

$ gsutil du -h -s gs://hapnest/example/synthetic_small_v1_chr-all.zarr
1.9 GiB

When I add up the file sizes for HAPNEST, I get 1.19 GiB (I think, I always mess up GB vs GiB).

Is this expansion in file size expected?
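A rough accounting, assuming PLINK's 2-bit genotype packing on one side and sgkit's defaults on the other (int8 `call_genotype` plus a one-byte `call_genotype_mask`, both shaped variants × samples × ploidy):

```python
variants, samples, ploidy = 6_874_394, 600, 2

# PLINK .bed packs 4 genotype calls per byte (2 bits each), per variant row
bed_bytes = variants * ((samples + 3) // 4)

# Zarr stores call_genotype (int8) and call_genotype_mask (1 byte each)
raw_zarr_bytes = 2 * variants * samples * ploidy

print(f".bed:     {bed_bytes / 2**30:.2f} GiB")       # ~0.96 GiB
print(f"raw zarr: {raw_zarr_bytes / 2**30:.2f} GiB")  # ~15.4 GiB before compression
```

So 1.9 GiB on disk means compression has squeezed ~15 GiB of raw one-byte-per-value data most of the way back down, but it can't fully match the 2-bit `.bed` encoding, so some expansion over the PLINK files seems expected.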
