
Create single Zarr for all example chromosomes #1

Open
hammer opened this issue Jan 6, 2024 · 4 comments
hammer commented Jan 6, 2024

This code works until the last line, where I get:

ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'call_genotype_mask' has incompatible dask chunks: ((111848, 111848, 111848, 111848, 86140, 111848, 111848, 111848, 111848, 111848, 1740, 111848, 111848, 111848, 111848, 26955, 111848, 111848, 111848, 111848, 47147, 111848, 111848, 111848, 82350, 111848, 111848, 111848, 111848, 20278, 111848, 111848, 111848, 64400, 111848, 111848, 111848, 26961, 111848, 111848, 85885, 111848, 111848, 111848, 26017, 111848, 111848, 110092, 111848, 111848, 100354, 111848, 111848, 30582, 111848, 111848, 5840, 111848, 93641, 111848, 100039, 111848, 79972, 111848, 85546, 111848, 51566, 111848, 42140, 109673, 111848, 4676), (600,), (2,)). Consider rechunking using `chunk()`.
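For context on the error: each chromosome's dask arrays end in a short final chunk, so concatenating them along `variants` leaves short chunks in the *interior* of the combined array, which Zarr rejects. A minimal sketch with plain dask arrays (chunk sizes borrowed from the error message):

```python
import dask.array as da

# Two "chromosomes" whose last chunk is short, as dask produces by default
chr1 = da.zeros(111848 + 86140, chunks=111848)  # chunks: (111848, 86140)
chr2 = da.zeros(111848, chunks=111848)          # chunks: (111848,)

combined = da.concatenate([chr1, chr2])
print(combined.chunks)  # ((111848, 86140, 111848),) -- short chunk mid-array

# rechunk() restores uniform chunks (only the final one may be short)
uniform = combined.rechunk(111848)
print(uniform.chunks)   # ((111848, 111848, 86140),)
```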

# Imports assumed from the session (the concat_chrs helper is defined
# in the full listing in the next comment)
import gcsfs
from sgkit.io import plink

# Get a connection to GCS
project = 'sgkit-on-coiled'
token_path = '/home/codespace/.config/gcloud/application_default_credentials.json'
gcs = gcsfs.GCSFileSystem(project=project, token=token_path)

for chr_num in range(1,23):
    file_base = 'synthetic_small_v1_chr-'+str(chr_num)

    # Files to get from GCS
    files = [
        file_base + '.bed',
        file_base + '.bim',
        file_base + '.fam',
    ]

    # Local directory to put them in
    data_dir = '../data/'
    local_dir = data_dir + file_base
    local_plink_path = local_dir+'/'+file_base # read_plink adds .bed, .bim, .fam

    # Move PLINK files from GCS to the local fs 
    for f in files:
        print('Downloading', f)
        gcs.get('hapnest/example/'+f, local_dir+'/'+f)

    # Read local PLINK file
    new_chr_ds = plink.read_plink(path=local_plink_path)

    # Concatenate new chromosome to single dataset
    if chr_num == 1:
        ds = new_chr_ds
    else:
        ds = concat_chrs(ds, new_chr_ds)

# Write ds to GCS as Zarr
remote_zarr_path = 'gs://hapnest/example/synthetic_small_v1_chr-all.zarr'
ds.to_zarr(gcs.get_mapper(remote_zarr_path), mode='w') # do I need get_mapper?

hammer commented Jan 6, 2024

This works until the very end, when the kernel crashes. I suspect I'm consuming too much memory, but who knows.

# PyData
import numpy as np
import pandas as pd
import xarray as xr
import dask.dataframe as dd
import dask.array as da
xr.set_options(display_expand_attrs=False, display_expand_data_vars=True);

# Google Cloud
import gcsfs

# sgkit
import sgkit as sg
from sgkit.io import plink

def concat_chrs(chr1, chr2):
    new_ds_dict = {}
    
    # Concatenate contig_id and increment chr2 variant_contig indexes by chr1.contigs.size
    new_ds_dict['contig_id'] = xr.concat([chr1.contig_id, chr2.contig_id], dim='contigs')
    new_ds_dict['variant_contig'] = xr.concat([chr1.variant_contig, chr2.variant_contig + chr1.contigs.size], dim='variants')

    # Concatenate remaining variant data variables
    data_vars_variants = [
        'call_genotype',
        'call_genotype_mask',
        'variant_allele',
        'variant_id',
        'variant_position',
        ]
    for dv in data_vars_variants:
        new_dv = xr.concat([chr1[dv], chr2[dv]], dim='variants')
        new_dv.data = new_dv.data.rechunk()
        new_ds_dict[dv] = new_dv

    # Copy over sample data variables from chr1
    data_vars_samples = [
        'sample_family_id',
        'sample_id',
        'sample_maternal_id',
        'sample_member_id',
        'sample_paternal_id',
        'sample_phenotype',
        'sample_sex',
    ]
    for dv in data_vars_samples:
        new_ds_dict[dv] = chr1[dv]

    return xr.Dataset(data_vars=new_ds_dict)  

# Get a connection to GCS
project = 'sgkit-on-coiled'
token_path = '/home/codespace/.config/gcloud/application_default_credentials.json'
gcs = gcsfs.GCSFileSystem(project=project, token=token_path)

for chr_num in range(1,23):
    file_base = 'synthetic_small_v1_chr-'+str(chr_num)

    # Files to get from GCS
    files = [
        file_base + '.bed',
        file_base + '.bim',
        file_base + '.fam',
    ]

    # Local directory to put them in
    data_dir = '../data/'
    local_dir = data_dir + file_base
    local_plink_path = local_dir+'/'+file_base # read_plink adds .bed, .bim, .fam

    # Move PLINK files from GCS to the local fs 
    for f in files:
        print('Downloading', f)
        gcs.get('hapnest/example/'+f, local_dir+'/'+f)

    # Read local PLINK file
    new_chr_ds = plink.read_plink(path=local_plink_path)

    # Concatenate new chromosome to single dataset
    if chr_num == 1:
        ds = new_chr_ds
    else:
        ds = concat_chrs(ds, new_chr_ds)

# Write ds to GCS as Zarr
remote_zarr_path = 'gs://hapnest/example/synthetic_small_v1_chr-all.zarr'
ds.to_zarr(gcs.get_mapper(remote_zarr_path), mode='w') # do I need get_mapper?
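The contig re-indexing in `concat_chrs` can be sanity-checked on toy arrays. This sketch uses `sizes["contigs"]` for the offset, which is the dimension-length lookup the helper relies on:

```python
import numpy as np
import xarray as xr

chr1 = xr.Dataset({
    "contig_id": ("contigs", np.array(["1"])),
    "variant_contig": ("variants", np.array([0, 0, 0])),
})
chr2 = xr.Dataset({
    "contig_id": ("contigs", np.array(["2"])),
    "variant_contig": ("variants", np.array([0, 0])),
})

contig_id = xr.concat([chr1.contig_id, chr2.contig_id], dim="contigs")
# Offset chr2's contig indexes so they point past chr1's contigs
variant_contig = xr.concat(
    [chr1.variant_contig, chr2.variant_contig + chr1.sizes["contigs"]],
    dim="variants",
)
print(variant_contig.values)  # [0 0 0 1 1]
```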


hammer commented Jan 6, 2024

Update: running on a bigger box makes it work! Love that for me.


hammer commented Jan 6, 2024

I should probably put the sample phenotypes into this dataset as well...


hammer commented Jan 6, 2024

As a sanity check I looked at the Zarr file size. As a reminder, we have 600 subjects with 6,874,394 variants.

$ gsutil du -h -s gs://hapnest/example/synthetic_small_v1_chr-all.zarr
1.9 GiB

When I add up the file sizes for HAPNEST, I get 1.19 GiB (I think, I always mess up GB vs GiB).

Is this expansion in file size expected?
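A rough accounting, assuming PLINK's 2-bit genotype packing on one side and sgkit's defaults on the other (int8 `call_genotype` plus a one-byte `call_genotype_mask`, both shaped variants × samples × ploidy):

```python
variants, samples, ploidy = 6_874_394, 600, 2

# PLINK .bed packs 4 genotype calls per byte (2 bits each), per variant row
bed_bytes = variants * ((samples + 3) // 4)

# Zarr stores call_genotype (int8) and call_genotype_mask (1 byte each)
raw_zarr_bytes = 2 * variants * samples * ploidy

print(f".bed:     {bed_bytes / 2**30:.2f} GiB")       # ~0.96 GiB
print(f"raw zarr: {raw_zarr_bytes / 2**30:.2f} GiB")  # ~15.4 GiB before compression
```

So 1.9 GiB on disk means compression has squeezed ~15 GiB of raw one-byte-per-value data most of the way back down, but it can't fully match the 2-bit `.bed` encoding, so some expansion over the PLINK files seems expected.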
