
vcf_to_zarr creates zero-sized first chunk which results in incorrect dtype. #1090

Open
benjeffery opened this issue May 18, 2023 · 8 comments
Labels: bug (Something isn't working), upstream (Used when our build breaks due to upstream changes)

@benjeffery (Collaborator)

@tnguyengel has hit the following error while running vcf_to_zarr with the default arguments:

  File "/home/tnguyen/conda/sgkit_main/lib/python3.10/site-packages/zarr/core.py", line 2168, in _process_for_setitem
    chunk = value.astype(self._dtype, order=self._order, copy=False)
ValueError: could not convert string to float: 'A'

This is because concat_zarrs_optimized uses dtype=float64 to concatenate and convert the variant_allele array, which is because the first temp Zarr chunk has a variant_allele dtype of float64, which in turn is because that first chunk is zero-sized.

I assume this is because the target_chunk_size default of 20 MB is smaller than the VCF header, leaving no sites in the first chunk. I've asked her to try a larger target_chunk_size as a workaround, and will work on a proper fix.
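A minimal illustration of the dtype problem described above, using only NumPy (whose defaults Zarr inherits); this is a sketch of the mechanism, not sgkit's actual code path:

```python
import numpy as np

# An empty chunk carries no values to infer a dtype from, so NumPy
# (and hence a Zarr array created from it) falls back to float64.
empty_chunk = np.empty((0, 4))
print(empty_chunk.dtype)  # float64

# Converting allele strings to that dtype then fails with the same
# error as in the traceback above.
try:
    np.array([["A", "T", "", ""]]).astype(np.float64)
except ValueError as err:
    print(err)  # could not convert string to float: 'A'
```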

@hammer added the bug (Something isn't working) label on May 18, 2023
@benjeffery (Collaborator, Author)

I've been unable to reproduce this with large headers locally, so maybe I'm barking up the wrong tree. I'll have to dig further with the actual problematic VCF; it is restricted to a private cluster, but luckily I have access.

@tomwhite (Collaborator)

I wonder if this is related to pydata/xarray#7328?

@tomwhite (Collaborator)

I can get the wrong dtype by using a VCF with no variants (empty.vcf):

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=0>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

In [1]: import sgkit as sg

In [2]: from sgkit.io.vcf import vcf_to_zarr

In [3]: vcf_to_zarr("empty.vcf", "empty.zarr")

In [4]: sg.load_dataset("empty.zarr").load()
Out[4]: 
<xarray.Dataset>
Dimensions:           (contigs: 1, filters: 1, samples: 0, variants: 0,
                       alleles: 4)
Dimensions without coordinates: contigs, filters, samples, variants, alleles
Data variables:
    contig_id         (contigs) <U1 '0'
    filter_id         (filters) object 'PASS'
    sample_id         (samples) float64 
    variant_allele    (variants, alleles) float64 
    variant_contig    (variants) int8 
    variant_filter    (variants, filters) bool 
    variant_id        (variants) float64 
    variant_id_mask   (variants) bool 
    variant_position  (variants) int32 
    variant_quality   (variants) float32 
Attributes:
    contigs:               ['0']
    filters:               ['PASS']
    max_alt_alleles_seen:  0
    source:                sgkit-0.6.1.dev2+gcc728043
    vcf_header:            ##fileformat=VCFv4.3\n##FILTER=<ID=PASS,Descriptio...
    vcf_zarr_version:      0.2

Note that variant_allele is float64.

The dtype is float64 on the Zarr array too:

In [5]: import zarr

In [6]: zarr.open("empty.zarr/variant_allele")
Out[6]: <zarr.core.Array (0, 4) float64>

@hammer added the upstream (Used when our build breaks due to upstream changes) label on May 19, 2023
@benjeffery (Collaborator, Author)

benjeffery commented May 19, 2023

Ah-ha. So maybe the first chunk just happens to have no variants, as that's where tabix made the cut for a 20 MB chunk? I'll try to reproduce that with the actual VCF.

@tomwhite (Collaborator)

> Ah-ha. So maybe the first chunk just happens to have no variants as that's where the tabix made the cut for a 20MB chunk? I'll try to reproduce that with the actual VCF.

That sounds plausible. Can you post more of the stack trace you are getting? It's hard to see where the problem is occurring in sgkit.

I wonder if we could do a short-term fix in sgkit where we ignore zero-sized arrays in concat_zarrs_optimized (or concatenate_and_rechunk) so there's no problem with the float/string type mismatch?

Longer-term I'd like this to be addressed in Xarray. There's work happening in this area of the code (e.g. pydata/xarray#7654), so we might want to get involved with that.

@benjeffery (Collaborator, Author)

> That sounds plausible. Can you post more of the stacktrace you are getting as it's hard to see where the problem is occurring in sgkit.

The incorrect dtype is set at io/utils.py:109, where the first chunk's dtype is used. The zero-sized chunks would have to be filtered out so they don't get passed to concatenate_and_rechunk at vcfzarr_reader.py:353. I can write up a PR for this.
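A sketch of that filtering step, assuming the per-chunk temp arrays arrive as a list; filter_zero_sized and the example chunks below are hypothetical stand-ins, not sgkit's actual names:

```python
import numpy as np

def filter_zero_sized(arrays):
    """Drop zero-sized arrays so the dtype of the first surviving
    array reflects real data before concatenation."""
    return [a for a in arrays if a.size > 0]

# Hypothetical per-chunk temp arrays: the first chunk is empty
# (hence float64 by default), the second holds allele strings.
chunks = [np.empty((0, 4)), np.array([["A", "T", "", ""]], dtype=object)]
kept = filter_zero_sized(chunks)
print(len(kept), kept[0].dtype)  # 1 object
```

With the empty chunk dropped, the dtype used for concatenation is taken from a chunk that actually contains allele data.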

@tomwhite (Collaborator)

tomwhite commented Jun 5, 2023

An Xarray fix is being worked on at pydata/xarray#7862.

@tomwhite (Collaborator)

pydata/xarray#7862 has been merged, so we can make changes here to take advantage of it.
