
vcf_to_zarr creates zero-sized first chunk which results in incorrect dtype. #1090

Open
benjeffery opened this issue May 18, 2023 · 8 comments
Labels: bug (Something isn't working), upstream (Used when our build breaks due to upstream changes)

@benjeffery (Collaborator)

@tnguyengel has hit the following error while running vcf_to_zarr with the default arguments:

  File "/home/tnguyen/conda/sgkit_main/lib/python3.10/site-packages/zarr/core.py", line 2168, in _process_for_setitem
    chunk = value.astype(self._dtype, order=self._order, copy=False)
ValueError: could not convert string to float: 'A'

This is because concat_zarrs_optimized uses dtype=float64 to concatenate and convert the variant_allele array, which is because the first temp Zarr chunk has a variant_allele dtype of float64, which in turn is because that first chunk is zero-sized.

I assume this is because the target_chunk_size default of 20 MB is smaller than the VCF header, leaving no sites in the first chunk. I've asked her to try a larger target_chunk_size as a workaround, and will work on a proper fix.
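A minimal illustration of the dtype problem described above, using only NumPy (whose defaults Zarr inherits); this is a sketch of the mechanism, not sgkit's actual code path:

```python
import numpy as np

# An empty chunk carries no values to infer a dtype from, so NumPy
# (and hence a Zarr array created from it) falls back to float64.
empty_chunk = np.empty((0, 4))
print(empty_chunk.dtype)  # float64

# Converting allele strings to that dtype then fails with the same
# error as in the traceback above.
try:
    np.array([["A", "T", "", ""]]).astype(np.float64)
except ValueError as err:
    print(err)  # could not convert string to float: 'A'
```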

@hammer added the bug (Something isn't working) label on May 18, 2023
@benjeffery (Collaborator, Author)

I've been unable to reproduce this with large headers locally, so maybe I'm barking up the wrong tree. I'll have to dig further with the actual problematic VCF; it is restricted to a private cluster, but luckily I have access.

@tomwhite (Collaborator)

I wonder if this is related to pydata/xarray#7328?

@tomwhite (Collaborator)

I can get the wrong dtype by using a VCF with no variants (empty.vcf):

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=0>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

In [1]: import sgkit as sg

In [2]: from sgkit.io.vcf import vcf_to_zarr

In [3]: vcf_to_zarr("empty.vcf", "empty.zarr")

In [4]: sg.load_dataset("empty.zarr").load()
Out[4]: 
<xarray.Dataset>
Dimensions:           (contigs: 1, filters: 1, samples: 0, variants: 0,
                       alleles: 4)
Dimensions without coordinates: contigs, filters, samples, variants, alleles
Data variables:
    contig_id         (contigs) <U1 '0'
    filter_id         (filters) object 'PASS'
    sample_id         (samples) float64 
    variant_allele    (variants, alleles) float64 
    variant_contig    (variants) int8 
    variant_filter    (variants, filters) bool 
    variant_id        (variants) float64 
    variant_id_mask   (variants) bool 
    variant_position  (variants) int32 
    variant_quality   (variants) float32 
Attributes:
    contigs:               ['0']
    filters:               ['PASS']
    max_alt_alleles_seen:  0
    source:                sgkit-0.6.1.dev2+gcc728043
    vcf_header:            ##fileformat=VCFv4.3\n##FILTER=<ID=PASS,Descriptio...
    vcf_zarr_version:      0.2

Note that variant_allele is float64.

The dtype is float64 on the Zarr array too:

In [5]: import zarr

In [6]: zarr.open("empty.zarr/variant_allele")
Out[6]: <zarr.core.Array (0, 4) float64>

@hammer added the upstream (Used when our build breaks due to upstream changes) label on May 19, 2023
@benjeffery (Collaborator, Author)

benjeffery commented May 19, 2023

Ah-ha. So maybe the first chunk just happens to have no variants, as that's where tabix made the cut for a 20 MB chunk? I'll try to reproduce that with the actual VCF.

@tomwhite (Collaborator)

> Ah-ha. So maybe the first chunk just happens to have no variants as that's where the tabix made the cut for a 20MB chunk? I'll try to reproduce that with the actual VCF.

That sounds plausible. Can you post more of the stack trace you are getting? It's hard to see where the problem is occurring in sgkit.

I wonder if we could do a short-term fix in sgkit where we ignore zero-sized arrays in concat_zarrs_optimized (or concatenate_and_rechunk) so there's no problem with the float/string type mismatch?

Longer-term I'd like this to be addressed in Xarray. There's work happening in this area of the code (e.g. pydata/xarray#7654), so we might want to get involved with that.

@benjeffery (Collaborator, Author)

> That sounds plausible. Can you post more of the stacktrace you are getting as it's hard to see where the problem is occurring in sgkit.

The incorrect dtype is set at io/utils.py:109, where the first chunk's dtype is used. The zero-sized chunks would have to be filtered out so they don't get passed to concatenate_and_rechunk at vcfzarr_reader.py:353. I can write up a PR for this.
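A sketch of that filtering step, assuming the per-chunk temp arrays arrive as a list; filter_zero_sized and the example chunks below are hypothetical stand-ins, not sgkit's actual names:

```python
import numpy as np

def filter_zero_sized(arrays):
    """Drop zero-sized arrays so the dtype of the first surviving
    array reflects real data before concatenation."""
    return [a for a in arrays if a.size > 0]

# Hypothetical per-chunk temp arrays: the first chunk is empty
# (hence float64 by default), the second holds allele strings.
chunks = [np.empty((0, 4)), np.array([["A", "T", "", ""]], dtype=object)]
kept = filter_zero_sized(chunks)
print(len(kept), kept[0].dtype)  # 1 object
```

With the empty chunk dropped, the dtype used for concatenation is taken from a chunk that actually contains allele data.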

@tomwhite (Collaborator)

tomwhite commented Jun 5, 2023

An Xarray fix is being worked on at pydata/xarray#7862.

@tomwhite (Collaborator)

pydata/xarray#7862 has been merged, so we can make changes here to take advantage of it.
