`vcf_to_zarr` creates zero-sized first chunk which results in incorrect dtype #1090
Comments
I've been unable to recreate this with large headers locally, so maybe I'm barking up the wrong tree. I will have to dig further with the actual problematic VCF. That VCF is restricted to a private cluster, but luckily I have access.
I wonder if this is related to pydata/xarray#7328?
I can get the wrong dtype with an empty VCF:

```python
In [1]: import sgkit as sg

In [2]: from sgkit.io.vcf import vcf_to_zarr

In [3]: vcf_to_zarr("empty.vcf", "empty.zarr")

In [4]: sg.load_dataset("empty.zarr").load()
Out[4]:
<xarray.Dataset>
Dimensions:           (contigs: 1, filters: 1, samples: 0, variants: 0,
                       alleles: 4)
Dimensions without coordinates: contigs, filters, samples, variants, alleles
Data variables:
    contig_id         (contigs) <U1 '0'
    filter_id         (filters) object 'PASS'
    sample_id         (samples) float64
    variant_allele    (variants, alleles) float64
    variant_contig    (variants) int8
    variant_filter    (variants, filters) bool
    variant_id        (variants) float64
    variant_id_mask   (variants) bool
    variant_position  (variants) int32
    variant_quality   (variants) float32
Attributes:
    contigs:               ['0']
    filters:               ['PASS']
    max_alt_alleles_seen:  0
    source:                sgkit-0.6.1.dev2+gcc728043
    vcf_header:            ##fileformat=VCFv4.3\n##FILTER=<ID=PASS,Descriptio...
    vcf_zarr_version:      0.2
```

Note that `variant_allele` has a dtype of `float64`. The underlying Zarr array shows the same:

```python
In [5]: import zarr

In [6]: zarr.open("empty.zarr/variant_allele")
Out[6]: <zarr.core.Array (0, 4) float64>
```
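The `float64` dtypes above are consistent with NumPy's default dtype inference for empty data. A minimal illustration of the failure mode (using only NumPy, not sgkit, so the mechanism shown here is an assumption about the cause rather than sgkit's actual code path):

```python
import numpy as np

# An empty array with no values to inspect defaults to float64.
# If a zero-sized first chunk is used to infer the dtype for the
# whole concatenated array, that float64 "poisons" the result.
empty = np.array([])
print(empty.dtype)  # float64

# By contrast, a chunk that actually contains allele strings
# carries a string dtype.
alleles = np.array([["A", "T", "", ""]])
print(alleles.dtype.kind)  # 'U' (unicode string)
```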
Ah-ha. So maybe the first chunk just happens to have no variants, as that's where tabix made the cut for a 20MB chunk? I'll try to reproduce that with the actual VCF.
That sounds plausible. Can you post more of the stack trace you are getting? It's hard to see where the problem is occurring in sgkit. I wonder if we could do a short-term fix in sgkit where we ignore zero-sized arrays in `concat_zarrs_optimized`.

Longer-term I'd like this to be addressed in Xarray. There's work happening in this area of the code (e.g. pydata/xarray#7654), so we might want to get involved with that.
The incorrect dtype is set at io/utils.py:109, where the first chunk's dtype is used. The zero-sized chunks would have to be filtered out so they don't get passed to `concat_zarrs_optimized`.
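The filtering idea could be sketched as follows. This is a minimal sketch, not sgkit's actual code: `first_nonempty_dtype` is a hypothetical helper, and the fallback behaviour for all-empty input is an assumption.

```python
import numpy as np

def first_nonempty_dtype(chunks, default=np.dtype("float64")):
    """Hypothetical helper: take the dtype from the first chunk that
    actually contains data, skipping zero-sized chunks whose dtype
    carries no information. Falls back to `default` (an assumed
    choice) if every chunk is empty."""
    for chunk in chunks:
        if chunk.size > 0:
            return chunk.dtype
    return default

chunks = [
    np.array([]),                        # zero-sized chunk, dtype float64
    np.array([["A", "T"]], dtype="U1"),  # chunk with real allele data
]
print(first_nonempty_dtype(chunks))  # <U1
```

With a helper like this, the dtype would come from real data rather than from whichever chunk happens to be first.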
Xarray fix being worked on at pydata/xarray#7862 |
pydata/xarray#7862 has been merged, so we can make changes here to take advantage of it. |
@tnguyengel has hit an error while running `vcf_to_zarr` with the default arguments.

This is because `concat_zarrs_optimized` is using `dtype=float64` to concat and convert the `variant_alleles` array. That in turn is because the first temp Zarr chunk has a `variant_allele` dtype of `float64`, and that is because the first temp Zarr chunk is zero-sized.

I assume this happens because the `target_chunk_size` default of `20M` is smaller than the VCF header, leading to no sites being in the first chunk. I have asked her to try a larger `target_chunk_size` as a workaround, and will work on a proper fix.