Issue with vcf_to_zarr()'s .zarr output #785

@LiangdeLI

Description

The .zarr output of vcf_to_zarr() is much smaller than before: for a 201 MB 1000 Genomes chromosome 21 VCF file, the new result is a 50 MB Zarr store, whereas it used to be 321 MB. However, I found some problems with this new compressed version of the .zarr output.

For example, if I simply run the following to load the dataset and write it back out:

import sgkit as sg

ds = sg.load_dataset("/mydata/chr21.zarr")
ds[["variant_id"]].to_zarr("/mydata/output.zarr", compute=True)

This works with the old 321 MB version, but the new 50 MB version raises an error:

/users/Liangde/.local/lib/python3.8/site-packages/xarray/conventions.py:201: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 2031, in to_zarr
    return to_zarr(
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/api.py", line 1414, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/api.py", line 1124, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/zarr.py", line 555, in store
    self.set_variables(
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/zarr.py", line 634, in set_variables
    writer.add(v.data, zarr_array, region)
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in add
    target[region] = source
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1213, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1308, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1599, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1651, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1888, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1893, in _chunk_setitem_nosync
    cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1952, in _process_for_setitem
    return self._encode_chunk(chunk)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 2001, in _encode_chunk
    chunk = f.encode(chunk)
  File "numcodecs/vlen.pyx", line 106, in numcodecs.vlen.VLenUTF8.encode
TypeError: expected unicode string, found 16
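The SerializationWarning at the top suggests a possible workaround: coerce the object-dtype variable to a fixed-size dtype before saving, so xarray never routes it through the VLenUTF8 codec. Below is a minimal NumPy-only sketch of that idea; the array is a hypothetical stand-in for ds["variant_id"].values, using an int sentinel (16, mirroring the value the codec reports) mixed in with string IDs:

```python
import numpy as np

# Hypothetical stand-in for ds["variant_id"].values: an object array where
# one entry is a non-string value (the int 16), which is the kind of element
# numcodecs.vlen.VLenUTF8.encode rejects with "expected unicode string".
variant_id = np.array(["rs0", "rs1", 16], dtype=object)

# Coerce every element to str, then to a fixed-size unicode dtype, as the
# SerializationWarning suggests. A fixed-size "U20" array does not need the
# variable-length string codec at all.
fixed = np.array([str(v) for v in variant_id], dtype="U20")

print(fixed.dtype)   # a fixed-size unicode dtype
print(fixed)         # every element is now a str, including "16"
```

In xarray terms, the equivalent would be something like ds["variant_id"] = ds["variant_id"].astype(str) (or .astype("U20")) before calling to_zarr — this is an assumption about how sgkit stores the field, not a confirmed fix for the underlying size/encoding change.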

Metadata

Labels: bug (Something isn't working)