Investigate ragged arrays to represent alleles #634
Tricky one. I guess an option we should also consider is a home-grown ragged array encoding for alleles (although it would be pretty icky to expose users to this). Could we (in principle) do an encoding ourselves and expose the …
Yes, I think it would be worth trying this out to see if it's possible, and how it works in Xarray.
FWIW we've been doing this in tskit like this. I think the underlying encoding (a data and offset array) is basically the same as what Arrow uses, and it works well in practice.
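The data-and-offsets encoding mentioned above can be sketched with plain NumPy. This is a minimal illustration of the idea, not tskit's or Arrow's actual implementation:

```python
import numpy as np

# Flattened ragged array: three rows of lengths 2, 3, and 1.
# data holds all values concatenated; offsets[i]:offsets[i+1] delimits row i.
data = np.array([10, 11, 20, 21, 22, 30])
offsets = np.array([0, 2, 5, 6])

def row(i):
    """Return the i-th variable-length row as a view into data."""
    return data[offsets[i]:offsets[i + 1]]

print(row(1))            # -> [20 21 22]
print(len(offsets) - 1)  # number of rows -> 3
```

Both arrays are fixed-dtype and one-dimensional, which is what makes this encoding attractive for chunked storage.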
FYI @tomwhite. Following today's "standup", "open sourcing" some of our internal comments about awkward array. These are likely outdated (they were made about a year ago) and might not fully apply in the sgkit context, but they might still be useful.

From @eric-czech: There aren't any docs for awkward-1.0 yet, just empty stubs at https://awkward-array.org, so I looked at the original project instead. Here are a few questions I wanted to answer:
It looks like dask.Bag (or even just delayed) wrapping awkward would make the most sense. I was hoping initially that it would be possible for awkward to act on dask arrays, but it just converts everything to numpy afaict:

```python
import pyarrow, awkward
import dask.array as da

arrow_buffer = pyarrow.ipc.open_file(open("tests/samples/exoplanets.arrow", "rb")).get_batch(0)
stars = awkward.fromarrow(arrow_buffer)
stars['id'] = da.arange(len(stars), chunks=3)
type(stars['id'])  # -> numpy.ndarray
```

That would then mean that we'd have to use the awkward API within partitions, which would be like using Pandas without Dask.DataFrame. Have you found a deeper way to integrate the two @ravwojdyla? This looks promising in 1.0: https://awkward-array.org/how-to-create-lazy-dask.html but I'm not sure what to expect once they add anything there.
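The "use the API within partitions" pattern described above can be sketched without awkward or dask at all. Here plain NumPy stands in for both: the ragged (data, offsets) pair is split into chunks, each chunk is processed independently (what `dask.bag.map_partitions` would do), and the results are combined. The data and chunk boundaries are made up for illustration:

```python
import numpy as np

data = np.arange(12)
offsets = np.array([0, 3, 5, 9, 12])  # four ragged rows

def partition_row_sums(data, offsets):
    # Per-partition work: sum each variable-length row.
    return [data[offsets[i]:offsets[i + 1]].sum() for i in range(len(offsets) - 1)]

# Split rows 0-1 and 2-3 into two "partitions" (offsets re-based per chunk).
part1 = (data[0:5], np.array([0, 3, 5]))
part2 = (data[5:12], np.array([0, 4, 7]))

results = []
for d, o in (part1, part2):
    results.extend(partition_row_sums(d, o))

print(results)  # row sums of the four ragged rows
```

The key constraint is that partition boundaries must fall on row boundaries, so each chunk's offsets can be re-based independently.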
I'm not sure on this one -- it looks like we can get a …

```python
import pyarrow, awkward

arrow_buffer = pyarrow.ipc.open_file(open("tests/samples/exoplanets.arrow", "rb")).get_batch(0)
stars = awkward.fromarrow(arrow_buffer)
stars.layout
```

```
layout
[    ()] Table(dec=layout[0], dist=layout[1], mass=layout[2], name=layout[3], planets=layout[4], ra=layout[5], radius=layout[6])
[     0] ndarray(shape=2935, dtype=dtype('float64'))
[     1] ndarray(shape=2935, dtype=dtype('float64'))
[     2] ndarray(shape=2935, dtype=dtype('float64'))
[     3] StringArray(content=layout[3, 0], generator=<function tostring at 0x11510bd30>, args=(<function decode at 0x1011ae9d0>,), kwargs={})
[  3, 0] JaggedArray(starts=layout[3, 0, 0], stops=layout[3, 0, 1], content=layout[3, 0, 2])
[3, 0, 0] ndarray(shape=2935, dtype=dtype('int32'))
[3, 0, 1] ndarray(shape=2935, dtype=dtype('int32'))
...
```

I imagine there's a function somewhere in the API that would let us get the schema like the one that is saved in HDF5 metadata, but I can't find it.

```python
import h5py, json

f = h5py.File("stars.hdf5", "w")
g = awkward.hdf5(f)
g['stars'] = stars
f.close()

f = h5py.File("stars.hdf5", "r")
json.loads(f["stars"]["schema.json"][:].tostring())['schema']
```

```
{'call': ['awkward', 'Table', 'frompairs'],
 'args': [{'pairs': [['dec',
     {'call': ['awkward', 'numpy', 'frombuffer'],
      'args': [{'read': '1'}, {'dtype': 'float64'}, {'json': 2935, 'id': 2}],
      'id': 1}],
    ['dist',
     {'call': ['awkward', 'numpy', 'frombuffer'],
      'args': [{'read': '3'}, {'dtype': 'float64'}, {'json': 2935, 'id': 4}],
      'id': 3}],
    ['mass',
     {'call': ['awkward', 'numpy', 'frombuffer'],
      'args': [{'read': '5'}, {'dtype': 'float64'}, {'json': 2935, 'id': 6}],
      'id': 5}],
 ...
```
I think this too only appears to be possible (like joins) by using awkward within partitions to either directly create lists of Python objects, or use Pandas as an intermediary, for generating individual dict objects or pandas dataframes that we then process with dask.Bag or dask.DataFrame.

From me: I found a bit more documentation in the jupyter notebooks here. Regarding lazy arrays and Dask integration, I believe they are referring to the … Regarding groupby/join: looking at how reducers were implemented in awkward, I am not sure we would be able to use awkward as a first-class citizen in Dask without, as you are saying, converting down to list/dict/etc.
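The "Pandas as an intermediary" route for groupby/join amounts to exploding the ragged structure into a flat long-format table, which dask.DataFrame can then group and join natively. A hedged sketch with NumPy (the allele values and field names here are illustrative):

```python
import numpy as np

# Ragged alleles for three variants, flattened with offsets.
alleles = np.array(["A", "T", "G", "C", "AT", "T"], dtype=object)
offsets = np.array([0, 2, 5, 6])

# Explode to long format: one (variant_id, allele) pair per row.
counts = np.diff(offsets)  # alleles per variant -> [2 3 1]
variant_id = np.repeat(np.arange(len(counts)), counts)

long_table = list(zip(variant_id.tolist(), alleles.tolist()))
print(long_table)
```

A long-format table like this can be fed straight into `pandas.DataFrame`, at the cost of duplicating the variant id for every allele.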
A different way to approach this problem is to export the fields that need ragged string arrays to a different storage backend, such as Parquet, then use tools to query that dataset to produce a list of IDs that can be used to restrict the array data in Zarr. (I suggested this in #1059.) I've explored this a bit more in this notebook: https://github.com/pystatgen/sgkit/blob/17a24cf8ad755bb499b3a4a0caeca15390ef1a7e/ragged.ipynb

And I've also written a small prototype to export a couple of VCF fields to Parquet in pystatgen/sgkit@517a67e. This could be easily extended to export all VCF fixed and INFO fields, which are the ones that would be useful for filtering. Not sure what to do with this yet, just sharing as it might be useful.
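The query-then-restrict flow described above can be sketched with NumPy standing in for both backends: an array stands in for a field in the exported Parquet dataset, and another for the array data in the Zarr store. The field names and threshold are illustrative, not sgkit's actual schema:

```python
import numpy as np

# Stand-ins: "qual" would live in the exported Parquet dataset,
# "genotypes" in the Zarr store.
qual = np.array([50.0, 3.0, 99.0, 10.0, 77.0])  # per-variant field
genotypes = np.arange(5 * 4).reshape(5, 4)      # variants x samples

# 1. Query the exported fields to get the passing variant indices.
variant_ids = np.flatnonzero(qual >= 20)        # -> [0, 2, 4]

# 2. Use those indices to restrict the array data.
subset = genotypes[variant_ids]
print(subset.shape)  # (3, 4)
```

In the real workflow step 1 would be a Parquet query (e.g. via a SQL engine or pandas) and step 2 an indexed read from Zarr; only the index list crosses between the two backends.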
I've generalised the Parquet export code in #1083
Alleles are a challenge to represent efficiently in fixed-length arrays. There are a couple of problems: the number of alleles varies from variant to variant, and the allele strings themselves are of variable length.
Both these problems could be solved by using ragged arrays.
Zarr has support for ragged arrays, but these don't currently work with variable-length strings (needed for alleles), and they don't fit the Xarray data model, which assumes fixed-size dimensions. There is a good discussion of the problem in pydata/xarray#4285, in the context of Awkward Array.
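One workaround that does fit Xarray's fixed-size dimension model is to pad each variant's alleles out to the maximum count with a fill value, at the cost of wasted space on short rows. A minimal sketch with NumPy (the example alleles and the empty-string fill value are assumptions for illustration):

```python
import numpy as np

# Ragged alleles per variant.
ragged = [["A", "T"], ["G", "C", "AT"], ["T"]]

# Pad to the maximum number of alleles with a fill value so the result
# is a rectangular (variants x alleles) array.
max_alleles = max(len(r) for r in ragged)
padded = np.full((len(ragged), max_alleles), "", dtype=object)
for i, row in enumerate(ragged):
    padded[i, :len(row)] = row

print(padded.shape)  # (3, 3)
```

The padding approach trades storage for compatibility: downstream code must treat the fill value as "no allele", and a dataset with one highly multi-allelic site pads every other site to match.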