@pcarbo and I have decided to try HDF5 as a replacement for the current default RDS storage format, starting with R and Python. The basic data types we'd like to support are:
| HDF5 | R | Python |
|---|---|---|
| ? | character | str |
| ? | integer | int, np.int*, np.uint* |
| ? | double | float, np.float* |
| ? | vector | list, np.array |
| ? | matrix | np.matrix |
| ? | array | np.array, list of lists |
| ? | data.frame | pd.DataFrame |
| ? | NaN | np.nan |
| ? | NA | None |
Here np is numpy and pd is pandas. Here is a test on the Python side:
```python
import numpy as np
import pandas as pd

# One entry per row of the type table above.
data = {
    'character': 'pcarbo',
    'integer1': 1, 'integer2': np.uint8(1),
    'double1': 1.0, 'double2': np.float16(1.0),
    'vector1': [1, 2, 'gaow'], 'vector2': [1, 2, 3], 'vector3': np.array([1, 2, 3]),
    'matrix': np.matrix([[1, 2], [3, 4]]),
    'array1': np.array([[1, 2], [3, 4]]), 'array2': [[1, 2], [3, 4]],
    'dataframe': pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['row1', 'row2'])
}
# Make the dictionary contain itself to exercise recursive structures.
data['recursive'] = data
```

Here is the outcome in HDF5:
I used this API from UChicago:
https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py
But it would not be difficult, I presume, to customize.
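As a rough illustration of what such customization might start from, the type table above could be written as a small dispatcher that names the R type a Python value should map to. This is only a sketch: `r_type_of` is a hypothetical helper, not part of deepdish, and it covers only scalars and plain lists. It relies on the fact that numpy scalar types register with the standard `numbers` classes, so `np.int*` and `np.float*` values should be classified without importing numpy.

```python
import math
import numbers


def r_type_of(value):
    """Return the R type name (per the table above) for a Python scalar or list.

    Hypothetical helper for illustration only.
    """
    if value is None:
        return "NA"
    if isinstance(value, bool):
        # bool is an Integral in Python, but the table has no logical row yet.
        return "unmapped"
    if isinstance(value, str):
        return "character"
    if isinstance(value, numbers.Integral):
        return "integer"
    if isinstance(value, numbers.Real):
        return "NaN" if math.isnan(value) else "double"
    if isinstance(value, list):
        # A non-empty list of lists corresponds to an array, any other list to a vector.
        if value and all(isinstance(v, list) for v in value):
            return "array"
        return "vector"
    return "unmapped"
```

For example, `r_type_of([1, 2, 3])` gives `"vector"` and `r_type_of([[1, 2], [3, 4]])` gives `"array"`. Matrices and data frames would need numpy and pandas checks on top of this.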
A particularly difficult case is NULL/NA/NaN in R: Python has only None and NaN, and no NULL. #25
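One possible convention for keeping NA and NaN distinct in a numeric vector is to store the values alongside a boolean mask. The sketch below is an assumption about how this could be done, not what deepdish or any R HDF5 package does; the function names are hypothetical, and R's NULL is not addressed.

```python
import math


def encode_optional_floats(values):
    """Split a list that may contain None into (data, na_mask).

    None (R's NA) is stored as NaN in data and flagged in na_mask,
    so a genuine NaN stays distinguishable from a missing value.
    """
    data = [math.nan if v is None else float(v) for v in values]
    na_mask = [v is None for v in values]
    return data, na_mask


def decode_optional_floats(data, na_mask):
    """Rebuild the original list, restoring None wherever the mask is set."""
    return [None if is_na else x for x, is_na in zip(data, na_mask)]


original = [1.0, None, math.nan, 4.0]
data, na_mask = encode_optional_floats(original)
restored = decode_optional_floats(data, na_mask)
```

Both `data` and `na_mask` are homogeneous, so each could be written as an ordinary numeric or boolean dataset. The cost is one extra dataset per vector that may contain missing values.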
@pcarbo, am I missing any? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how to make the format consistent across languages.