New format on DSC data #86

@gaow

@pcarbo and I have decided to try HDF5 as a replacement for the current default RDS storage format, starting with R and Python. The basic data types we'd like to support are:

| HDF5 | R | Python |
|------|---|--------|
| ? | character | str |
| ? | integer | int, np.int*, np.uint* |
| ? | double | float, np.float* |
| ? | vector | list, np.array |
| ? | matrix | np.matrix |
| ? | array | np.array, list of lists |
| ? | data.frame | pd.DataFrame |
| ? | NaN | np.nan |
| ? | NA | None |
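
To make the R and Python columns concrete, here is a minimal sketch of a lookup that returns the R type name paired with a Python value. The function name `r_type_for` and the "unmapped" label are hypothetical, the HDF5 column is left out because it is still undecided, and numpy/pandas objects are recognized by module name so the sketch runs without them installed.

```python
import math


def r_type_for(value):
    """Return the R type name that the table pairs with a Python value.

    Hypothetical helper for illustration; it only covers the rows listed
    in the table and returns "unmapped" for anything else.
    """
    module = type(value).__module__.split(".")[0]
    name = type(value).__name__

    if value is None:
        return "NA"
    if isinstance(value, bool):
        return "unmapped"  # logical values are not listed in the table
    if isinstance(value, str):
        return "character"
    if isinstance(value, int) or (module == "numpy" and name.startswith(("int", "uint"))):
        return "integer"
    if isinstance(value, float) or (module == "numpy" and name.startswith("float")):
        return "NaN" if math.isnan(value) else "double"
    if isinstance(value, list):
        # A non-empty list whose items are all lists is treated as an array.
        nested = len(value) > 0 and all(isinstance(x, list) for x in value)
        return "array" if nested else "vector"
    if module == "numpy" and name == "matrix":
        return "matrix"
    if module == "numpy" and name == "ndarray":
        return "vector" if value.ndim == 1 else "array"
    if module == "pandas" and name == "DataFrame":
        return "data.frame"
    return "unmapped"


print(r_type_for("pcarbo"), r_type_for([1, 2, 3]), r_type_for([[1, 2], [3, 4]]))
```

The order of the checks matters: `bool` is a subclass of `int` in Python, so it has to be handled before the integer row.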

np stands for numpy and pd for pandas. Here is a test on the Python side:

import numpy as np
import pandas as pd
data = {'character': 'pcarbo',
        'integer1': 1, 'integer2': np.uint8(1), 
        'double1': 1.0, 'double2': np.float16(1.0), 
        'vector1': [1,2,'gaow'], 'vector2': [1,2,3], 'vector3': np.array([1,2,3]),
        'matrix': np.matrix([[1,2],[3,4]]),
        'array1': np.array([[1,2],[3,4]]), 'array2': [[1,2],[3,4]],
        'dataframe': pd.DataFrame({'A': [1,2], 'B': [3,4]}, index=['row1', 'row2'])
       }
data['recursive'] = data
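
One detail of the last line of the test: assigning a dictionary to one of its own keys stores a reference, not a copy, so the dictionary ends up containing itself. Whether a given HDF5 writer copes with that cycle is not something this issue establishes. The sketch below uses a small stand-in dictionary (hypothetical values) to show the difference between the self-reference and a one-level snapshot, assuming a snapshot is what is wanted.

```python
# Small stand-in for the test dictionary (hypothetical values).
data = {"integer1": 1, "vector2": [1, 2, 3]}

# As in the test: the dictionary now holds a reference to itself.
data["recursive"] = data
is_cyclic = data["recursive"] is data

# Alternative, assuming a one-level nested snapshot is the intent:
snapshot = {"integer1": 1, "vector2": [1, 2, 3]}
snapshot["recursive"] = dict(snapshot)  # shallow copy is taken before insertion
has_cycle = "recursive" in snapshot["recursive"]

print(is_cyclic, has_cycle)
```

A writer that walks the first structure recursively would need cycle detection; the second structure is a plain tree two levels deep.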

Here is the outcome in HDF5:

test.h5.zip

I used this API from UChicago:

https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py

I presume it would not be difficult to customize.

A particularly difficult case is NULL/NA/NaN in R: Python has only None and NaN, with no separate NULL (see #25).

@pcarbo am I missing any types? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how to make the cross-language format consistent.
