Standardize columnized dataset? #315

@yongtang

With the upcoming DatasetV2, many of the APIs are being simplified. That also opens up possibilities beyond just passing the dataset to tf.keras.

One area of interest is that we already support many columnized datasets, e.g., Arrow, Avro, Parquet, JSON, HDF5, etc. Those datasets could potentially be standardized on the same API so that we can treat them homogeneously. For example, ArrowDataset already exposes a columns() property method. We could apply the same to Avro, Parquet, JSON, HDF5, etc. Thoughts?
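To make the idea concrete, here is a rough sketch of what a shared interface might look like. Everything in it is hypothetical: the names `ColumnarDataset`, `InMemoryColumnarDataset`, and `describe` are made up for illustration and are not existing tensorflow-io classes, and exposing `columns` as a property is an assumption modeled loosely on what ArrowDataset does.

```python
from abc import ABC, abstractmethod


class ColumnarDataset(ABC):
    """Hypothetical common interface for column-oriented datasets."""

    @property
    @abstractmethod
    def columns(self):
        """Return the column names, in order."""


class InMemoryColumnarDataset(ColumnarDataset):
    """Toy stand-in for a format-specific dataset (Parquet, Avro, ...)."""

    def __init__(self, data):
        # data: mapping of column name -> list of values
        self._data = dict(data)

    @property
    def columns(self):
        return list(self._data)


def describe(dataset):
    """Format-agnostic helper: relies only on the shared `columns` property."""
    return "columns: " + ", ".join(dataset.columns)


toy = InMemoryColumnarDataset({"x": [1.0, 2.0], "y": [3.0, 4.0]})
summary = describe(toy)
```

The point is only that code such as `describe` would not need to know which file format backs the dataset.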

Since those columnized datasets hold largely numeric values, I think we could also have a common base class for them and support additional operations. For example, dataset_1 + dataset_2 => dataset_3 (add), where dataset_3 could be passed to tf.keras. The implementation could start with zip + map in Python (C++ would not even be needed). Maybe this is a use case that would help users?
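A minimal sketch of the zip + map approach, assuming TensorFlow 2.x with eager execution. `NumericDataset` is a hypothetical wrapper name, not an existing class; it wraps a `tf.data.Dataset` rather than subclassing it, which is a simplification of the base class idea.

```python
import tensorflow as tf


class NumericDataset:
    """Hypothetical wrapper that adds elementwise `+` to a tf.data.Dataset."""

    def __init__(self, dataset):
        self._dataset = dataset

    @property
    def dataset(self):
        # The underlying tf.data.Dataset, e.g. to hand to tf.keras.
        return self._dataset

    def __add__(self, other):
        if not isinstance(other, NumericDataset):
            return NotImplemented
        # zip pairs up elements; map adds each pair.
        zipped = tf.data.Dataset.zip((self._dataset, other._dataset))
        return NumericDataset(zipped.map(lambda a, b: a + b))


dataset_1 = NumericDataset(tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0]))
dataset_2 = NumericDataset(tf.data.Dataset.from_tensor_slices([10.0, 20.0, 30.0]))
dataset_3 = dataset_1 + dataset_2

values = [float(v) for v in dataset_3.dataset.as_numpy_iterator()]
```

Since zip stops at the shorter input, datasets of unequal length would be silently truncated; a real implementation would probably want to decide how to handle that.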

/cc @terrytangyuan @BryanCutler
