Apache ORC Support in TensorFlow IO #1372

Open
@oliverhu

(Creating this issue for visibility so that interested people can join the discussion.)

Overview

Load Apache ORC-formatted data natively into TensorFlow from any file system that TensorFlow supports, e.g. HDFS or local disk.

Motivation

We have traditionally used Avro to store our datasets, but a row-based format is becoming inefficient for big data analytics processing. Historically, we selected ORC as our columnar storage format (not planning to argue Parquet vs ORC here ;)).

Design Discussions

Milestones

  • Add Apache ORC build dependency.
  • Implement a simple ORC dataset that maps records in ORC files into Tensors.
  • Add a tutorial for the ORC reader.
  • Feature schema support: SparseTensor and VarLenFeature.
  • Feature schema support: dense Tensor with FixedLenFeature only (following parse_example_v2).
  • Usability improvements.
  • Performance tuning.
  • Feature schema support: RaggedTensor.
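
To make the three feature-schema milestones concrete, here is a minimal pure-Python sketch of the layouts a list-valued ORC column could be mapped to. It does not read ORC files and is not the proposed implementation; the function names (`column_to_dense`, `column_to_sparse`, `column_to_ragged`) are hypothetical and only illustrate the dense (FixedLenFeature-style), sparse (VarLenFeature-style), and ragged representations.

```python
from typing import List, Sequence, Tuple


def column_to_dense(rows: Sequence[Sequence[float]], length: int) -> List[List[float]]:
    """FixedLenFeature-style: every row must hold exactly `length` values."""
    for i, row in enumerate(rows):
        if len(row) != length:
            raise ValueError(f"row {i} has {len(row)} values, expected {length}")
    return [list(row) for row in rows]


def column_to_sparse(
    rows: Sequence[Sequence[float]],
) -> Tuple[List[Tuple[int, int]], List[float], Tuple[int, int]]:
    """VarLenFeature-style: COO triple (indices, values, dense_shape)."""
    indices: List[Tuple[int, int]] = []
    values: List[float] = []
    max_len = 0
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            indices.append((r, c))  # position of the value in the 2-D grid
            values.append(value)
        max_len = max(max_len, len(row))
    return indices, values, (len(rows), max_len)


def column_to_ragged(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[int]]:
    """Ragged-style: flat values plus row_splits marking row boundaries."""
    values: List[float] = []
    row_splits = [0]
    for row in rows:
        values.extend(row)
        row_splits.append(len(values))  # end offset of this row
    return values, row_splits


# A variable-length column with three records; the second one is empty.
column = [[1.0, 2.0], [], [3.0]]

sparse_indices, sparse_values, dense_shape = column_to_sparse(column)
# sparse_indices == [(0, 0), (0, 1), (2, 0)], dense_shape == (3, 2)

ragged_values, row_splits = column_to_ragged(column)
# ragged_values == [1.0, 2.0, 3.0], row_splits == [0, 2, 2, 3]
```

The dense path rejects rows of the wrong length, which is why it can only serve fixed-length features; the sparse and ragged paths both accept variable-length rows and differ only in how they record row boundaries.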
