Apache ORC Support in TensorFlow IO

(Creating this issue for visibility so that anyone interested can join the discussion.)

## Overview
Load Apache ORC formatted data natively into TensorFlow from any file system that TensorFlow supports, e.g. HDFS or local disk.

## Motivation
We have traditionally used Avro to store our datasets, but a row-based format is becoming inefficient for big data analytics processing. Historically, we selected ORC as our columnar storage format (not planning to argue Parquet vs ORC here ;)).

## Design Discussions
- Apache ORC would be brought in via https://github.com/bazelbuild/rules_foreign_cc
- Feature-wise, I expect the APIs to be similar to those of the Parquet or [Arrow reader](https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f).

## Milestones
- [x] Add Apache ORC build dependency.
- [x] Implement a simple ORC dataset that maps records in ORC files into Tensors.
- [x] Add a tutorial for the ORC reader.
- [ ] Feature schema support: `SparseTensor` and `VarLenFeature`.
- [ ] Feature schema support: dense `Tensor` with `FixedLenFeature` only (following `parse_example_v2`).
- [ ] Usability improvements.
- [ ] Performance tuning.
- [ ] Feature schema support: `RaggedTensor`.
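
To make the three feature schema milestones concrete, here is a rough pure-Python sketch of how one list-valued column could map onto dense, sparse, and ragged layouts. It does not use the ORC reader or TensorFlow at all. The helper names (`to_dense`, `to_sparse`, `to_ragged`) and the sample rows are made up for illustration and are not part of any existing API. The output shapes only mirror the usual `SparseTensor` (indices, values, dense_shape) and `RaggedTensor` (values, row_splits) conventions.

```python
from typing import List, Sequence, Tuple


def to_dense(rows: Sequence[Sequence[int]], length: int) -> List[List[int]]:
    """FixedLenFeature-style: every row must hold exactly `length` values."""
    for i, row in enumerate(rows):
        if len(row) != length:
            raise ValueError(f"row {i} has {len(row)} values, expected {length}")
    return [list(row) for row in rows]


def to_sparse(
    rows: Sequence[Sequence[int]],
) -> Tuple[List[Tuple[int, int]], List[int], Tuple[int, int]]:
    """VarLenFeature-style: (indices, values, dense_shape) in row-major order."""
    indices: List[Tuple[int, int]] = []
    values: List[int] = []
    for r, row in enumerate(rows):
        for c, v in enumerate(row):
            indices.append((r, c))
            values.append(v)
    max_len = max((len(row) for row in rows), default=0)
    return indices, values, (len(rows), max_len)


def to_ragged(rows: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Ragged-style: flat values plus row_splits marking each row's boundaries."""
    values: List[int] = []
    row_splits: List[int] = [0]
    for row in rows:
        values.extend(row)
        row_splits.append(len(values))
    return values, row_splits


# Hypothetical list-valued column with three records of varying length.
rows = [[1, 2], [], [3]]
sparse_indices, sparse_values, sparse_shape = to_sparse(rows)
ragged_values, ragged_row_splits = to_ragged(rows)
```

The dense path rejects `rows` as given because the row lengths differ, which is why variable-length columns need the sparse or ragged milestones.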

Apache ORC Support in TensorFlow IO #1372

Overview

Motivation

Design Discussions

Milestones

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Apache ORC Support in TensorFlow IO #1372

Description

Overview

Motivation

Design Discussions

Milestones

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions