Following the discussion on #366, using eager mode gives more flexibility to handle certain aspects that are common in columnar data, such as N/A or NULL values.
Previously, when creating a dataset, we created it in one shot across every column. For example, for CsvDataset we had to specify EVERY column up front before it could even run.
That requirement existed because we used to implement against TF 1.13/1.14, where the TF graph is static. So we needed to know EVERYTHING beforehand in order to run the graph (or pass it to tf.keras).
Now that we are moving to TF 2.0, knowing everything beforehand is no longer necessary. We could just parse the file and find the metadata in eager mode, then build the dataset to pass to tf.keras.
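To make the two-phase idea concrete, here is a minimal sketch in plain Python (not the actual tensorflow-io API; `discover_columns` is a hypothetical helper): eagerly read only the header to learn the column names, then use those names to decide what to build.

```python
import csv
import io

def discover_columns(csv_text):
    """Eagerly read just the header row to discover column names,
    instead of requiring the caller to declare every column up front."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)

data = "id,name,score\n1,alice,3.5\n2,bob,\n"
columns = discover_columns(data)
# columns is now available in eager mode, before any dataset is built
```

With the metadata in hand, the per-column datasets could then be constructed and handed to tf.keras, with no static graph forcing everything to be declared beforehand.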
In this situation, I am wondering if it makes sense to focus on "building a dataset one column at a time"? Something like:
```python
# read the parquet file and find all columns,
# then build one dataset per column and zip them together
dataset = zip([ParquetDataset(filename, column) for column in columns])
```
The reason is that, when we try to build a dataset from ALL columns at once, we assume every column has the same number of records. But that is not the case for many file formats such as HDF5 or Feather (if I understand correctly).
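The unequal-length concern can be illustrated with a plain-Python analogue (tf.data.Dataset.zip behaves the same way: iteration stops once any input dataset is exhausted):

```python
# two columns with a different number of records,
# as might happen in an HDF5 or Feather file
col_a = [1, 2, 3, 4]   # 4 records
col_b = [10.0, 20.0]   # only 2 records

# zipping per-column datasets truncates to the shortest column
# rather than failing at construction time
rows = list(zip(col_a, col_b))
```

Building one dataset per column defers the length question to iteration time, instead of baking an equal-length assumption into the dataset constructor.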
I noticed this issue when I tried to play with pandas. I just realized that in our current implementation it is hard to handle NA or null fields.
But with TF 2.0 and eager execution, we actually have more freedom to handle those situations. For example, we could apply an additional bitmask before merging different columns.
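As a sketch of what the bitmask step might look like (plain Python, with an assumed validity-mask representation similar to Arrow's, not an existing tensorflow-io feature):

```python
# one column's values plus a validity bitmask; False marks an NA slot
values = [3.0, 0.0, 7.0, 0.0]
valid  = [True, False, True, False]

# apply the mask eagerly, turning invalid slots into explicit None,
# before this column is merged (zipped) with the others
masked = [v if ok else None for v, ok in zip(values, valid)]
```

Because the masking happens eagerly on a single column, each column can carry its own NA handling before the per-column datasets are combined.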
From that standpoint, maybe it makes more sense to focus on building a dataset with only one column at a time?