[Docs] Add ParquetDataset document. (DeepRec-AI#304)
A-Wanderer authored Jul 8, 2022
1 parent 468115b commit 22c564b
Showing 2 changed files with 115 additions and 0 deletions.
114 changes: 114 additions & 0 deletions docs/ParquetDataset.md
@@ -0,0 +1,114 @@
# ParquetDataset

## Features

1. ParquetDataset supports reading data from Parquet files.
2. ParquetDataset supports reading Parquet files from the local filesystem as well as from S3/OSS/HDFS.

## API

### API Definition

```python
class ParquetDataset(dataset_ops.DatasetV2):
  def __init__(
      self, filenames,
      batch_size=1,
      fields=None,
      partition_count=1,
      partition_index=0,
      drop_remainder=False,
      num_parallel_reads=None,
      num_sequential_reads=1):

# Create a `ParquetDataset` from a filenames dataset.
def read_parquet(
    batch_size,
    fields=None,
    partition_count=1,
    partition_index=0,
    drop_remainder=False,
    num_parallel_reads=None,
    num_sequential_reads=1):
```

### Arguments

- filenames: A 0-D or 1-D `tf.string` tensor containing one or more filenames.
- batch_size: (Optional.) Maximum number of samples in an output batch.
- fields: (Optional.) List of DataFrame fields.
- partition_count: (Optional.) Count of row group partitions (see the sharding sketch after this list).
- partition_index: (Optional.) Index of the row group partition to read.
- drop_remainder: (Optional.) If True, only keep batches with exactly `batch_size` samples.
- num_parallel_reads: (Optional.) A `tf.int64` scalar representing the number of files to read in parallel. Defaults to reading files sequentially.
- num_sequential_reads: (Optional.) A `tf.int64` scalar representing the number of batches to read sequentially. Defaults to 1.
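
As an illustration of the `partition_count` / `partition_index` pair, the sketch below shards row groups across workers. This is only a sketch under assumed values; the file path, worker count, and worker index are hypothetical.

```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Hypothetical job layout: 4 workers, this process is worker 1.
num_workers = 4
worker_id = 1

# Each worker reads only its own share of the row group partitions.
ds = parquet_dataset_ops.ParquetDataset(
    '/path/to/f1.parquet',
    batch_size=1024,
    partition_count=num_workers,
    partition_index=worker_id)
ds = ds.prefetch(4)
```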

## Usage Examples

### 1. Example: Read from one file on the local filesystem
```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Read from a parquet file.
ds = parquet_dataset_ops.ParquetDataset('/path/to/f1.parquet',
                                        batch_size=1024)
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
# {'a': tensora, 'c': tensorc}
```
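
Continuing the snippet above, a minimal sketch of pulling batches out of the iterator with a TF1-style session (assuming a TF 1.x runtime, as used by DeepRec):

```python
# Consume batches until the dataset is exhausted;
# `batch` is the dict of tensors returned by `it.get_next()` above.
with tf.Session() as sess:
  try:
    while True:
      values = sess.run(batch)  # dict: field name -> numpy value
      print(values)
  except tf.errors.OutOfRangeError:
    pass
```
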
### 2. Example: Read from a filenames dataset
```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# `func` is a Python generator yielding parquet filenames, for example:
def func():
  yield '/path/to/f1.parquet'

filenames = tf.data.Dataset.from_generator(func, tf.string, tf.TensorShape([]))
# Define data frame fields.
fields = [
    parquet_dataset_ops.DataFrame.Field('A', tf.int64),
    parquet_dataset_ops.DataFrame.Field('C', tf.int64, ragged_rank=1)]
# Read from parquet files by consuming the upstream filename dataset.
ds = filenames.apply(parquet_dataset_ops.read_parquet(1024, fields=fields))
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
# {'A': tensora, 'C': tensorc}
...
```
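
As an alternative to a Python generator, the upstream filename dataset can also be built from a glob pattern with standard `tf.data` ops. A sketch; the glob pattern below is a placeholder:

```python
# Build the filename dataset from a glob pattern instead of a generator.
filenames = tf.data.Dataset.list_files('/path/to/*.parquet', shuffle=False)
ds = filenames.apply(parquet_dataset_ops.read_parquet(1024, fields=fields))
ds = ds.prefetch(4)
```
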
### 3. Example: Read from files on S3/OSS/HDFS

```bash
export S3_ENDPOINT=oss-cn-shanghai-internal.aliyuncs.com
export AWS_ACCESS_KEY_ID=my_id
export AWS_SECRET_ACCESS_KEY=my_secret
export S3_ADDRESSING_STYLE=virtual
```

```{eval-rst}
.. note::
   See https://docs.w3cub.com/tensorflow~guide/deploy/s3.html for more
   information.
.. note::
   Set `S3_ADDRESSING_STYLE` to `virtual` to support OSS.
.. note::
   Set `S3_USE_HTTPS` to `0` to use `http` for the S3 endpoint.
```
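
The same configuration can also be set from Python before the dataset is created; the endpoint and credentials below are the same placeholders as in the shell exports above:

```python
import os

# Equivalent to the shell exports above; set these before any file is opened.
os.environ['S3_ENDPOINT'] = 'oss-cn-shanghai-internal.aliyuncs.com'
os.environ['AWS_ACCESS_KEY_ID'] = 'my_id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my_secret'
os.environ['S3_ADDRESSING_STYLE'] = 'virtual'
```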

```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Read from parquet files on remote services for selected fields.
ds = parquet_dataset_ops.ParquetDataset(
    ['s3://path/to/f1.parquet',
     'oss://path/to/f2.parquet',
     'hdfs://host:port/path/to/f3.parquet'],
    batch_size=1024,
    fields=['a', 'c'])
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
# {'a': tensora, 'c': tensorc}
...
```
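
A sketch combining the remaining options on local files, reading two files in parallel and dropping the final partial batch (the filenames are placeholders):

```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Read two files in parallel and keep only full batches of 1024 samples.
ds = parquet_dataset_ops.ParquetDataset(
    ['/path/to/f1.parquet', '/path/to/f2.parquet'],
    batch_size=1024,
    drop_remainder=True,
    num_parallel_reads=2)
ds = ds.prefetch(4)
```
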
1 change: 1 addition & 0 deletions docs/index.md
@@ -105,6 +105,7 @@ BFloat16
WorkQueue
KafkaDataset
KafkaGroupIODataset
ParquetDataset
```

```{toctree}
