Commit 22c564b (parent: 468115b)
[Docs] Add ParquetDataset document. (DeepRec-AI#304)
Showing 2 changed files with 115 additions and 0 deletions.
@@ -0,0 +1,114 @@
# ParquetDataset

## Features

1. ParquetDataset supports reading data from Parquet files.
2. ParquetDataset supports reading Parquet files from the local filesystem as well as from S3/OSS/HDFS.

## API

### API Description

```python
class ParquetDataset(dataset_ops.DatasetV2):
  def __init__(
      self, filenames,
      batch_size=1,
      fields=None,
      partition_count=1,
      partition_index=0,
      drop_remainder=False,
      num_parallel_reads=None,
      num_sequential_reads=1):

# Create a `ParquetDataset` from a filenames dataset.
def read_parquet(
    batch_size,
    fields=None,
    partition_count=1,
    partition_index=0,
    drop_remainder=False,
    num_parallel_reads=None,
    num_sequential_reads=1):
```

### Parameters

- filenames: A 0-D or 1-D `tf.string` tensor containing one or more filenames.
- batch_size: (Optional.) Maximum number of samples in an output batch.
- fields: (Optional.) List of DataFrame fields; each field can be given as a column name string or as a `DataFrame.Field` that also carries the column's dtype (see Example 2 below).
- partition_count: (Optional.) Count of row group partitions.
- partition_index: (Optional.) Index of row group partitions (see the sharding sketch after this list).
- drop_remainder: (Optional.) If True, only keep batches with exactly `batch_size` samples.
- num_parallel_reads: (Optional.) A `tf.int64` scalar representing the number of files to read in parallel. Defaults to reading files sequentially.
- num_sequential_reads: (Optional.) A `tf.int64` scalar representing the number of batches to read sequentially. Defaults to 1.
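
A minimal sketch of sharding with `partition_count` / `partition_index`; the worker count and index below are hypothetical values you would obtain from your own cluster configuration:

```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Hypothetical cluster values: 2 workers, this process is worker 0.
num_workers = 2
worker_index = 0

# Each worker reads a disjoint partition of row groups from the same files,
# so together the workers cover the whole dataset exactly once.
ds = parquet_dataset_ops.ParquetDataset(
    '/path/to/f1.parquet',
    batch_size=1024,
    partition_count=num_workers,
    partition_index=worker_index)
```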

## Examples

### 1. Example: Read from one file on the local filesystem
```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Read from a parquet file.
ds = parquet_dataset_ops.ParquetDataset('/path/to/f1.parquet',
                                        batch_size=1024)
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
# {'a': tensora, 'c': tensorc}
```
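
A minimal sketch of draining the iterator, assuming TF 1.x graph mode as used above:

```python
# Run the graph until the dataset is exhausted.
with tf.Session() as sess:
  while True:
    try:
      result = sess.run(batch)  # dict mapping field names to numpy values
    except tf.errors.OutOfRangeError:
      break
```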
### 2. Example: Read from a filenames dataset
```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# `func` is a generator that yields scalar `tf.string` filenames of
# parquet files to read.
filenames = tf.data.Dataset.from_generator(func, tf.string, tf.TensorShape([]))
# Define data frame fields.
fields = [
    parquet_dataset_ops.DataFrame.Field('A', tf.int64),
    parquet_dataset_ops.DataFrame.Field('C', tf.int64, ragged_rank=1)]
# Read from parquet files by consuming the upstream filenames dataset.
ds = filenames.apply(parquet_dataset_ops.read_parquet(1024, fields=fields))
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
# {'A': tensora, 'C': tensorc}
...
```
### 3. Example: Read from files on S3/OSS/HDFS

```bash
export S3_ENDPOINT=oss-cn-shanghai-internal.aliyuncs.com
export AWS_ACCESS_KEY_ID=my_id
export AWS_SECRET_ACCESS_KEY=my_secret
export S3_ADDRESSING_STYLE=virtual
```

```{eval-rst}
.. note::
   See https://docs.w3cub.com/tensorflow~guide/deploy/s3.html for more
   information.
.. note::
   Set `S3_ADDRESSING_STYLE` to `virtual` to support OSS.
.. note::
   Set `S3_USE_HTTPS` to `0` to use `http` for the S3 endpoint.
```
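
The same settings can also be applied from Python via `os.environ`; a minimal sketch, assuming the variables are set before the S3/OSS filesystem is first used (the endpoint and credentials below are placeholders, as in the shell example above):

```python
import os

# Placeholder endpoint and credentials; replace with your own.
os.environ['S3_ENDPOINT'] = 'oss-cn-shanghai-internal.aliyuncs.com'
os.environ['AWS_ACCESS_KEY_ID'] = 'my_id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my_secret'
os.environ['S3_ADDRESSING_STYLE'] = 'virtual'
```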

```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import parquet_dataset_ops

# Read from parquet files on remote services for selected fields.
ds = parquet_dataset_ops.ParquetDataset(
    ['s3://path/to/f1.parquet',
     'oss://path/to/f2.parquet',
     'hdfs://host:port/path/to/f3.parquet'],
    batch_size=1024,
    fields=['a', 'c'])
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
# {'a': tensora, 'c': tensorc}
...
```
@@ -105,6 +105,7 @@ BFloat16
WorkQueue
KafkaDataset
KafkaGroupIODataset
ParquetDataset
```

```{toctree}