Adding Beam Dataset, Out of Disk Space on Dataflow using GCS #1689

Open
@Ouwen

Description

What I need help with / What I was wondering
I would like to add a dataset for later contribution: https://github.com/Ouwen/datasets/tree/tfds

What I've tried so far
Small slices of the data work fine both locally and on Google Cloud Dataflow. I run the following script:

DATASET_NAME=duke_ultranet/dynamic_rx_beamformed
GCP_PROJECT=my-project
GCS_BUCKET=bucket-location
echo "git+git://github.com/ouwen/datasets@tfds" > /tmp/beam_requirements.txt

python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options=\
"runner=DataflowRunner,project=$GCP_PROJECT,job_name=duke-ultranet-dynamic-rx-beamformed-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"requirements_file=/tmp/beam_requirements.txt,region=us-east1,"\
"autoscaling_algorithm=NONE,num_workers=20,"\
"machine_type=n1-highmem-16,experiments=shuffle_mode=service"

I get the following errors when using Dataflow Shuffle (experiments=shuffle_mode=service) on the full dataset (3.85 TB):

IOError: [Errno 28] No space left on device
  at copyfile (/usr/lib64/python2.7/shutil.py:84)
  at copy (/usr/lib64/python2.7/shutil.py:119)
  at SetConfiguredUsers (/usr/lib64/python2.7/site-packages/google_compute_engine/accounts/accounts_utils.py:293)
  at HandleAccounts (/usr/lib64/python2.7/site-packages/google_compute_engine/accounts/accounts_daemon.py:263)
  at WatchMetadata (/usr/lib64/python2.7/site-packages/google_compute_engine/metadata_watcher.py:196)

RuntimeError: tensorflow.python.framework.errors_impl.InternalError: Could not write to the internal temporary file. [while running 'train/WriteFinalShards']
  at __exit__ (/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/errors_impl.py:554)
  at close (/usr/local/lib/python3.7/site-packages/tensorflow_core/python/lib/io/tf_record.py:246)
  at __exit__ (/usr/local/lib/python3.7/site-packages/tensorflow_core/python/lib/io/tf_record.py:227)
  at _write_tfrecord (/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/tfrecords_writer.py:121)
  at _write_final_shard (/home/jupyter/datasets/tensorflow_datasets/core/tfrecords_writer.py:351)
  at <lambda> (/opt/conda/lib/python3.7/site-packages/apache_beam/transforms/core.py:1437)
  at apache_beam.runners.common.SimpleInvoker.invoke_process (common.py:498)
  at apache_beam.runners.common.DoFnRunner.process (common.py:883)
  at raise_with_traceback (/usr/local/lib/python3.7/site-packages/future/utils/__init__.py:421)
  at apache_beam.runners.common.DoFnRunner._reraise_augmented (common.py:956)
  at apache_beam.runners.common.DoFnRunner.process (common.py:885)
  at apache_beam.runners.common.DoFnRunner.receive (common.py:878)
  at apache_beam.runners.worker.operations.DoOperation.process (operations.py:658)
  at apache_beam.runners.worker.operations.DoOperation.process (operations.py:657)
  at apache_beam.runners.worker.operations.SingletonConsumerSet.receive (operations.py:178)
  at apache_beam.runners.worker.operations.Operation.output (operations.py:304)
  at dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process (shuffle_operations.py:268)
  at dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process (shuffle_operations.py:261)
  at apache_beam.runners.worker.operations.SingletonConsumerSet.receive (operations.py:178)
  at apache_beam.runners.worker.operations.Operation.output (operations.py:304)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:84)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:80)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:79)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:64)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:63)
  at execute (/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py:176)
  at do_work (/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py:648)

If I run without Dataflow Shuffle but with 500 GB disks on 10 workers (5 TB total for a 2.5 TB dataset), I get:

    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 210, in __setstate__
    self.__init__(**state)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1116, in __init__
    super(BeamBasedBuilder, self).__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 53, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 204, in __init__
    self.info.initialize_from_bucket()
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 428, in initialize_from_bucket
    tmp_dir = tempfile.mkdtemp("tfds")
  File "/usr/local/lib/python3.7/tempfile.py", line 366, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 28] No space left on device: '/tmp/tmpereoz1ljtfds'

It seems to process everything, but only has trouble writing to the final GCS directory. Is the temp GCS bucket location being respected?
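
One partial answer, going only by the second traceback above: the failing call is Python's tempfile.mkdtemp inside DatasetInfo.initialize_from_bucket, and the stdlib tempfile module only ever picks a local directory (TMPDIR/TEMP/TMP, falling back to /tmp); it never consults Beam's temp_location, so at least this particular write lands on the worker's local disk no matter what GCS temp bucket is configured. A minimal sketch of that behavior:

import tempfile

# tempfile resolves its directory from TMPDIR/TEMP/TMP and falls back to
# /tmp on the local filesystem; a gs:// temp_location plays no part here.
print(tempfile.gettempdir())        # typically '/tmp' on a Dataflow worker

# Same call as dataset_info.py:428 in the traceback above:
tmp_dir = tempfile.mkdtemp("tfds")  # e.g. '/tmp/tmpereoz1ljtfds'
print(tmp_dir)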


Environment information

  • Operating System: Linux
  • Python version: 3.7
  • tensorflow_datasets version: https://github.com/Ouwen/datasets/tree/tfds (fork)
  • tensorflow version: 2.1
