Description
What I need help with / What I was wondering
I would like to add a dataset for later contribution: https://github.com/Ouwen/datasets/tree/tfds
What I've tried so far
Small slices of the data work fine both locally and on Google Cloud Dataflow. I run the following script:
DATASET_NAME=duke_ultranet/dynamic_rx_beamformed
GCP_PROJECT=my-project
GCS_BUCKET=gs://bucket-location
echo "git+git://github.com/ouwen/datasets@tfds" > /tmp/beam_requirements.txt

python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options=\
"runner=DataflowRunner,project=$GCP_PROJECT,job_name=duke-ultranet-dynamic-rx-beamformed-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"requirements_file=/tmp/beam_requirements.txt,region=us-east1,"\
"autoscaling_algorithm=NONE,num_workers=20,"\
"machine_type=n1-highmem-16,experiments=shuffle_mode=service"
I get the following errors when using Dataflow Shuffle (experiments=shuffle_mode=service) on the full dataset (3.85 TB):
IOError: [Errno 28] No space left on device
  at copyfile (/usr/lib64/python2.7/shutil.py:84)
  at copy (/usr/lib64/python2.7/shutil.py:119)
  at SetConfiguredUsers (/usr/lib64/python2.7/site-packages/google_compute_engine/accounts/accounts_utils.py:293)
  at HandleAccounts (/usr/lib64/python2.7/site-packages/google_compute_engine/accounts/accounts_daemon.py:263)
  at WatchMetadata (/usr/lib64/python2.7/site-packages/google_compute_engine/metadata_watcher.py:196)
RuntimeError: tensorflow.python.framework.errors_impl.InternalError: Could not write to the internal temporary file. [while running 'train/WriteFinalShards']
  at __exit__ (/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/errors_impl.py:554)
  at close (/usr/local/lib/python3.7/site-packages/tensorflow_core/python/lib/io/tf_record.py:246)
  at __exit__ (/usr/local/lib/python3.7/site-packages/tensorflow_core/python/lib/io/tf_record.py:227)
  at _write_tfrecord (/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/tfrecords_writer.py:121)
  at _write_final_shard (/home/jupyter/datasets/tensorflow_datasets/core/tfrecords_writer.py:351)
  at <lambda> (/opt/conda/lib/python3.7/site-packages/apache_beam/transforms/core.py:1437)
  at apache_beam.runners.common.SimpleInvoker.invoke_process (common.py:498)
  at apache_beam.runners.common.DoFnRunner.process (common.py:883)
  at raise_with_traceback (/usr/local/lib/python3.7/site-packages/future/utils/__init__.py:421)
  at apache_beam.runners.common.DoFnRunner._reraise_augmented (common.py:956)
  at apache_beam.runners.common.DoFnRunner.process (common.py:885)
  at apache_beam.runners.common.DoFnRunner.receive (common.py:878)
  at apache_beam.runners.worker.operations.DoOperation.process (operations.py:658)
  at apache_beam.runners.worker.operations.DoOperation.process (operations.py:657)
  at apache_beam.runners.worker.operations.SingletonConsumerSet.receive (operations.py:178)
  at apache_beam.runners.worker.operations.Operation.output (operations.py:304)
  at dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process (shuffle_operations.py:268)
  at dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process (shuffle_operations.py:261)
  at apache_beam.runners.worker.operations.SingletonConsumerSet.receive (operations.py:178)
  at apache_beam.runners.worker.operations.Operation.output (operations.py:304)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:84)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:80)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:79)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:64)
  at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:63)
  at execute (/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py:176)
  at do_work (/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py:648)
If I run without Dataflow Shuffle but with 500 GB disks on 10 workers (5 TB of local disk for a 2.5 TB dataset), I get the following error (a sketch of the worker options for that setup follows the trace):
    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 210, in __setstate__
    self.__init__(**state)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1116, in __init__
    super(BeamBasedBuilder, self).__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 53, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 204, in __init__
    self.info.initialize_from_bucket()
  File "/usr/local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 428, in initialize_from_bucket
    tmp_dir = tempfile.mkdtemp("tfds")
  File "/usr/local/lib/python3.7/tempfile.py", line 366, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 28] No space left on device: '/tmp/tmpereoz1ljtfds'
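The worker options for that run were along these lines (a sketch; the exact flags are approximate, inferred from the 500 GB x 10 worker setup described above):

```python
# Approximate option deltas for the non-shuffle-service run: 10 workers with
# 500 GB local disks instead of experiments=shuffle_mode=service.
beam_options_local_disk = dict(
    runner="DataflowRunner",
    autoscaling_algorithm="NONE",
    num_workers=10,
    disk_size_gb=500,  # 10 workers x 500 GB = 5 TB of local scratch
    machine_type="n1-highmem-16",
    # experiments=["shuffle_mode=service"] removed for this run
)
```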
It seems to process everything and only has trouble writing to the final GCS directory. Is the temp GCS bucket location being respected?
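From the last traceback, my reading is that the failing mkdtemp call happens on the worker's local filesystem rather than in the Beam temp_location: tempfile.mkdtemp() resolves its parent directory from the local TMPDIR (usually /tmp), so the GCS temp bucket never comes into play for that call. A minimal sketch of that behaviour (the mount path below is purely hypothetical, for illustration):

```python
import os
import tempfile

# tempfile.mkdtemp() picks its parent directory from TMPDIR/TEMP/TMP (or
# tempfile.tempdir) on the local filesystem -- the GCS temp_location passed
# to Beam is not consulted here.
print(tempfile.gettempdir())  # typically /tmp on a Dataflow worker

# Hypothetical workaround sketch: point Python's temp directory at a larger
# mounted disk before the builder is unpickled on the worker.
os.environ["TMPDIR"] = "/mnt/disks/scratch"  # assumed mount point, illustration only
tempfile.tempdir = "/mnt/disks/scratch"
```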
Environment information
- Operating System: Linux
- Python version: 3.7
- tensorflow_datasets version: https://github.com/Ouwen/datasets/tree/tfds
- tensorflow version: 2.1