Closed
Description
I am getting a core dump during interpreter teardown, when using the s3 filesystem. Can I please be given guidance how to handle this issue. Please see script to reproduce inside docker:
FROM tensorflow/tensorflow:2.14.0-gpu
The following environment variables are set
"AWS_ACCESS_KEY_ID": xxx,
"AWS_SECRET_ACCESS_KEY": xxx,
"AWS_ENDPOINT_URL_S3": xxx,
"AWS_REGION": "us-east-1",
"S3_USE_HTTPS": "1",
"S3_VERIFY_SSL": "1",
"S3_DISABLE_MULTI_PART_DOWNLOAD": "0",
"S3_ENDPOINT": xxx,
import os
import tensorflow as tf
import tensorflow_io as tfio
def illustrate_core_dump():
print(f"tf version: {tf.__version__}")
print(f"tfio version: {tfio.__version__}")
filename = f"{os.environ['CLOUD_MOUNT']}/tmp/test_tfrecord.tfrecord"
assert filename.startswith("s3://"), "problem appears to be be for s3 filesystem only"
ds = tf.data.TFRecordDataset(filename, "GZIP")
for i in ds:
print(f"i.shape: {i.shape}")
if __name__ == "__main__":
illustrate_core_dump()
print("reaches here successfully")
print("something broken during destruction and tf")
# during interpreter teardown if s3 filesystem used we will get
# pure virtual method called
# terminate called without an active exception
# Aborted (core dumped)
# gs:// and file:// do not exhibit this issue which don't rely on tfio
TF_CPP_MIN_LOG_LEVEL=0 python notebooks/illustrate_core_dump.py
2024-01-01 18:07:11.253238: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-01 18:07:11.253287: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-01 18:07:11.253323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-01 18:07:11.262384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf version: 2.14.0
tfio version: 0.35.0
2024-01-01 18:07:14.402239: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.413303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.416545: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.421598: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.423868: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.426098: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.494277: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.496519: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.498484: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.500342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
i.shape: ()
reaches here successfully
something broken during destruction and tf
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
Metadata
Metadata
Assignees
Labels
No labels