Unable to reach local dispatcher #2

Open
darthnoward opened this issue Dec 21, 2023 · 0 comments

I have two machines: one with a GPU at IP address 10.42.0.1, and a remote CPU worker at IP address 192.168.1.136.

The remote machine is running:

import tensorflow as tf

# Dispatcher listening on port 5000.
d_config = tf.data.experimental.service.DispatcherConfig(port=5000)
dispatcher = tf.data.experimental.service.DispatchServer(d_config)

# Worker on port 5001, registered with the dispatcher above and advertising
# this machine's LAN address to clients.
w_port = 5001
w_config = tf.data.experimental.service.WorkerConfig(
    dispatcher_address=dispatcher.target.split("://")[1],
    worker_address="192.168.1.136" + ":" + str(w_port),
    port=w_port)
worker = tf.data.experimental.service.WorkerServer(w_config)

dispatcher.join()
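
For reference, the remote log further down shows dispatcher.target parsing to ['grpc', 'localhost:5000'], so the worker registers with the dispatcher via localhost. A variant that pins the addresses explicitly would look like the following (a sketch only, reusing the same IPs and ports as above):

# Sketch: register the worker against the machine's reachable LAN address
# instead of dispatcher.target, which resolves to "localhost:5000" here.
w_config = tf.data.experimental.service.WorkerConfig(
    dispatcher_address="192.168.1.136:5000",
    worker_address="192.168.1.136:5001",
    port=5001)
worker = tf.data.experimental.service.WorkerServer(w_config)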

The local machine then runs:

python eval_app_runner.py ctc_asr_app.py /home/haolan/FastFlow/examples/ ff /home/haolan/FastFlow/examples/default_config.yaml --gpu_type=single

where default_config.yaml is:

dispatcher_addr: 192.168.1.136
dispatcher_port: 5000
num_profile_steps: 10
num_initial_steps: 5
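
I assume FastFlow turns the two dispatcher fields into a grpc://host:port service string for tf.data.experimental.service.distribute, roughly like the sketch below (the distribute call is the standard tf.data service API; the YAML loading and dataset are just for illustration, not FastFlow code):

import yaml
import tensorflow as tf

# Sketch under the assumption that FastFlow builds the service target from
# dispatcher_addr and dispatcher_port; the real wiring is inside FastFlow.
with open("default_config.yaml") as f:
    config = yaml.safe_load(f)

service = "grpc://{}:{}".format(config["dispatcher_addr"],
                                config["dispatcher_port"])

# "distributed_epoch" matches the DYNAMIC sharding_policy visible in the logs.
dataset = tf.data.Dataset.range(10).apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",
        service=service))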

However, at some point it fails to reach the local dispatcher (which, from the look of it, is itself), with the error message:

2023-12-21 18:12:30.910036: I tensorflow/core/data/service/grpc_util.cc:68] Failed to check service version: UNAVAILABLE: Failed to get dispatcher version from dispatcher running at 10.42.0.1 172.17.37.106 10.12.146.252 172.17.0.1 192.168.1.125 100.104.160.22 172.22.2.2 fd97:8600:8edd:0:215d:6f4d:96a3:ab0c fd97:8600:8edd:0:e105:6cf9:5b33:c950 fd97:8600:8edd:0:d376:e947:bac6:dc12 fd97:8600:8edd:0:c705:b135:faa7:b989 fd97:8600:8edd:0:6805:9ee2:f28a:c366 fd97:8600:8edd:0:3430:95e4:3bf:8a50 fd97:8600:8edd:0:e173:3808:e1a7:630b fd97:8600:8edd::15b fd97:8600:8edd:0:58c2:e087:eb4a:5b7 fd7a:115c:a1e0:ab12:4843:cd96:6268:a016:5000: DNS resolution failed. Will retry in 158ms.

I'm not sure what causes this. Is it because the IP addresses weren't parsed down to a single one? If so, where should I look to produce a fix?
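
In case it helps with a fix: the failing target above looks like every address of the local machine joined by spaces and handed to gRPC as a single hostname, which can only fail DNS resolution. If the fix is to pick a single routable address, one standard trick is below (a sketch only; nothing here is FastFlow API, and routable_local_ip is a hypothetical helper name):

import socket

# Ask the kernel which local interface would route to the remote dispatcher;
# a UDP connect() sends no packets, it only selects a source address.
def routable_local_ip(remote_host="192.168.1.136", remote_port=5000):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((remote_host, remote_port))
        return s.getsockname()[0]  # e.g. "192.168.1.125" from the list above
    finally:
        s.close()

print(routable_local_ip())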

Full log of the local machine:

$ python eval_app_runner.py ctc_asr_app.py /home/haolan/FastFlow/examples/ ff /home/haolan/FastFlow/examples/default_config.yaml --gpu_type=single


Args:  Namespace(app_file_path='ctc_asr_app.py', batch=1, data_prefix='/home/haolan/FastFlow/examples/', epochs=2, gpu_type=<GPUType.SINGLE: 'single'>, num_local_workers=1, offloading_type=<OffloadingType.FASTFLOW: 'ff'>, parallel=-1, yaml_path='/home/haolan/FastFlow/examples/default_config.yaml')
2023-12-21 18:10:42.553745: I tensorflow/core/data/service/dispatcher_impl.cc:192] Running with fault_tolerant_mode=False. The dispatcher will not be able to recover its state on restart.
2023-12-21 18:10:42.553759: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data DispatchServer running at 0.0.0.0:5000
Launch local worker
2023-12-21 18:10:42.566467: I tensorflow/core/data/service/worker_impl.cc:150] Worker registered with dispatcher running at 10.42.0.1:5000
2023-12-21 18:10:42.566504: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data WorkerServer running at 0.0.0.0:5001
Launch local worker
2023-12-21 18:10:42.572939: I tensorflow/core/data/service/worker_impl.cc:150] Worker registered with dispatcher running at 192.168.1.136:5000
2023-12-21 18:10:42.572975: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data WorkerServer running at 0.0.0.0:5501
2023-12-21 18:10:42.609152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-21 18:10:42.624489: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-21 18:10:43.134878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43635 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-12-21 18:10:43.135153: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-21 18:10:43.135224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 43646 MB memory:  -> device: 1, name: NVIDIA RTX A6000, pci bus id: 0000:02:00.0, compute capability: 8.6
The vocabulary is: ['', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'", '?', '!', ' '] (size =31)
Size of the training set: 11790
Size of the validation set: 1310
[build_model] input_spectrogram: KerasTensor(type_spec=TensorSpec(shape=(None, None, 193), dtype=tf.float32, name='DeepSpeech-2input'), name='DeepSpeech-2input', description="created by layer 'DeepSpeech-2input'")
() {'optimizer': <keras.optimizer_v2.adam.Adam object at 0x7f6da6d05580>, 'loss': <function CTCLoss at 0x7f6d383d6820>}
[build_model] input_spectrogram: KerasTensor(type_spec=TensorSpec(shape=(None, None, 193), dtype=tf.float32, name='DeepSpeech-2-copyinput'), name='DeepSpeech-2-copyinput', description="created by layer 'DeepSpeech-2-copyinput'")
()
{'optimizer': <keras.optimizer_v2.adam.Adam object at 0x7f6da6d05580>, 'loss': <function CTCLoss at 0x7f6d383d6820>}
() {'optimizer': <keras.optimizer_v2.adam.Adam object at 0x7f6da6d05580>, 'loss': <function CTCLoss at 0x7f6d383d6820>}
<WeakKeyDictionary at 0x7f6cf4e97fd0>
0. Dummy training
2023-12-21 18:10:50.192211: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8907
2023-12-21 18:10:51.967789: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
1/1 [==============================] - 11s 11s/step - loss: 1829.3364
Measure ProfileMetrics.LTHP
10/10 [==============================] - 47s 5s/step - loss: 526.4906
Measure ProfileMetrics.GTHP
A builder instance for a PrefechDataset is being created.
prefetch is being applied.
10/10 [==============================] - 4s 396ms/step - loss: 334.6863
Does this app have a cpu bottleneck?  Yes
Measure ProfileMetrics.RTHP
A builder instance for a PrefechDataset is being created.
A builder instance for a PaddedBatchDataset is being created.
padded batch is being applied.
prefetch is being applied.
2023-12-21 18:11:48.122148: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:300] New iterator created 1 for job 0
2023-12-21 18:11:48.122181: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:325] Connecting to 192.168.1.136:5000 in FastFlowOffloadingFetch op
2023-12-21 18:11:48.225582: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4001
2023-12-21 18:11:48.226247: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4001 with processing mode sharding_policy: DYNAMIC

2023-12-21 18:11:48.238079: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:594] Starting FastFlowOp task thread manager
10/10 [==============================] - 21s 2s/step - loss: 314.9906
2023-12-21 18:12:08.963013: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3000
2023-12-21 18:12:08.963148: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:608] Task thread manager finished
2023-12-21 18:12:08.963159: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:609] Finished.. task size 2 finished_tasks: 0 num_local_request: 0 num_remote_request: 328 outstanding: 0 results: 0
2023-12-21 18:12:08.963249: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:304] Destroying data service dataset iterator 1 for job id 3000
2023-12-21 18:12:08.963259: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3000
Measure ProfileMetrics.RTHP_BATCH
A builder instance for a PrefechDataset is being created.
prefetch is being applied.
2023-12-21 18:12:09.114261: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:300] New iterator created 1 for job 0
2023-12-21 18:12:09.114274: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:325] Connecting to 192.168.1.136:5000 in FastFlowOffloadingFetch op
2023-12-21 18:12:09.192091: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4003
2023-12-21 18:12:09.192651: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4003 with processing mode sharding_policy: DYNAMIC

2023-12-21 18:12:09.205300: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:594] Starting FastFlowOp task thread manager
10/10 [==============================] - 21s 2s/step - loss: 309.4245
2023-12-21 18:12:30.330142: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3001
2023-12-21 18:12:30.330282: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:608] Task thread manager finished
2023-12-21 18:12:30.330294: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:609] Finished.. task size 2 finished_tasks: 0 num_local_request: 0 num_remote_request: 11 outstanding: 0 results: 0
2023-12-21 18:12:30.330481: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:304] Destroying data service dataset iterator 1 for job id 3001
2023-12-21 18:12:30.330494: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3001
Measure ProfileMetrics.RTHP_MID
A builder instance for a PrefechDataset is being created.
A builder instance for a PaddedBatchDataset is being created.
A builder instance for a ParallelMapDataset is being created.
2023-12-21 18:12:30.910036: I tensorflow/core/data/service/grpc_util.cc:68] Failed to check service version: UNAVAILABLE: Failed to get dispatcher version from dispatcher running at 10.42.0.1 172.17.37.106 10.12.146.252 172.17.0.1 192.168.1.125 100.104.160.22 172.22.2.2 fd97:8600:8edd:0:215d:6f4d:96a3:ab0c fd97:8600:8edd:0:e105:6cf9:5b33:c950 fd97:8600:8edd:0:d376:e947:bac6:dc12 fd97:8600:8edd:0:c705:b135:faa7:b989 fd97:8600:8edd:0:6805:9ee2:f28a:c366 fd97:8600:8edd:0:3430:95e4:3bf:8a50 fd97:8600:8edd:0:e173:3808:e1a7:630b fd97:8600:8edd::15b fd97:8600:8edd:0:58c2:e087:eb4a:5b7 fd7a:115c:a1e0:ab12:4843:cd96:6268:a016:5000: DNS resolution failed. Will retry in 158ms.
2023-12-21 18:12:31.068586: I tensorflow/core/data/service/grpc_util.cc:68] Failed to check service version: UNAVAILABLE: Failed to get dispatcher version from dispatcher running at 10.42.0.1 172.17.37.106 10.12.146.252 172.17.0.1 192.168.1.125 100.104.160.22 172.22.2.2 fd97:8600:8edd:0:215d:6f4d:96a3:ab0c fd97:8600:8edd:0:e105:6cf9:5b33:c950 fd97:8600:8edd:0:d376:e947:bac6:dc12 fd97:8600:8edd:0:c705:b135:faa7:b989 fd97:8600:8edd:0:6805:9ee2:f28a:c366 fd97:8600:8edd:0:3430:95e4:3bf:8a50 fd97:8600:8edd:0:e173:3808:e1a7:630b fd97:8600:8edd::15b fd97:8600:8edd:0:58c2:e087:eb4a:5b7 fd7a:115c:a1e0:ab12:4843:cd96:6268:a016:5000: DNS resolution failed. Will retry in 230ms.

Full log of the remote machine:

2023-12-21 20:33:32.445696: I tensorflow/core/data/service/dispatcher_impl.cc:192] Running with fault_tolerant_mode=False. The dispatcher will not be able to recover its state on restart.
2023-12-21 20:33:32.445716: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data DispatchServer running at 0.0.0.0:5000
['grpc', 'localhost:5000']
2023-12-21 20:33:32.447133: I tensorflow/core/data/service/worker_impl.cc:150] Worker registered with dispatcher running at localhost:5000
2023-12-21 20:33:32.447200: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data WorkerServer running at 0.0.0.0:5001
2023-12-21 20:34:49.156332: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-21 20:34:49.199302: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4000 to worker 192.168.1.136:5001
2023-12-21 20:34:49.207790: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4000
2023-12-21 20:34:49.209143: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4000 with processing mode sharding_policy: DYNAMIC

2023-12-21 20:34:49.209399: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4000 to worker 192.168.1.136:5001
2023-12-21 20:34:49.209630: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4001 to worker 10.42.0.1:5501
2023-12-21 20:34:49.239816: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4001 to worker 10.42.0.1:5501
2023-12-21 20:35:10.186330: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4002 to worker 192.168.1.136:5001
2023-12-21 20:35:10.191330: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4002
2023-12-21 20:35:10.193713: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4002 with processing mode sharding_policy: DYNAMIC

2023-12-21 20:35:10.194258: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4002 to worker 192.168.1.136:5001
2023-12-21 20:35:10.194532: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4003 to worker 10.42.0.1:5501
2023-12-21 20:35:10.221530: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4003 to worker 10.42.0.1:5501