Bug description
Trying to use a TPU in Kaggle and receiving the error "RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1."
I am new to machine learning, so please tell me if I am making mistakes. I am using 8 TPU cores; here is my Trainer:
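(A minimal sketch, not the exact notebook code: the model, hyperparameter, and data-module names are taken from the traceback below, while the hyperparameter values and every Trainer argument other than the 8 TPU cores are placeholders.)

```python
import pytorch_lightning as pl

# Placeholder hyperparameter values; the real ones are not part of this report.
hyperparameters = {
    "input_size": 768,
    "linear_hidden_size": 256,
    "context_length": 128,
}

# ToxicCommentModel and data_module are defined elsewhere in the notebook;
# their names appear in the traceback below.
model = ToxicCommentModel(
    input_size=hyperparameters["input_size"],
    hidden_size=hyperparameters["linear_hidden_size"],
    max_len=hyperparameters["context_length"],
)

trainer = pl.Trainer(
    accelerator="tpu",  # run on the Kaggle TPU
    devices=8,          # all 8 TPU cores
    max_epochs=3,       # placeholder
)

trainer.fit(model, data_module)
```

With `devices=8`, Lightning's `_XLALauncher` spawns one process per TPU core through `xmp.spawn`, which is where the RuntimeError below is raised.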
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.302361 2870 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8476 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.407367 2874 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.442340 2878 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.453311 2882 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""Traceback (most recent call last): File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker r = call_item.fn(*call_item.args, **call_item.kwargs) File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk return [fn(*args) for args in chunk] File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp> return [fn(*args) for args in chunk] File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 59, in _run_thread_per_device initializer_fn(local_rank, local_world_size) File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 125, in initialize_multiprocess devices = xm.get_xla_supported_devices() File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 99, in get_xla_supported_devices devices = torch_xla._XLAC._xla_get_devices()RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1."""
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[47], line 12
1 model = ToxicCommentModel(
2 input_size=hyperparameters["input_size"],
3 hidden_size=hyperparameters["linear_hidden_size"],
(...)
10 max_len=hyperparameters["context_length"]
11 )
---> 12 trainer.fit(model, data_module)
File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
536 self.state.status = TrainerStatus.RUNNING
537 self.training = True
--> 538 call._call_and_handle_interrupt(
539 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
540 )
File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
44 try:
45 if trainer.strategy.launcher is not None:
---> 46 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
47 return trainer_fn(*args, **kwargs)
49 except _TunerExitException:
File /usr/local/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/xla.py:98, in _XLALauncher.launch(self, function, trainer, *args, **kwargs)
93 if nprocs == 1:
94 # avoid warning: "Unsupported nprocs". If it's 1, it will call the launched function directly.
95 # otherwise it will use all devices
96 spawn_kwargs["nprocs"] = nprocs
---> 98 process_context = xmp.spawn(
99 self._wrapping_function,
100 args=(trainer, function, args, kwargs, return_queue),
101 start_method=self._start_method,
102 join=False, # we will join ourselves to get the process references
103 **spawn_kwargs,
104 )
105 # xla will not actually create processes if only 1 device
106 if process_context is not None:
File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
91 if not using_pjrt():
92 raise NotImplementedError('`{}` not implemented for XRT'.format(
93 fn.__name__))
---> 95 return fn(*args, **kwargs)
File /usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py:38, in spawn(fn, args, nprocs, join, daemon, start_method)
6 @xr.requires_pjrt
7 def spawn(fn,
8 args=(),
(...)
11 daemon=False,
12 start_method='spawn'):
13 """Enables multi processing based replication. 14 15 Args: (...) 36 return None. 37 """
---> 38 return pjrt.spawn(fn, nprocs, start_method, args)
File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:214, in spawn(fn, nprocs, start_method, args)
211 elif nprocs is not None:
212 logging.warning('Unsupported nprocs (%d), ignoring...' % nprocs)
--> 214 run_multiprocess(spawn_fn, start_method=start_method)
File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
91 if not using_pjrt():
92 raise NotImplementedError('`{}` not implemented for XRT'.format(
93 fn.__name__))
---> 95 return fn(*args, **kwargs)
File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:174, in run_multiprocess(fn, start_method, *args, **kwargs)
168 mp_fn = functools.partial(
169 _run_thread_per_device,
170 local_world_size=num_processes,
171 fn=functools.partial(fn, *args, **kwargs),
172 initializer_fn=initialize_multiprocess)
173 process_results = executor.map(mp_fn, range(num_processes))
--> 174 replica_results = list(
175 itertools.chain.from_iterable(
176 result.items() for result in process_results))
178 return _merge_replica_results(replica_results)
File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:175, in <genexpr>(.0)
168 mp_fn = functools.partial(
169 _run_thread_per_device,
170 local_world_size=num_processes,
171 fn=functools.partial(fn, *args, **kwargs),
172 initializer_fn=initialize_multiprocess)
173 process_results = executor.map(mp_fn, range(num_processes))
174 replica_results = list(
--> 175 itertools.chain.from_iterable(
176 result.items() for result in process_results))
178 return _merge_replica_results(replica_results)
File /usr/local/lib/python3.10/concurrent/futures/process.py:575, in _chain_from_iterable_of_lists(iterable)
569 def _chain_from_iterable_of_lists(iterable):
570 """ 571 Specialized implementation of itertools.chain.from_iterable. 572 Each item in *iterable* should be a list. This function is 573 careful not to keep references to yielded objects. 574 """
--> 575 for element in iterable:
576 element.reverse()
577 while element:
File /usr/local/lib/python3.10/concurrent/futures/_base.py:621, in Executor.map.<locals>.result_iterator()
618 while fs:
619 # Careful not to keep a reference to the popped future
620 if timeout is None:
--> 621 yield _result_or_cancel(fs.pop())
622 else:
623 yield _result_or_cancel(fs.pop(), end_time - time.monotonic())
File /usr/local/lib/python3.10/concurrent/futures/_base.py:319, in _result_or_cancel(***failed resolving arguments***)
317 try:
318 try:
--> 319 return fut.result(timeout)
320 finally:
321 fut.cancel()
File /usr/local/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
456 raise CancelledError()
457 elif self._state == FINISHED:
--> 458 return self.__get_result()
459 else:
460 raise TimeoutError()
File /usr/local/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
401 if self._exception:
402 try:
--> 403 raise self._exception
404 finally:
405 # Break a reference cycle with the exception in self._exception
406 self = None
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
Environment
Current environment
More info
No response

This is not a Lightning bug. I had exactly the same error on a Kaggle TPU v3-8 and found the fix in the Kaggle product feedback discussion: https://www.kaggle.com/discussions/product-feedback/473974
tl;dr: remove the offending environment variable with os.environ.pop('TPU_PROCESS_ADDRESSES').
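For reference, a sketch of the workaround as it would sit at the top of the notebook, before the Trainer is created; the only addition beyond the linked discussion is the `None` default passed to `pop`, which avoids a `KeyError` if the variable is already unset:

```python
import os

# The logs above show `tpu_process_addresses`="local"; per the linked Kaggle
# discussion, removing this variable before the TPU is initialized avoids the
# "Expected 4 worker addresses, got 1" failure.
os.environ.pop("TPU_PROCESS_ADDRESSES", None)
```

The variable has to be cleared before `trainer.fit()` triggers the XLA process spawn shown in the traceback, since the spawned workers inherit the notebook's environment.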