RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1. #20244

Open
Bhargav230m opened this issue Sep 3, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x

Comments

@Bhargav230m

Bug description

I am trying to use a TPU on Kaggle and get the following error: "RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1."

I am using 8 TPU cores. Here is my Trainer:

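import pytorch_lightning as pl
from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=50,
    accelerator="tpu",
    devices=8,
    # stop early once val_loss has not improved for 2 consecutive validation checks
    callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=2)]
)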

I am new to machine learning, so please tell me if I have made a mistake.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.302361    2870 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8476 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.407367    2874 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.442340    2878 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.453311    2882 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:483
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 59, in _run_thread_per_device
    initializer_fn(local_rank, local_world_size)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 125, in initialize_multiprocess
    devices = xm.get_xla_supported_devices()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 99, in get_xla_supported_devices
    devices = torch_xla._XLAC._xla_get_devices()
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
"""

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[47], line 12
      1 model = ToxicCommentModel(
      2     input_size=hyperparameters["input_size"], 
      3     hidden_size=hyperparameters["linear_hidden_size"],  
   (...)
     10     max_len=hyperparameters["context_length"]
     11 )
---> 12 trainer.fit(model, data_module)

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536 self.state.status = TrainerStatus.RUNNING
    537 self.training = True
--> 538 call._call_and_handle_interrupt(
    539     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540 )

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     44 try:
     45     if trainer.strategy.launcher is not None:
---> 46         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     47     return trainer_fn(*args, **kwargs)
     49 except _TunerExitException:

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/xla.py:98, in _XLALauncher.launch(self, function, trainer, *args, **kwargs)
     93 if nprocs == 1:
     94     # avoid warning: "Unsupported nprocs". If it's 1, it will call the launched function directly.
     95     # otherwise it will use all devices
     96     spawn_kwargs["nprocs"] = nprocs
---> 98 process_context = xmp.spawn(
     99     self._wrapping_function,
    100     args=(trainer, function, args, kwargs, return_queue),
    101     start_method=self._start_method,
    102     join=False,  # we will join ourselves to get the process references
    103     **spawn_kwargs,
    104 )
    105 # xla will not actually create processes if only 1 device
    106 if process_context is not None:

File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
     91 if not using_pjrt():
     92   raise NotImplementedError('`{}` not implemented for XRT'.format(
     93       fn.__name__))
---> 95 return fn(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py:38, in spawn(fn, args, nprocs, join, daemon, start_method)
      6 @xr.requires_pjrt
      7 def spawn(fn,
      8           args=(),
   (...)
     11           daemon=False,
     12           start_method='spawn'):
     13   """Enables multi processing based replication.
     14 
     15   Args:
   (...)
     36     return None.
     37   """
---> 38   return pjrt.spawn(fn, nprocs, start_method, args)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:214, in spawn(fn, nprocs, start_method, args)
    211 elif nprocs is not None:
    212   logging.warning('Unsupported nprocs (%d), ignoring...' % nprocs)
--> 214 run_multiprocess(spawn_fn, start_method=start_method)

File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
     91 if not using_pjrt():
     92   raise NotImplementedError('`{}` not implemented for XRT'.format(
     93       fn.__name__))
---> 95 return fn(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:174, in run_multiprocess(fn, start_method, *args, **kwargs)
    168   mp_fn = functools.partial(
    169       _run_thread_per_device,
    170       local_world_size=num_processes,
    171       fn=functools.partial(fn, *args, **kwargs),
    172       initializer_fn=initialize_multiprocess)
    173   process_results = executor.map(mp_fn, range(num_processes))
--> 174   replica_results = list(
    175       itertools.chain.from_iterable(
    176           result.items() for result in process_results))
    178 return _merge_replica_results(replica_results)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:175, in <genexpr>(.0)
    168   mp_fn = functools.partial(
    169       _run_thread_per_device,
    170       local_world_size=num_processes,
    171       fn=functools.partial(fn, *args, **kwargs),
    172       initializer_fn=initialize_multiprocess)
    173   process_results = executor.map(mp_fn, range(num_processes))
    174   replica_results = list(
--> 175       itertools.chain.from_iterable(
    176           result.items() for result in process_results))
    178 return _merge_replica_results(replica_results)

File /usr/local/lib/python3.10/concurrent/futures/process.py:575, in _chain_from_iterable_of_lists(iterable)
    569 def _chain_from_iterable_of_lists(iterable):
    570     """
    571     Specialized implementation of itertools.chain.from_iterable.
    572     Each item in *iterable* should be a list.  This function is
    573     careful not to keep references to yielded objects.
    574     """
--> 575     for element in iterable:
    576         element.reverse()
    577         while element:

File /usr/local/lib/python3.10/concurrent/futures/_base.py:621, in Executor.map.<locals>.result_iterator()
    618 while fs:
    619     # Careful not to keep a reference to the popped future
    620     if timeout is None:
--> 621         yield _result_or_cancel(fs.pop())
    622     else:
    623         yield _result_or_cancel(fs.pop(), end_time - time.monotonic())

File /usr/local/lib/python3.10/concurrent/futures/_base.py:319, in _result_or_cancel(***failed resolving arguments***)
    317 try:
    318     try:
--> 319         return fut.result(timeout)
    320     finally:
    321         fut.cancel()

File /usr/local/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
    456     raise CancelledError()
    457 elif self._state == FINISHED:
--> 458     return self.__get_result()
    459 else:
    460     raise TimeoutError()

File /usr/local/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.

Environment

Current environment
  • CUDA:
    • GPU: None
    • available: False
    • version: 12.1
  • Lightning:
    • lightning-utilities: 0.11.7
    • pytorch-lightning: 2.4.0
    • torch: 2.4.0
    • torch-xla: 2.4.0+libtpu
    • torchaudio: 2.4.0
    • torchmetrics: 1.4.1
    • torchvision: 0.19.0
  • Packages:
    • absl-py: 2.1.0
    • accelerate: 0.33.0
    • aiofiles: 22.1.0
    • aiohappyeyeballs: 2.4.0
    • aiohttp: 3.10.5
    • aiosignal: 1.3.1
    • aiosqlite: 0.20.0
    • albucore: 0.0.13
    • albumentations: 1.4.14
    • annotated-types: 0.7.0
    • ansicolors: 1.1.8
    • anyio: 4.4.0
    • argon2-cffi: 23.1.0
    • argon2-cffi-bindings: 21.2.0
    • array-record: 0.5.1
    • arrow: 1.3.0
    • astroid: 3.2.4
    • asttokens: 2.4.1
    • astunparse: 1.6.3
    • async-timeout: 4.0.3
    • attrs: 24.2.0
    • audioread: 3.0.1
    • autopep8: 2.0.4
    • babel: 2.16.0
    • beautifulsoup4: 4.12.3
    • bleach: 6.1.0
    • blis: 0.7.11
    • cachetools: 5.5.0
    • catalogue: 2.0.10
    • certifi: 2024.7.4
    • cffi: 1.17.0
    • charset-normalizer: 3.3.2
    • chex: 0.1.86
    • click: 8.1.7
    • cloud-tpu-client: 0.10
    • cloudpathlib: 0.19.0
    • cloudpickle: 3.0.0
    • comm: 0.2.2
    • confection: 0.1.5
    • contourpy: 1.2.1
    • cramjam: 2.8.3
    • cycler: 0.12.1
    • cymem: 2.0.8
    • debugpy: 1.8.5
    • decorator: 5.1.1
    • defusedxml: 0.7.1
    • diffusers: 0.30.0
    • dill: 0.3.8
    • distrax: 0.1.5
    • dm-haiku: 0.0.13.dev0
    • dm-tree: 0.1.8
    • docstring-parser: 0.16
    • docstring-to-markdown: 0.15
    • einops: 0.8.0
    • en-core-web-sm: 3.7.1
    • entrypoints: 0.4
    • etils: 1.7.0
    • eval-type-backport: 0.2.0
    • exceptiongroup: 1.2.2
    • executing: 2.0.1
    • fastjsonschema: 2.20.0
    • fastparquet: 2024.5.0
    • filelock: 3.15.4
    • flake8: 7.0.0
    • flatbuffers: 24.3.25
    • flax: 0.8.4
    • fonttools: 4.53.1
    • fqdn: 1.5.1
    • frozenlist: 1.4.1
    • fsspec: 2024.6.1
    • funcsigs: 1.0.2
    • gast: 0.6.0
    • gin-config: 0.5.0
    • google-api-core: 1.34.1
    • google-api-python-client: 1.8.0
    • google-auth: 2.34.0
    • google-auth-httplib2: 0.2.0
    • google-pasta: 0.2.0
    • googleapis-common-protos: 1.63.2
    • grpcio: 1.65.5
    • gym: 0.26.2
    • gym-notices: 0.0.8
    • h5py: 3.11.0
    • httplib2: 0.22.0
    • huggingface-hub: 0.24.6
    • idna: 3.7
    • imageio: 2.35.1
    • immutabledict: 4.2.0
    • importlib-metadata: 8.3.0
    • importlib-resources: 6.4.3
    • ipykernel: 6.29.5
    • ipython: 8.26.0
    • ipython-genutils: 0.2.0
    • isoduration: 20.11.0
    • isort: 5.13.2
    • jax: 0.4.23
    • jaxlib: 0.4.23
    • jedi: 0.19.1
    • jinja2: 3.1.4
    • jmp: 0.0.4
    • joblib: 1.4.2
    • jraph: 0.0.6.dev0
    • json5: 0.9.25
    • jsonpointer: 3.0.0
    • jsonschema: 4.23.0
    • jsonschema-specifications: 2023.12.1
    • jupyter-client: 7.4.9
    • jupyter-core: 5.7.2
    • jupyter-events: 0.10.0
    • jupyter-lsp: 1.5.1
    • jupyter-server: 2.14.2
    • jupyter-server-fileid: 0.9.2
    • jupyter-server-terminals: 0.5.3
    • jupyter-server-ydoc: 0.8.0
    • jupyter-ydoc: 0.2.5
    • jupyterlab: 3.6.7
    • jupyterlab-pygments: 0.3.0
    • jupyterlab-server: 2.27.3
    • kagglehub: 0.2.9
    • keras: 3.5.0
    • keras-core: 0.1.7
    • keras-cv: 0.9.0
    • keras-nlp: 0.14.4
    • kiwisolver: 1.4.5
    • langcodes: 3.4.0
    • language-data: 1.2.0
    • lazy-loader: 0.4
    • libclang: 18.1.1
    • librosa: 0.10.2.post1
    • libtpu-nightly: 0.1.dev20231213
    • lightning-utilities: 0.11.7
    • llvmlite: 0.43.0
    • marisa-trie: 1.2.0
    • markdown: 3.7
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.5
    • matplotlib: 3.9.2
    • matplotlib-inline: 0.1.7
    • mccabe: 0.7.0
    • mdurl: 0.1.2
    • mistune: 3.0.2
    • ml-dtypes: 0.3.2
    • mpmath: 1.3.0
    • msgpack: 1.0.8
    • multidict: 6.0.5
    • murmurhash: 1.0.10
    • namex: 0.0.8
    • nbclassic: 1.1.0
    • nbclient: 0.10.0
    • nbconvert: 7.16.4
    • nbformat: 5.10.4
    • nest-asyncio: 1.6.0
    • networkx: 3.3
    • notebook: 6.5.7
    • notebook-shim: 0.2.4
    • numba: 0.60.0
    • numpy: 1.26.4
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 9.1.0.70
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.6.20
    • nvidia-nvtx-cu12: 12.1.105
    • oauth2client: 4.1.3
    • opencv-python: 4.10.0.84
    • opencv-python-headless: 4.10.0.84
    • opt-einsum: 3.3.0
    • optax: 0.2.2
    • optree: 0.12.1
    • orbax-checkpoint: 0.5.16
    • overrides: 7.7.0
    • packaging: 24.1
    • pandas: 2.2.2
    • pandocfilters: 1.5.1
    • papermill: 2.6.0
    • parso: 0.8.4
    • pexpect: 4.9.0
    • pillow: 10.4.0
    • pip: 23.0.1
    • platformdirs: 4.2.2
    • pluggy: 1.5.0
    • pooch: 1.8.2
    • preshed: 3.0.9
    • prometheus-client: 0.20.0
    • promise: 2.3
    • prompt-toolkit: 3.0.47
    • protobuf: 3.20.3
    • psutil: 6.0.0
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.3
    • pyarrow: 17.0.0
    • pyasn1: 0.6.0
    • pyasn1-modules: 0.4.0
    • pycodestyle: 2.11.1
    • pycparser: 2.22
    • pydantic: 2.8.2
    • pydantic-core: 2.20.1
    • pydocstyle: 6.3.0
    • pyflakes: 3.2.0
    • pygments: 2.18.0
    • pylint: 3.2.6
    • pyparsing: 3.1.2
    • python-dateutil: 2.9.0.post0
    • python-json-logger: 2.0.7
    • python-lsp-jsonrpc: 1.1.2
    • python-lsp-server: 1.11.0
    • pytoolconfig: 1.3.1
    • pytorch-lightning: 2.4.0
    • pytz: 2024.1
    • pyyaml: 6.0.2
    • pyzmq: 26.1.1
    • referencing: 0.35.1
    • regex: 2024.7.24
    • requests: 2.32.3
    • rfc3339-validator: 0.1.4
    • rfc3986-validator: 0.1.1
    • rich: 13.7.1
    • rope: 1.13.0
    • rpds-py: 0.20.0
    • rsa: 4.9
    • safetensors: 0.4.4
    • scikit-image: 0.24.0
    • scikit-learn: 1.5.1
    • scipy: 1.14.0
    • seaborn: 0.13.2
    • send2trash: 1.8.3
    • setuptools: 65.5.1
    • shellingham: 1.5.4
    • simple-parsing: 0.1.5
    • six: 1.16.0
    • smart-open: 7.0.4
    • sniffio: 1.3.1
    • snowballstemmer: 2.2.0
    • soundfile: 0.12.1
    • soupsieve: 2.6
    • soxr: 0.4.0
    • spacy: 3.7.6
    • spacy-legacy: 3.0.12
    • spacy-loggers: 1.0.5
    • srsly: 2.4.8
    • stack-data: 0.6.3
    • sympy: 1.13.2
    • tabulate: 0.9.0
    • tenacity: 9.0.0
    • tensorboard: 2.17.1
    • tensorboard-data-server: 0.7.2
    • tensorflow-cpu: 2.17.0
    • tensorflow-datasets: 4.9.6
    • tensorflow-hub: 0.16.1
    • tensorflow-io: 0.37.1
    • tensorflow-io-gcs-filesystem: 0.37.1
    • tensorflow-metadata: 1.15.0
    • tensorflow-probability: 0.24.0
    • tensorflow-text: 2.16.1
    • tensorstore: 0.1.64
    • termcolor: 2.4.0
    • terminado: 0.18.1
    • tf-keras: 2.16.0
    • thinc: 8.2.5
    • threadpoolctl: 3.5.0
    • tifffile: 2024.8.10
    • timm: 1.0.8
    • tinycss2: 1.3.0
    • tokenizers: 0.19.1
    • toml: 0.10.2
    • tomli: 2.0.1
    • tomlkit: 0.13.2
    • toolz: 0.12.1
    • torch: 2.4.0
    • torch-xla: 2.4.0+libtpu
    • torchaudio: 2.4.0
    • torchmetrics: 1.4.1
    • torchvision: 0.19.0
    • tornado: 6.4.1
    • tqdm: 4.66.5
    • traitlets: 5.14.3
    • transformers: 4.44.0
    • trax: 1.4.1
    • triton: 3.0.0
    • typer: 0.12.5
    • types-python-dateutil: 2.9.0.20240316
    • typing-extensions: 4.12.2
    • tzdata: 2024.1
    • ujson: 5.10.0
    • uri-template: 1.3.0
    • uritemplate: 3.0.1
    • urllib3: 2.2.2
    • wasabi: 1.1.3
    • wcwidth: 0.2.13
    • weasel: 0.4.1
    • webcolors: 24.8.0
    • webencodings: 0.5.1
    • websocket-client: 1.8.0
    • werkzeug: 3.0.3
    • whatthepatch: 1.0.6
    • wheel: 0.44.0
    • wrapt: 1.16.0
    • y-py: 0.6.2
    • yapf: 0.40.2
    • yarl: 1.9.7
    • ypy-websocket: 0.8.4
    • zipp: 3.20.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.10.14
    • release: 6.1.42+
    • version: #1 SMP PREEMPT_DYNAMIC Sun Oct 8 14:23:56 UTC 2023

More info

No response

@Bhargav230m

anyone?

@ibinti

ibinti commented Sep 8, 2024

> anyone?

This is not a Lightning bug. I had exactly the same error on a Kaggle TPU v3-8 and found the fix in the Kaggle product feedback discussion: https://www.kaggle.com/discussions/product-feedback/473974
tl;dr: remove the offending environment variable with os.environ.pop('TPU_PROCESS_ADDRESSES')
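
For anyone applying this in a notebook, here is a minimal sketch of the workaround, not an official Lightning or torch_xla API. The helper name `clear_tpu_env` and its `environ` parameter are made up for illustration. It assumes the cell runs before the Trainer is created and before anything initializes the TPU runtime, and it uses a default in `pop` so that re-running the cell does not raise `KeyError` once the variable is already gone.

```python
import os


def clear_tpu_env(names=("TPU_PROCESS_ADDRESSES",), environ=None):
    """Remove the given variables from the environment.

    Returns a dict of the variables that were actually removed, so the
    previous values can be inspected or restored later.
    `environ` defaults to os.environ; passing a plain dict is only meant
    for trying the helper out without touching the real environment.
    """
    env = os.environ if environ is None else environ
    removed = {}
    for name in names:
        # pop with a default avoids KeyError when the variable is not set
        value = env.pop(name, None)
        if value is not None:
            removed[name] = value
    return removed


# Run this before building the Trainer / calling trainer.fit(...)
removed = clear_tpu_env()
print("removed:", removed)
```

The error log in the issue reports `tpu_process_addresses`="local", so on an affected Kaggle session the printed dict would presumably show that value. If nothing is removed, the variable was not set and the cause is likely something else.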
