
When caching is enabled, also enable XLA caching features as well #22899

Open · wants to merge 1 commit into base: main

Conversation

@trevor-m commented Aug 6, 2024

This PR makes it easier to enable all of the caching features in JAX and XLA with a single option. Now, when the JAX persistent cache is enabled (`JAX_COMPILATION_CACHE_DIR`), some XLA caching features will also be enabled, writing to subdirectories of the JAX cache directory. The set of XLA caching features to use can be selected via `JAX_PERSISTENT_CACHE_ENABLE_XLA_CACHES`.

Currently, there is an issue related to kernel naming when both `xla_gpu_kernel_cache_file` and the JAX persistent cache are enabled together, so only the autotune cache is enabled by default for now. Once this is fixed, the default value of `JAX_PERSISTENT_CACHE_ENABLE_XLA_CACHES` should become `all`.
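
As a rough sketch (not part of this PR's diff), assuming the Python config names mirror the environment variables above and that the autotune-only value is spelled `xla_gpu_per_fusion_autotune_cache_dir`, usage might look like:

```python
import jax

# Enable the JAX persistent compilation cache; equivalent to setting
# the JAX_COMPILATION_CACHE_DIR environment variable. The path is
# illustrative.
jax.config.update("jax_compilation_cache_dir", "/tmp/jax_cache")

# Opt in to XLA caching features as well. The value below (autotune
# cache only, matching the current default described above) is an
# assumption about the option's spelling.
jax.config.update("jax_persistent_cache_enable_xla_caches",
                  "xla_gpu_per_fusion_autotune_cache_dir")
```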

Requires openxla/xla#15636
Requires openxla/xla#18450

@trevor-m (Author) commented Aug 6, 2024

cc @nouiz

Review thread on docs/persistent_compilation_cache.md (outdated):

* `none`: don't enable any extra XLA caching features

* `xla_gpu_kernel_cache_file`: only enable the kernel cache

Collaborator:

What is the issue with that one? Should we document it here?
Is it just that it doesn't work, or does it give hash collisions, or crash?

@trevor-m (Author) replied:

There is a crash which looks like this:

```
2024-07-30 18:04:57.120490: I external/xla/xla/stream_executor/cuda/cuda_executor.cc:272] getting function input_concatenate_fusion from module 0x55d95b91f8d0
E0730 18:04:57.120755   52461 pjrt_stream_executor_client.cc:3067] Execution of replica 0 failed: NOT_FOUND: Failed to get module function: CUDA_ERROR_NOT_FOUND: named symbol not found
```

@sergachev is taking a look at it and found it can be reproduced with `bazel test --test_env=XLA_FLAGS="--xla_gpu_enable_llvm_module_compilation_parallelism --xla_gpu_kernel_cache_file=/dev/shm/xla.kernel.cache" tests/compilation_cache_test_gpu`.

Collaborator:

Note, that issue is fixed in XLA: openxla/xla#15998.
But I think it is better not to enable all XLA caches at the same time, for better testing.
We can have a follow-up PR to expand it after the next JAX release.

@nouiz (Collaborator) commented Aug 20, 2024

The required XLA PR is merged: openxla/xla#15636
@hawkinsp can you review this PR?

@nouiz requested a review from @hawkinsp on August 20, 2024
@hawkinsp (Collaborator) left a comment:

The change is fine, but there are CI failures (possibly stale).

@trevor-m (Author) replied:

> The change is fine, but there are CI failures (possibly stale).

@hawkinsp Thanks for reviewing! I've rebased, which should fix the CI failures.

@hawkinsp (Collaborator):

One more thing: please squash your commits.

@mattjj added the `pull ready` label (Ready for copybara import and testing) on Oct 16, 2024
@dfm (Collaborator) commented Oct 16, 2024

@trevor-m — Thanks for your patience here! Can you rebase your PR onto the current main branch? We'll get this in ASAP after that. Thanks!

@trevor-m (Author) replied:

@dfm Thanks for looking at this. However, we may need to hold off on merging this a bit longer. We think there will be issues when using this feature with multi-host setups. To solve it, we can set `xla_gpu_experimental_autotune_cache_mode` to `update` for rank 0 only and to `read` for the other ranks. We will need to expose that flag in the XLA Python bindings first.

We will need to do something similar for the kernel cache.
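
A minimal sketch of the strategy described above, assuming the mode is chosen from the JAX process index (how the resulting value is plumbed into XLA's debug options is not shown and is an assumption):

```python
import jax

def autotune_cache_mode() -> str:
    # Only one process should write to the shared autotune cache;
    # all other processes open it read-only to avoid write conflicts.
    return "update" if jax.process_index() == 0 else "read"

# The returned mode would then be passed to XLA through the
# xla_gpu_experimental_autotune_cache_mode debug option.
```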

@trevor-m (Author) commented:

@dfm I've opened openxla/xla#18450 to expose the cache mode, and updated this PR to set it to `update` for process 0 and `read` for the other processes. I confirmed this fixes the multi-host issue.

@dfm self-assigned this on Oct 17, 2024
Review threads on jax/_src/compiler.py and jax/_src/config.py (outdated, resolved).
Commits:
- Add unit test
- Fix typechecker
- Set caching mode depending on process id