[JAX][DOC] Add optimizer state offloading doc #28988

zhenying-liu · 2025-05-23T20:43:42Z

Add an optimizer state offloading section with a code example.
Add a memory usage comparison of activation, parameter, and optimizer state offloading against their baseline implementations. The memory stats were collected on a GPU.

zhenying-liu · 2025-05-23T20:48:02Z

Colab link: https://colab.research.google.com/drive/1Pq-kgJh_j0cWr_501SboqscjwLf8Xas3

yashk2810 · 2025-05-27T15:45:29Z

docs/notebooks/host-offloading.md


-By applying offloading strategies, you can better manage memory resources and reduce memory pressure on your devices. To implement these strategies effectively, you'll need to understand JAX's core mechanisms for data placement and movement.
+By applying offloading strategies, you can better manage memory resources and reduce memory pressure on your devices. To implement these strategies effectively, you'll need to understand JAX's core mechanisms for data placement and movement. However, offloading may degrade performance due to memory transfers between host and device, so it's important to consider this trade-off when designing your optimization strategy.


I wouldn't write this statement. With good overlap, you might not see any degradation, right?

We did see significant degradation on GPUs, especially with parameter and optimizer state offloading. Activation offloading was optimized recently, but it still performs worse than no offloading, so we want to tell the user about this.

Instead of removing this sentence completely, can we make the user aware of the performance concern? @yashk2810 @jreiffers
"Note that offloading performance may vary significantly across device types."

You mention this below in the optimizer offloading section which is fine. No need to mention it again at the top. WDYT?

Got it. Removed it here at the top and mentioned it in the "Limitations of Parameter Offloading" section.

yashk2810 · 2025-05-27T16:04:02Z

docs/notebooks/host-offloading.md

+
+### Basic Implementation
+
+In this section, you will implement a simple model with the Adam optimizer. This implementation will help you understand the baseline behavior before exploring optimizer state offloading. It is particularly useful for understanding memory patterns in large-scale neural network training.


"In this section, let's implement a simple model with the Adam optimizer". In general, prefer not to use "you".

Done. "You" is no longer used in this document.
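
For context on the baseline this thread refers to, here is a minimal sketch of an Adam update written by hand in plain JAX. It is not the code from the PR or the Colab: the quadratic `loss_fn`, the hyperparameters, and the name `adam_step` are all illustrative assumptions. It is meant only to show that Adam keeps two extra arrays (`m` and `v`) for every parameter array, which is the state that optimizer state offloading would move to host memory.

```python
import jax
import jax.numpy as jnp


def loss_fn(w):
    # Hypothetical toy objective with its minimum at w == 3.
    return jnp.sum((w - 3.0) ** 2)


@jax.jit
def adam_step(w, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    g = jax.grad(loss_fn)(w)
    # First- and second-moment estimates: two extra arrays shaped like w.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (jnp.sqrt(v_hat) + eps)
    return w, m, v


w = jnp.zeros((4,))
m = jnp.zeros_like(w)
v = jnp.zeros_like(w)
initial_loss = loss_fn(w)

for t in range(1, 101):
    w, m, v = adam_step(w, m, v, jnp.float32(t))

final_loss = loss_fn(w)
```

In this baseline everything stays in device memory, so the optimizer state roughly triples the memory needed for the parameters alone.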

yashk2810 · 2025-05-28T17:04:34Z

docs/notebooks/host-offloading.md

+This implementation demonstrates how to:
+1. Set up sharding specifications for `device` and `pinned_host`
+2. Move optimizer states between host and device memory via {func}`jax.device_put`
+3. Use `in_sharding` and `out_shardings` to ensure proper memory placement


in_sharding shouldn't be used. Can you remove that please? The arrays should be on the correct memory kind and sharding.

Removed all the in_shardings. Verified in the Colab that all the code still runs as expected.
So in_shardings is indeed redundant.
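
The point settled in this thread can be sketched as follows: inputs are placed on the intended memory kind with `jax.device_put` before the call, and only `out_shardings` is passed to `jax.jit`. This is a hedged illustration, not the PR's code. It uses SGD with momentum instead of Adam to stay short, the function name `sgd_momentum_step` and the constants are hypothetical, and it assumes that only GPU/TPU backends exposing both `device` and `pinned_host` memory kinds can offload; elsewhere it falls back to default placement so the snippet still runs without offloading anything.

```python
import jax
import jax.numpy as jnp
from jax.sharding import SingleDeviceSharding

dev = jax.devices()[0]

# Assumption: treat only GPU/TPU backends that report both memory kinds as
# offload-capable; otherwise use the default placement for both shardings.
can_offload = jax.default_backend() in ("gpu", "tpu") and (
    {"device", "pinned_host"} <= {mem.kind for mem in dev.addressable_memories()}
)

if can_offload:
    s_dev = SingleDeviceSharding(dev, memory_kind="device")
    s_host = SingleDeviceSharding(dev, memory_kind="pinned_host")
else:
    s_dev = s_host = SingleDeviceSharding(dev)


def sgd_momentum_step(params, momentum, grads):
    # Bring the offloaded optimizer state into device memory for the update.
    mom = jax.device_put(momentum, s_dev)
    mom = 0.9 * mom + grads
    new_params = params - 0.1 * mom
    return new_params, mom


# No in_shardings: the inputs already carry their memory kind and sharding.
# out_shardings sends the updated optimizer state back to host memory.
step = jax.jit(sgd_momentum_step, out_shardings=(s_dev, s_host))

params = jax.device_put(jnp.ones((4,)), s_dev)
momentum = jax.device_put(jnp.zeros((4,)), s_host)
grads = jax.device_put(jnp.full((4,), 0.5), s_dev)

params, momentum = step(params, momentum, grads)
```

On an offload-capable backend, `momentum` lives in pinned host memory before and after the call and is in device memory only while the update runs.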

yashk2810 requested changes May 27, 2025

zhenying-liu force-pushed the opt_offload branch 6 times, most recently from 712601b to d7dc38e Compare May 28, 2025 00:00

yashk2810 approved these changes May 28, 2025

google-ml-butler bot added the kokoro:force-run and pull ready (Ready for copybara import and testing) labels May 28, 2025

yashk2810 reviewed May 28, 2025

zhenying-liu added 4 commits May 28, 2025 13:29

Add optimizer state offloading doc

d1ab89e

Fix the layer function names and not use "you"

f961c53

Fix the performance hints

cb9ba76

remove in_shardings

4971813

zhenying-liu force-pushed the opt_offload branch from d7dc38e to 4971813 Compare May 28, 2025 20:29
