[Mosaic GPU] Add CUPTI profiler alongside events-based implementation #24805

andportnoy · 2024-11-08T21:29:32Z

No description provided.

apaszke

Looks great!

apaszke · 2024-12-05T11:13:54Z

jaxlib/mosaic/gpu/mosaic_gpu_ext.cc

+  // take ownership of the buffer once CUPTI is done using it
+  std::unique_ptr<uint8_t> buffer = absl::WrapUnique(buffer_raw);
+  CUpti_Activity* record = nullptr;
+  for (;;) {


nit: while (true) is a little clearer imo?

Changed to while (true).

apaszke · 2024-12-05T12:02:59Z

jaxlib/mosaic/gpu/BUILD

@@ -188,6 +188,7 @@ pybind_extension(
        "@com_google_absl//absl/cleanup",
        "@com_google_absl//absl/strings",
        "@nanobind",
+        "@xla//xla/pjrt:exceptions",


Could we try to avoid this dependency? IIUC the only reason it's here is so that you can throw XlaRuntimeError, but it would be perfectly OK to raise a different Python error. The dep currently causes some issues with our internal build.

Replaced with std::runtime_error.
(XlaRuntimeError had the benefit that it respects JAX_TRACEBACK_FILTERING).

apaszke

Ok this looks great, but was flagged by our internal ASAN harness for two reasons:

If you use aligned new[], you must also use the aligned delete[] operator (currently the code uses an unaligned delete which is doubly bad). This is fixed in the following diff (+ some missing C++ headers):

diff --git a/jaxlib/mosaic/gpu/mosaic_gpu_ext.cc b/jaxlib/mosaic/gpu/mosaic_gpu_ext.cc
--- a/jaxlib/mosaic/gpu/mosaic_gpu_ext.cc
+++ b/jaxlib/mosaic/gpu/mosaic_gpu_ext.cc
@@ -13,9 +13,13 @@ See the License for the specific languag
 limitations under the License.
 ==============================================================================*/
 
-#include <memory>
+#include <cstddef>
+#include <cstdint>
+#include <new>
 #include <stdexcept>
 #include <string>
+#include <tuple>
+#include <vector>
 
 #include "nanobind/nanobind.h"
 #include "nanobind/stl/tuple.h"
@@ -162,13 +166,14 @@ void callback_request(uint8_t** buffer, 
 }
 
 void callback_complete(CUcontext context, uint32_t streamId,
-                       uint8_t* buffer_raw, size_t size, size_t validSize) {
+                       uint8_t* buffer, size_t size, size_t validSize) {
   // take ownership of the buffer once CUPTI is done using it
-  std::unique_ptr<uint8_t> buffer = absl::WrapUnique(buffer_raw);
+  absl::Cleanup cleanup = [buffer]() {
+    operator delete[](buffer, std::align_val_t(8));
+  };
   CUpti_Activity* record = nullptr;
   while (true) {
-    CUptiResult status =
-        cuptiActivityGetNextRecord(buffer.get(), validSize, &record);
+    CUptiResult status = cuptiActivityGetNextRecord(buffer, validSize, &record);
     if (status == CUPTI_SUCCESS) {
       if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
         // TODO(andportnoy) handle multi-GPU

It looks like CUPTI loves to leak memory, which makes our ASAN test harness very sad. Please make sure to separate CUPTI tests into a separate test target where we can disable ASAN, because we want to continue to make sure we don't leak memory otherwise.

andportnoy · 2024-12-06T20:49:08Z

Thanks, I applied the patch and moved CUPTI-specific tests to a separate target which I marked with tags = ["noasan"].

superbobry · 2024-12-09T15:17:25Z

jax/experimental/mosaic/gpu/profiler.py

+    case "cupti":
+      return _measure_cupti(f, aggregate)
+    case "events":
+      if aggregate == False:


Nit: if not aggregate?

Changed to if not aggregate.

superbobry · 2024-12-09T15:17:39Z

jax/experimental/mosaic/gpu/profiler.py

+  return wrapper
+
+
+def measure(f, mode="cupti", aggregate=True):


Maybe make both optional parameters keyword-only?
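
For illustration, a minimal sketch of what the keyword-only suggestion amounts to. `measure_sketch` is a hypothetical stand-in, not the real `measure`: it does no profiling and only echoes the options it was given, so it runs without JAX or a GPU.

```python
def measure_sketch(f, *, mode="cupti", aggregate=True):
    """Hypothetical stand-in for `measure`; records options, profiles nothing."""
    def wrapper(*args, **kwargs):
        # Run f and return its result alongside the chosen options.
        return f(*args, **kwargs), {"mode": mode, "aggregate": aggregate}
    return wrapper


def positional_call_is_rejected():
    """Returns True if passing `mode` positionally raises TypeError."""
    try:
        measure_sketch(abs, "events")  # the bare `*` forbids this call
    except TypeError:
        return True
    return False
```

With the bare `*` in the signature, callers have to spell out `mode=` and `aggregate=`, which keeps call sites readable if more options are added later.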

superbobry · 2024-12-09T15:18:30Z

jax/experimental/mosaic/gpu/profiler.py

@@ -69,7 +69,7 @@ def _event_elapsed(start_event, end_event):
  )(start_event, end_event)


-def measure(
+def _measure_events(
    f: Callable[P, T], *args: P.args, **kwargs: P.kwargs
 ) -> tuple[T, float]:
  """Measures the time it takes to execute the function on the GPU.


This docstring should be moved to the public measure function and extended to explain mode and aggregate.

I added a detailed docstring to measure.

superbobry · 2024-12-09T15:18:49Z

jax/experimental/mosaic/gpu/profiler.py

+  return wrapper
+
+
+def measure(f, mode="cupti", aggregate=True):


Would you mind adding type annotations to measure?

superbobry · 2024-12-09T15:20:14Z

tests/mosaic/BUILD

+    deps = [
+        "//jax:mosaic_gpu",
+    ] + py_deps("absl/testing"),
+    tags = ["noasan"], # CUPTI leaks memory


Err, is this a bug in CUPTI or are you referring to the static global you are adding?

This is a bug in CUPTI. Could you also please add nomsan?

superbobry · 2024-12-09T15:21:05Z

tests/mosaic/profiler_cupti_test.py

+# pylint: disable=g-complex-comprehension
+config.parse_flags_with_absl()
+
+class ProfilerCuptiTest(parameterized.TestCase):


It might be better to do it in a follow-up, but we probably need a single profiler test that is parameterized over mode.

Idk the tests seem reasonable to me?

The two modes are fundamentally different in a pretty significant way (see measure docstring), so it felt more natural to write dedicated test cases.

apaszke

Please fix Sergei's comments too!

apaszke · 2024-12-09T15:22:51Z

tests/mosaic/BUILD

+    deps = [
+        "//jax:mosaic_gpu",
+    ] + py_deps("absl/testing"),
+    tags = ["noasan"], # CUPTI leaks memory


This is a bug in CUPTI. Could you also please add nomsan?

apaszke · 2024-12-09T15:24:03Z

tests/mosaic/profiler_cupti_test.py

+# pylint: disable=g-complex-comprehension
+config.parse_flags_with_absl()
+
+class ProfilerCuptiTest(parameterized.TestCase):


Idk the tests seem reasonable to me?

superbobry · 2024-12-09T18:54:05Z

jax/experimental/mosaic/gpu/profiler.py

+  return wrapper
+
+
+def measure(f: Callable, *, mode: str = "cupti", aggregate: bool = True):


FYI you can be a bit more precise with the types here:

def measure(f: Callable[P, T], ...) -> Callable[P, tuple[T, float]]: ...

That return type would be incorrect when aggregate=False. Wouldn't hurt to add -> Callable return type though, right?

Added -> Callable.

Well, if you really want it, you can define overloads for measure with literal types for aggregate, but I have mixed feelings about having different return types tbh as mentioned in the other thread.

> I have mixed feelings about having different return types

Why though? The default value of aggregate is True, which means for both modes the return types are the same. The user needs to consciously type out aggregate=False to actually get that array of tuples instead of the default aggregate value.

superbobry · 2024-12-09T18:54:50Z

jax/experimental/mosaic/gpu/profiler.py

+  def wrapper(*args, **kwargs):
+    mosaic_gpu_lib._mosaic_gpu_ext._cupti_init()
+    try:
+      results = jax.block_until_ready(jax.jit(f)(*args, **kwargs))


Shall we jit f in the enclosing namespace?

Hmm what do you mean?

Sorry, I meant

jit_f = jax.jit(f)

just before the definition of wrapper.

Why? If you are thinking about performance, jax.jit doesn't actually JIT until the invocation anyway, right? And if the function has been compiled for the shapes and types before, then it's a quick cache lookup anyway.

What am I missing?

superbobry · 2024-12-09T18:56:51Z

jax/experimental/mosaic/gpu/profiler.py

+      timings = mosaic_gpu_lib._mosaic_gpu_ext._cupti_get_timings()
+    if not timings:
+      return results, None
+    elif aggregate:


Is it useful to have aggregate=False? Can we assume it's true so that the return type of measure is the same for both modes?

Definitely useful, I'd rather keep it.

The user can always aggregate manually, it's just one comprehension away.

Curious what @apaszke thinks as well.

> The user can always aggregate manually

This seems to suggest "let's make aggregate=False the only option" because the user can aggregate themselves, but then the return types are going to be different between the two modes.

> Can we assume it's true so that the return type of measure is the same for both modes?

This seems to suggest making aggregate=True the only option to make the return types uniform.

Am I missing something, or are these contradictory?

I think it's valuable to be able to look at precise individual (disaggregated) kernel timings. This is a crucial bit of functionality that you can only get with CUPTI and not with events. Keeping it is more important than making the return types uniform.

We settled on this design (aggregate/summed timings by default with an option to see individual timings) with @apaszke over DMs over a month ago, but I should have posted more widely so we could have had this discussion earlier :)
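
To make the "one comprehension away" remark in this thread concrete, here is a small sketch of aggregating disaggregated timings by hand. The `(name, elapsed)` record shape and the sample values are assumptions for illustration; the thread only describes the `aggregate=False` result as an array of tuples.

```python
def total_time(timings):
    """Sums elapsed time over all records (the aggregate=True view)."""
    return sum(elapsed for _, elapsed in timings)


def time_per_kernel(timings):
    """Groups elapsed time by kernel name, which a single sum cannot show."""
    per_kernel = {}
    for name, elapsed in timings:
        per_kernel[name] = per_kernel.get(name, 0.0) + elapsed
    return per_kernel


# Hypothetical records standing in for what aggregate=False might return.
sample = [("kernel_a", 0.25), ("kernel_b", 0.5), ("kernel_a", 0.25)]
```

Going from individual records to a total is trivial, while the reverse is impossible, which is the argument made above for keeping `aggregate=False` available.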

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch from 38419c3 to 86ba114 Compare November 8, 2024 22:36

andportnoy marked this pull request as ready for review November 8, 2024 22:36

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch 2 times, most recently from 775078a to 061d3c1 Compare November 12, 2024 20:31

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch 2 times, most recently from 8e5168c to 754c58e Compare November 20, 2024 03:53

andportnoy mentioned this pull request Nov 27, 2024

[Mosaic GPU] Improve default kernel name and add option to customize #25006

Merged

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch from 754c58e to a7d2562 Compare December 4, 2024 16:19

apaszke approved these changes Dec 5, 2024

google-ml-butler bot added kokoro:force-run pull ready Ready for copybara import and testing labels Dec 5, 2024

kokoro-team removed the kokoro:force-run label Dec 5, 2024

apaszke reviewed Dec 5, 2024

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch 2 times, most recently from d89fc98 to e02cc0e Compare December 5, 2024 19:42

andportnoy requested a review from apaszke December 5, 2024 19:46

apaszke requested changes Dec 6, 2024

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch from e02cc0e to 7836d89 Compare December 6, 2024 20:47

andportnoy requested a review from apaszke December 6, 2024 20:48

superbobry reviewed Dec 9, 2024

apaszke approved these changes Dec 9, 2024

google-ml-butler bot added the kokoro:force-run label Dec 9, 2024

kokoro-team removed the kokoro:force-run label Dec 9, 2024

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch from 7836d89 to 5f580af Compare December 9, 2024 18:03

andportnoy requested review from superbobry and apaszke December 9, 2024 18:07

superbobry reviewed Dec 9, 2024

[Mosaic GPU] Add CUPTI profiler alongside events-based implementation

cc22334

andportnoy force-pushed the aportnoy/mosaic-gpu-cupti-profiler branch from 5f580af to cc22334 Compare December 9, 2024 19:31

copybara-service bot merged commit 0d7eaeb into jax-ml:main Dec 11, 2024
11 of 12 checks passed

andportnoy deleted the aportnoy/mosaic-gpu-cupti-profiler branch December 11, 2024 14:52
