CI: 05/28/25 upstream sync #436

Open
wants to merge 1,706 commits into base: rocm-main

Conversation

rocm-repo-management-api-2[bot]

Daily sync with upstream

hawkinsp and others added 30 commits May 14, 2025 20:41
A lot of this logic was confusingly phrased as conditions over both CPU
and GPU build flags. But we can decompose it into:
* dependencies we add for CPU tests, and
* additional dependencies we add for GPU tests.

While we are here, also add the necessary pypi dependency for TPU tests.
Hold references to raw buffers instead of PjRtBuffers.

This fixes an issue where the buffers could be deleted before
the transfer completes, but introduces another problem: if the
buffers are donated, it will now silently read from the donated arrays.

Once the underlying runtime exposes usage holds properly, this
new codepath should take a usage hold and the old PjRtBuffer
path should be removed.

PiperOrigin-RevId: 758819621
These had been accidentally broken at some point in the plugin
switchover.
`slice` is not hashable before Python 3.12. This change works around that by
converting the slice into a hash value.
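
A minimal illustration of the workaround (not the actual JAX code; the cache-key logic around it is omitted):

```python
import sys

s = slice(1, 10, 2)
if sys.version_info >= (3, 12):
    key = hash(s)  # `slice` objects are hashable starting with Python 3.12
else:
    # On older Pythons, hash a tuple of the slice's fields instead.
    key = hash((s.start, s.stop, s.step))
```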

PiperOrigin-RevId: 758905560
We must not depend on the nvidia_nvshmem_cu12 pip package directly since it does not exist on Windows and Mac platforms.

PiperOrigin-RevId: 758917499
The errors are too verbose and mostly not very useful.

PiperOrigin-RevId: 759025165
We weren't handling them correctly, meaning you couldn't use a `shard_map`/`ManualComputationOp` that has callbacks inside.
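
For context, a minimal sketch of the pattern this enables, assuming a single-host setup; the `check_rep=False` argument is a conservative assumption here, not something prescribed by this change:

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices()[:1], axis_names=("x",))

def body(x):
    # A host callback running inside the manual computation.
    jax.debug.callback(lambda v: print("shard:", v), x)
    return x * 2

f = shard_map(body, mesh=mesh, in_specs=P("x"), out_specs=P("x"), check_rep=False)
print(f(jnp.arange(4.0)))
```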

PiperOrigin-RevId: 759072597
The "add a token" part of the `callback` primitive's MLIR lowering was incorrectly adding a ranked sharding by using the sharding of a ranked tensor. So instead create an unranked sharding explicitly

PiperOrigin-RevId: 759135477
This shouldn't affect existing behaviors or trace time.

The main implementation ideas:
* each Trace is tagged with a `requires_low: bool`
* each Jaxpr
  * is tagged with an `is_high: bool`, default False but set True while tracing
    if any hijax primitives are encountered
  * includes a `mut_types: dict[Var, HijaxType]` indicating final types for
    type-changing mutable hijax types
* each AbstractValue is tagged by a `mutable: bool` which is read to populate
  `mut_types`
* each Primitive
  * has an `is_high(**params) -> bool` method (depends on params for HOPs)
  * has a `to_lojax(*args, **params)` method taking and returning
    hijaxtypes-wrapping-lowtracers
* in `Primitive.bind`, we check if `prim.is_high(**params) and
  trace.requires_low`, and if so we call `prim.to_lojax` (see the sketch below)
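
A rough, non-authoritative sketch of that dispatch (method names follow the description above; the bodies are illustrative placeholders, not the actual JAX internals):

```python
class Primitive:
    def is_high(self, **params) -> bool:
        return False                      # ordinary ("lojax") primitives

    def to_lojax(self, *args, **params):
        raise NotImplementedError         # overridden by hijax primitives

    def bind(self, trace, *args, **params):
        # Simplified: the real bind locates the current trace itself.
        if self.is_high(**params) and trace.requires_low:
            # The trace only understands lojax, so expand the hijax primitive
            # into lojax ops on the low-level tracers wrapped in the hijax values.
            return self.to_lojax(*args, **params)
        return trace.process_primitive(self, args, params)
```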

Co-authored-by: Dougal Maclaurin <dougalm@google.com>
PiperOrigin-RevId: 759336328
…tly it looks like this.

```
ValueError: Pytree for `in_specs` and inputs do not match. There are 1 mismatches, including:
    * `in_specs` is a tuple of length 1 but inputs is a tuple of length 4, so the lengths do not match

```

PiperOrigin-RevId: 759499528
…t_dict_merge

PiperOrigin-RevId: 759579563
The implementation currently forces the optimization level to `O=0` due to a
suspected bug in the NVPTX backend.

To get source information (a Python sketch follows the list):

* Set MOSAIC_GPU_LINE_INFO=1
* Run with --jax_include_full_tracebacks_in_locations=true
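
A hedged Python sketch of the same setup, assuming the flag is also exposed through `jax.config` (otherwise pass the command-line flag shown above):

```python
import os
os.environ["MOSAIC_GPU_LINE_INFO"] = "1"  # must be set before kernels are compiled

import jax
# Equivalent to --jax_include_full_tracebacks_in_locations=true, assuming the
# flag is exposed as a jax.config option.
jax.config.update("jax_include_full_tracebacks_in_locations", True)
```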

PiperOrigin-RevId: 759608368
Google-ML-Automation and others added 28 commits May 26, 2025 04:16
The C128 matmuls will be routed to cuBLAS rather than being handled by the loop emitter, causing a very slight numerical difference.
Therefore, relax the strictness of the comparison.

PiperOrigin-RevId: 763397887
…om-ptxas-and-llvm

PiperOrigin-RevId: 763701410
…yout in some ops

I can't explain it, but if we don't do it then the verifier sometimes fails...
I'm not even sure how to properly trigger this in a test right now, but worst case it
would result in more verifier failures to fix, so I think it's fine to merge as is.

PiperOrigin-RevId: 763711454
I thought this didn't work, but it does! Still, adding a test to make sure
we don't regress it.

PiperOrigin-RevId: 763717665
If we don't synchronize the warps, some of them can go on and schedule
e.g. async copies without waiting for the memory transactions of other
warps in the warpgroup to complete.

PiperOrigin-RevId: 763721411
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, and leads to improved build and iteration times.

This was unblocked by moving ad, batching, and custom_transpose to their own rules in prior changes. It required one small code refactoring: moving an effects registration to the location where the effect is defined.

PiperOrigin-RevId: 763736189
…TPU interpret mode.

Since dimensions with parallel semantics must now appear as the leading dimensions of the grid, this CL also makes the sequential iteration over cores in the simulation never re-visit a core after the simulation has moved on to the next core. This enables the simulation to correctly omit loads and stores of kernel buffers if the same (slice of a) buffer is processed by multiple kernel invocations on the same core.

PiperOrigin-RevId: 763737647
We already call `xla::sdy::addSdyRoundTripExportPipeline` in `xla::SerializeUsingVersionedStablehlo`, so this is no longer needed.

PiperOrigin-RevId: 763762358
Just to give us extra confidence while we make changes.

PiperOrigin-RevId: 763767275
We sometimes access NVSHMEM functions from the host code too, which means
we should include the NVSHMEM host library in the context of the ExecutionEngine.

PiperOrigin-RevId: 763777731
This will make it much simpler to make the kernel persistent.

PiperOrigin-RevId: 763782577
Before this fix, the test would finish before execution was done, and profiling would thus yield nothing.
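
An illustrative (non-authoritative) version of the pattern: block on the result before the profiling context exits, so the trace is not empty.

```python
import jax
import jax.numpy as jnp

x = jnp.ones((1024, 1024))
with jax.profiler.trace("/tmp/jax-trace"):  # output directory is an arbitrary example
    y = x @ x                # dispatched asynchronously
    y.block_until_ready()    # wait for execution to finish before the trace closes
```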

PiperOrigin-RevId: 763783695
…ToXlaComputation`.

PiperOrigin-RevId: 763837933
…nsertion

Enabling this flag can introduce races into certain kernels, which is why it's
False by default. Still, there are plenty of kernels where it's unnecessary, and
a few of those suffer performance regressions when it is on, so it makes sense
to at least allow users to opt out.

PiperOrigin-RevId: 763853668
PiperOrigin-RevId: 763862020
Previously, the result of vmapped RA2A was a concatenation of the flattened results.

PiperOrigin-RevId: 763958632
@rocm-repo-management-api-2 rocm-repo-management-api-2 bot requested a review from a team as a code owner May 28, 2025 06:02
@rocm-repo-management-api-2 rocm-repo-management-api-2 bot enabled auto-merge (rebase) May 28, 2025 06:02