Basic all_gather implementation #1663
Conversation
exla/lib/exla/defn.ex
Outdated
```elixir
Value.all_gather(
  [tensor],
  expr_to_typespec(ans),
  all_gather_dim,
  replica_groups,
  use_global_device_ids,
  Keyword.take(opts, [:channel_id])
)
|> hd()
```
Let's hard-match for now instead of using `hd` (i.e. `[result] = Value...`).

And then add a comment that we might want to surface `all_gather` as an operation that takes a container of operands instead of a single one.
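A minimal sketch of both suggestions, reusing the bindings from the hunk above:

```elixir
# NOTE: we might want to surface all_gather as an operation that takes
# a container of operands instead of a single one.
[result] =
  Value.all_gather(
    [tensor],
    expr_to_typespec(ans),
    all_gather_dim,
    replica_groups,
    use_global_device_ids,
    Keyword.take(opts, [:channel_id])
  )
```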
exla/lib/exla/mlir/value.ex
Outdated
```elixir
attributes =
  if opts[:channel_id] do
    attributes ++ [channel_id: attr_i64(opts[:channel_id])]
```
Let's use Keyword.put instead of ++
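Applied to the hunk above, that would read roughly as follows (same bindings as in the diff; `Keyword.put/3` also replaces any existing `:channel_id` entry instead of appending a duplicate):

```elixir
attributes =
  if opts[:channel_id] do
    Keyword.put(attributes, :channel_id, attr_i64(opts[:channel_id]))
  else
    attributes
  end
```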
exla/lib/exla/mlir/value.ex
Outdated
```elixir
if opts[:channel_id] do
  attributes ++ [channel_id: attr_i64(opts[:channel_id])]
else
  attributes end
```
formatting
exla/lib/exla/mlir/value.ex
Outdated
```elixir
  end
end

def all_gather([%Value{function: func} | _] = operands, typespec, all_gather_dim, replica_groups, use_global_device_ids, opts \\ []) do
```
How about making `channel_id` a required argument and just passing the value directly?
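One way that signature could look, keeping the other parameters from the diff (illustrative only, not the final API):

```elixir
def all_gather(
      [%Value{function: func} | _] = operands,
      typespec,
      all_gather_dim,
      replica_groups,
      use_global_device_ids,
      channel_id
    ) do
  # channel_id is required now, so the opts/Keyword handling goes away
  # and attr_i64(channel_id) can be used directly in the attributes.
end
```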
nx/lib/nx/defn/evaluator.ex
Outdated
```elixir
if op == :all_gather and not function_exported?(mod, :all_gather, 3) do
  raise ArgumentError,
        "all_gather/3 is not supported by backend #{inspect(mod)}."
end
```
If we remove this, do we have a test verifying this raise? Also, I believe this is already checked elsewhere.
If it's not, it seems to me that this check should be more general
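For illustration, a more general version of that guard might check whatever optional callback is being dispatched rather than hard-coding `:all_gather`; the `op` binding and the arity 3 below are carried over from the removed check, not from an actual Nx callback table:

```elixir
# Hypothetical generalization of the hunk above.
if not function_exported?(mod, op, 3) do
  raise ArgumentError,
        "#{op}/3 is not supported by backend #{inspect(mod)}"
end
```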
```elixir
_all_gather_dim = opts[:all_gather_dim]
replica_groups = opts[:replica_groups]

# Calculate group size (number of replicas per group)
_group_size =
  case replica_groups do
    [first_group | _] -> length(first_group)
    [] -> 1
  end

# Calculate output shape by multiplying the gather dimension by group_size
input_shape = tensor.shape
output_shape =
  input_shape
  # |> Tuple.to_list()
  # |> List.update_at(all_gather_dim, &(&1 * group_size))
  # |> List.to_tuple()

# Create output tensor with the new shape
```
There are a few unused values here due to the stray comments, and those should all be removed. Also, just pass `tensor` as the output directly.
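Put together, the cleaned-up hunk might read as follows; `out` is a hypothetical binding for whatever the surrounding evaluator code expects:

```elixir
replica_groups = opts[:replica_groups]

# Per the review: skip the shape math entirely and pass the input
# tensor through as the output directly.
out = tensor
```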
nx/lib/nx/defn/kernel.ex
Outdated
```
* `tensor` - The input tensor to gather
* `all_gather_dim` - The dimension along which to gather
* `replica_groups` - 2D list defining how replicas are grouped (required)
```
I'm not sure this is the terminology we want to surface here. For now, let's make the function `all_gather(tensor, opts)` and defer the documentation of `opts` to the specific backend or compiler.

And in EXLA we should add a new section to the EXLA moduledoc describing sharding.
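As a rough illustration of that shape, usage in `defn` would then pass backend-specific options straight through (the `replica_groups` option below comes from this diff; treating a 2×2 grouping as valid here is an assumption):

```elixir
defmodule MyShardedOps do
  import Nx.Defn

  defn gather_all(t) do
    # opts are opaque to Nx; their meaning is documented by the
    # backend/compiler that lowers all_gather
    all_gather(t, replica_groups: [[0, 1], [2, 3]])
  end
end
```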
polvalente left a comment:
This is looking great! I think we need more tests in both Nx and EXLA.
Implements Nx.Defn.Kernel.all_gather/2 to gather sharded tensor data across mesh partitions during distributed execution.
Changes

Nx
- Add `all_gather/2` in `defn/kernel.ex` and `defn/expr.ex` with sharding semantics
- Add evaluator support for `all_gather` in `defn/evaluator.ex`

EXLA
- Lower `all_gather` to `stablehlo.all_gather` in `defn.ex` and `mlir/value.ex`

Tests
- `EXLA.Defn.ShardingTest`: "generates correct MLIR with all_gather" checks MLIR generation and `shard_jit` output across a 2×2 mesh along axes 0 and 1