[TPU] XLA fails to fuse embedding lookup / array indexing

https://github.com/patrick-kidger/equinox/blob/7ee4ca944d75c33d1403122f7ccf141bc390a55e/equinox/nn/_embedding.py#L100

I'm using `equinox`. Internally, `eqx.nn.Embedding` just indexes the weight array directly (see the link above). This is suboptimal because XLA is unable to fuse `vmap(embed_layer)` calls and instead performs hundreds of thousands of dynamic slice updates over the `weight` array:

![image](https://github.com/user-attachments/assets/9f863720-9d59-4da2-b980-4ced88269537)

Zooming in, we see the same block pattern repeated thousands of times:
![image](https://github.com/user-attachments/assets/f8b2cc78-09ab-428c-9635-5aac364ca713)

Instead, we can force `XLA` to fuse by:
```diff
- return self.weight[x]
+ return jnp.take(self.weight, x, axis=0)
```

![image](https://github.com/user-attachments/assets/d28e14a6-af98-4d8f-af02-e0c521cc0254)

This fixes the issue and yields a ~25% improvement in throughput.
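For reference, here is a minimal sketch of the two lookup variants outside of `equinox`. The table size, batch size, and the function names `lookup_index` / `lookup_take` are made up for illustration. It only checks that both variants return the same values for in-range indices; it does not reproduce the TPU trace or the throughput difference.

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes, chosen only for illustration.
VOCAB, DIM, BATCH = 1000, 64, 32

weight = jax.random.normal(jax.random.PRNGKey(0), (VOCAB, DIM))
tokens = jax.random.randint(jax.random.PRNGKey(1), (BATCH,), 0, VOCAB)


def lookup_index(x):
    # Naive indexing, mirroring the linked line in eqx.nn.Embedding.
    return weight[x]


def lookup_take(x):
    # The replacement proposed in the diff above.
    return jnp.take(weight, x, axis=0)


# Each function handles one scalar token; vmap maps it over the batch.
out_index = jax.vmap(lookup_index)(tokens)
out_take = jax.vmap(lookup_take)(tokens)

# For in-range indices both variants should return identical rows.
same = bool(jnp.array_equal(out_index, out_take))
print(out_index.shape, same)
```

Note that the two variants may treat out-of-range indices differently, so the comparison here is restricted to valid token ids.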

Here's a simple [**colab repro**](https://colab.research.google.com/drive/1GNeQSb6NFkXCPPDmxpEQKu6I6UtGgvAK?usp=sharing#scrollTo=lIYdn1woOS1n) that records two TensorBoard traces. Note that the blocks for the naive lookup are very small, so you may have to zoom in on the trace to see them.

Why does `XLA` fail to fuse/parallelize naive indexing when it manages to do so for `jnp.take`?
Why does the jaxpr generated by `jnp.take` contain `pjit` while the one for naive indexing does not?

If these ops are equivalent, shouldn't `XLA` be able to optimize both the same way? 🤔
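To compare the two variants below the Python level, one can dump the jaxpr and the compiled HLO for each. This is a rough sketch with a hypothetical table and made-up function names. The output depends on the backend and JAX version, so running it on CPU will not necessarily show what the TPU compiler does.

```python
import jax
import jax.numpy as jnp

# Hypothetical embedding table and token ids, for illustration only.
weight = jnp.ones((1000, 64))
tokens = jnp.arange(32)


def lookup_index(x):
    return weight[x]


def lookup_take(x):
    return jnp.take(weight, x, axis=0)


batched_index = jax.vmap(lookup_index)
batched_take = jax.vmap(lookup_take)

# The jaxprs show which primitives each variant traces to.
jaxpr_index = str(jax.make_jaxpr(batched_index)(tokens))
jaxpr_take = str(jax.make_jaxpr(batched_take)(tokens))

# The compiled HLO text shows what XLA produced after its optimization passes
# on the current backend.
hlo_index = jax.jit(batched_index).lower(tokens).compile().as_text()
hlo_take = jax.jit(batched_take).lower(tokens).compile().as_text()

print(jaxpr_index)
print(jaxpr_take)
print(hlo_index)
print(hlo_take)
```

Diffing the two HLO dumps on the TPU backend should make it easier to see where the two lowerings diverge.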


