Hi all, I am trying to get more experience with implementing (protein) language models in Flax/JAX, and have been working on a re-implementation of Facebook's ESM-2 protein language model. ESM-2 is a fairly standard BERT-style encoder that uses RoPE for its positional embeddings. However, I have so far been unsuccessful in implementing the stack of transformer layers using `nnx.vmap` and `nnx.scan`: when I try to run the model, I get a trace-level error.
I am unsure what I am doing wrong. I'll walk through my (naive?) approach to implementing ESM-2; could anyone spot where I am making a mistake?

### ESM-2 main class

The ESM-2 model comprises an embedding layer, a stack of transformer layers, an additional layer norm, and a final logit output layer. I am using the following implementation:

```python
class ESM2(nnx.Module):
    def __init__(
        self,
        alphabet: Alphabet,
        num_layers: int = 33,
        d_embed: int = 1024,
        num_heads: int = 16,
        *,
        rngs: nnx.Rngs
    ):
        # Token -> initial embedding
        self.embed: nnx.Embed = nnx.Embed(len(alphabet.tokens), d_embed, rngs=rngs)

        # BERT-style transformer layers with rotary positional embeddings.
        # Split the RNG generator into one stream per layer, and use vmap to create each layer.
        @nnx.split_rngs(splits=num_layers)
        @nnx.vmap(axis_size=num_layers)
        def create_layers(rngs: nnx.Rngs):
            return TransformerEncoder(d_embed, d_embed * 4, num_heads, rngs=rngs)

        self.transformer_layers = create_layers(rngs)

        self.layer_norm_after: nnx.LayerNorm = nnx.LayerNorm(d_embed, rngs=rngs)
        self.logit_head: LogitHead = LogitHead(d_embed, len(alphabet.tokens), rngs=rngs)

    def __call__(self, tokens: jax.Array):
        # Token -> initial embedding
        x = self.embed(tokens)

        # Use nnx.scan to sequentially apply each transformer layer
        @nnx.scan
        def apply_transformer_layer(x: jax.Array, layer: TransformerEncoder):
            return layer(x), None

        x, _ = apply_transformer_layer(x, self.transformer_layers)

        x = self.layer_norm_after(x)
        logits = self.logit_head(x)
        return logits
```
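For reference, here is the create-with-`nnx.vmap` / apply-with-`nnx.scan` pattern used above, isolated into a self-contained sketch with plain `nnx.Linear` blocks. The names and sizes are illustrative only, not from the repository, and the sketch assumes the default `nnx.scan` axes (carry first, stacked module second), mirroring the code above:

```python
# Self-contained sketch of the stack-of-layers pattern, with nnx.Linear blocks
# standing in for TransformerEncoder; names and sizes are illustrative only.
import jax
import jax.numpy as jnp
from flax import nnx

num_layers, d = 4, 16

# Split the RNG stream into one key per layer and vmap the constructor, so
# every parameter in the resulting module gets a leading "layer" axis.
@nnx.split_rngs(splits=num_layers)
@nnx.vmap(axis_size=num_layers)
def create_block(rngs: nnx.Rngs):
    return nnx.Linear(d, d, rngs=rngs)

blocks = create_block(nnx.Rngs(0))

# Apply the stacked blocks sequentially: x is the scan carry, and each step
# receives one slice of the stacked module along the leading layer axis.
@nnx.scan
def apply_block(x: jax.Array, block: nnx.Linear):
    return block(x), None

x = jnp.ones((8, d))
y, _ = apply_block(x, blocks)
```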
### Transformer layer with RoPE

Within each `TransformerEncoder` layer I am trying to reuse `nnx.MultiHeadAttention`, with a custom attention function that applies RoPE to the query and key:

```python
# Inside TransformerEncoder.__init__:
self.rope: RoPE = RoPE(d_embed // num_heads, rngs=rngs)

# This is my custom dot_product_attention, which accepts a RoPE module as its first argument
attention_fn = partial(dot_product_attention, self.rope)

self.attention: nnx.MultiHeadAttention = nnx.MultiHeadAttention(
    num_heads=num_heads,
    in_features=d_embed,
    attention_fn=attention_fn,
    decode=False,
    rngs=rngs
)
```
(Full source: https://github.com/lrvdijk/flamino/blob/main/src/flamino/transformer.py)

### Custom `dot_product_attention`

```python
def dot_product_attention(
    rope_module: RoPE,
    query: jax.Array,
    key: jax.Array,
    value: jax.Array,
    ...  # Other parameters
):
    """Drop-in replacement for Flax's dot_product_attention, but with RoPE applied to the query and key."""
    batch_rope = nnx.vmap(rope_module)
    query = batch_rope(query)
    key = batch_rope(key)

    return nnx_dot_product_attention(
        query,
        key,
        value,
        ...  # Other parameters
    )
```
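For context, `nnx_dot_product_attention` in the snippet above is assumed to be Flax's standard attention function imported under an alias, along the lines of:

```python
# Assumed import alias (not shown in the snippet above): the standard Flax
# attention that the custom wrapper delegates to.
from flax.nnx import dot_product_attention as nnx_dot_product_attention
```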
### RoPE implementation

Inspired by Equinox's RoPE implementation, I cache the computed sin/cos arrays in a global dict:

```python
class RoPE(nnx.Module):
    ...

    def __call__(self, x: jax.Array):
        assert x.ndim == 2
        seq_len, embed_size = x.shape
        assert embed_size == self.d_embed, "Sequence embedding dimension mismatch"

        with jax.ensure_compile_time_eval():
            cache_key = (embed_size, x.dtype)

            # Check the global cache for the given embedding size and dtype
            if cache_key not in rope_sin_cos_table_cache:
                sin_table, cos_table = self._compute_sin_cos_table(seq_len, x.dtype)
                rope_sin_cos_table_cache[cache_key] = (sin_table, cos_table)
            else:
                sin_table, cos_table = rope_sin_cos_table_cache[cache_key]

                # Re-compute the sin/cos tables if the current sequence is
                # longer than the cached tables
                freq_seq_len = sin_table.shape[0]
                if freq_seq_len < seq_len:
                    sin_table, cos_table = self._compute_sin_cos_table(seq_len, x.dtype)
                    rope_sin_cos_table_cache[cache_key] = (sin_table, cos_table)

        return apply_rope(x, sin_table, cos_table)
```

(Full source: https://github.com/lrvdijk/flamino/blob/main/src/flamino/rope.py)
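For readers unfamiliar with `apply_rope`: below is a generic sketch of the half-rotation formulation that ESM-2-style rotary embeddings use. It is not the repository's actual `apply_rope`; the function signature and the table layout (sin/cos values duplicated across both halves of the embedding dimension) are assumptions.

```python
import jax.numpy as jnp

def rotate_half(x):
    # Split the last dimension into two halves and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = jnp.split(x, 2, axis=-1)
    return jnp.concatenate((-x2, x1), axis=-1)

def apply_rope_sketch(x, sin_table, cos_table):
    # x: (seq_len, d_embed); sin/cos tables: (>= seq_len, d_embed).
    # Rotates each position's embedding by a position-dependent angle.
    seq_len = x.shape[0]
    sin, cos = sin_table[:seq_len], cos_table[:seq_len]
    return (x * cos) + (rotate_half(x) * sin)
```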
### Wrap-up

Given that the exception is raised inside one of the transformed functions: how can I fix this error?
---
Hey @lrvdijk! This looks great, let's try to get it working. The first thing to note is that currently it's not a good idea to transform instance methods, e.g. `batched_model = nnx.vmap(model)`, as here you are passing `self` in `def __call__(self, ...)` as a capture, and this triggers the trace-level error when trying to mutate Modules or Variables, since NNX cannot keep track of these changes. The recommended approach is to create a function that takes the `model` as an explicit input and transform that:

```python
@nnx.vmap(in_axes=(None, 0))
def forward(model, x):
    return model(x)
```

Here we assume you want to broadcast `model`. The same thing would apply to `batch_rope`.

Try to fix this part and we can solve the rest.
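Below is a minimal sketch of what this recommendation could look like when applied to the `batch_rope` call inside the custom `dot_product_attention` above. It only illustrates the explicit-input pattern and is not a confirmed fix; the single leading batch axis is an assumption carried over from the original `batch_rope` usage.

```python
# Sketch only: the RoPE module becomes an explicit, broadcast input instead of
# a captured instance, mirroring the `forward` example above. The assumption
# that query/key carry a single leading batch axis comes from the original
# batch_rope usage and may need adjusting for the real attention shapes.
@nnx.vmap(in_axes=(None, 0))
def batch_rope(rope_module, x):
    return rope_module(x)

# Inside dot_product_attention, replacing the original three batch_rope lines:
query = batch_rope(rope_module, query)
key = batch_rope(rope_module, key)
```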