I am trying to understand how the relative distance is computed for cross-attention in the T5 model. My understanding is based on the Hugging Face T5 implementation.
Let's assume we have a source sequence of length 7 and a target sequence of length 5. In the cross-attention sublayer of each decoder layer, every token in the target sequence attends to every token in the source sequence.
In the T5 model, the relative distance used to compute the position bias is derived from query-len and key-len, as in https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_t5.py#L289.
My question is how the distance between two tokens is computed when one belongs to the source sequence and the other to the target sequence. The relative distance matrix (5 x 7) would look like:
```
tensor([[ 0,  1,  2,  3,  4,  5,  6],
        [-1,  0,  1,  2,  3,  4,  5],
        [-2, -1,  0,  1,  2,  3,  4],
        [-3, -2, -1,  0,  1,  2,  3],
        [-4, -3, -2, -1,  0,  1,  2]])
```
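For reference, here is a minimal sketch (my own reconstruction, not the library's code; variable names are mine) of how I believe this matrix is produced from query-len and key-len, following the logic around the line linked above:

```python
import torch

query_length, key_length = 5, 7  # target (decoder) length and source (encoder) length

context_position = torch.arange(query_length, dtype=torch.long)[:, None]  # decoder positions, shape (5, 1)
memory_position = torch.arange(key_length, dtype=torch.long)[None, :]     # encoder positions, shape (1, 7)

# Relative distance of every key (source) position from every query (target) position.
relative_position = memory_position - context_position  # shape (5, 7)
print(relative_position)  # matches the 5 x 7 matrix above
```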
Once we put the distances into buckets for the cross-attention, they would look like:
```
tensor([[0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0]])
```
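To show where these buckets come from, here is a simplified restatement of the bucketing function as I read it from modeling_t5.py (a sketch assuming the default num_buckets=32 and max_distance=128, not a drop-in replacement):

```python
import math
import torch

def relative_position_bucket(relative_position, bidirectional=False, num_buckets=32, max_distance=128):
    # Simplified restatement of T5's _relative_position_bucket, as I understand it.
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).long() * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        # Unidirectional case: every non-negative distance (key at or after the
        # query position) is clamped to 0 before bucketing.
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))

    # Distances below max_exact each get their own bucket; larger distances are
    # binned logarithmically up to max_distance.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).long()
    relative_position_if_large = torch.min(
        relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
    )
    return relative_buckets + torch.where(is_small, relative_position, relative_position_if_large)

context_position = torch.arange(5, dtype=torch.long)[:, None]
memory_position = torch.arange(7, dtype=torch.long)[None, :]
buckets = relative_position_bucket(memory_position - context_position, bidirectional=False)
print(buckets)  # matches the bucketed 5 x 7 matrix above
```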
Given that cross-attention is part of the decoder, the bidirectional flag is set to False. So it means that, while decoding at step i, the decoder will treat all the source tokens at positions i, i+1, i+2, ... as having a distance of 0 from the target token at position i. Is this correct?