I am trying to understand how the relative distance is computed for cross-attention in the T5 model. My understanding is based on the Hugging Face T5 implementation.
Let's assume we have a source sequence of length 7 and a target sequence of length 5. In the cross-attention sublayer of each decoder layer, every token in the target sequence attends to every token in the source sequence.
In the T5 model, the relative distance used to compute the position bias is derived from query-len and key-len, as in https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_t5.py#L289.
My question is how the distance between two tokens is computed when one belongs to the source sequence and the other to the target sequence. The relative distance matrix (5 x 7) would look like:
```
tensor([[ 0,  1,  2,  3,  4,  5,  6],
        [-1,  0,  1,  2,  3,  4,  5],
        [-2, -1,  0,  1,  2,  3,  4],
        [-3, -2, -1,  0,  1,  2,  3],
        [-4, -3, -2, -1,  0,  1,  2]])
```
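For reference, here is a minimal sketch (my own reconstruction, not the library's code; variable names are mine) of how I believe this matrix is produced from query-len and key-len, following the logic around the line linked above:

```python
import torch

query_length, key_length = 5, 7  # target (decoder) length and source (encoder) length

context_position = torch.arange(query_length, dtype=torch.long)[:, None]  # decoder positions, shape (5, 1)
memory_position = torch.arange(key_length, dtype=torch.long)[None, :]     # encoder positions, shape (1, 7)

# Relative distance of every key (source) position from every query (target) position.
relative_position = memory_position - context_position  # shape (5, 7)
print(relative_position)  # matches the 5 x 7 matrix above
```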
Once we put the distances into buckets for the cross-attention, they would look like:
```
tensor([[0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0]])
```
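To show where these buckets come from, here is a simplified restatement of the bucketing function as I read it from modeling_t5.py (a sketch assuming the default num_buckets=32 and max_distance=128, not a drop-in replacement):

```python
import math
import torch

def relative_position_bucket(relative_position, bidirectional=False, num_buckets=32, max_distance=128):
    # Simplified restatement of T5's _relative_position_bucket, as I understand it.
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).long() * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        # Unidirectional case: every non-negative distance (key at or after the
        # query position) is clamped to 0 before bucketing.
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))

    # Distances below max_exact each get their own bucket; larger distances are
    # binned logarithmically up to max_distance.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).long()
    relative_position_if_large = torch.min(
        relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
    )
    return relative_buckets + torch.where(is_small, relative_position, relative_position_if_large)

context_position = torch.arange(5, dtype=torch.long)[:, None]
memory_position = torch.arange(7, dtype=torch.long)[None, :]
buckets = relative_position_bucket(memory_position - context_position, bidirectional=False)
print(buckets)  # matches the bucketed 5 x 7 matrix above
```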
Given that cross-attention is part of the decoder, the bidirectional flag is set to False. So it means that, while decoding at step i, the decoder will treat all the source tokens at positions i, i+1, i+2, ... as having a distance of 0 from the target token at position i. Is this correct?