
How is relative distance computed for cross-attention in the T5 model? #415

@wasiahmad

Description

I am trying to understand how relative distance is computed for cross-attention in the T5 model. My understanding is based on the Hugging Face T5 implementation.

Let's assume we have a source sequence of length 7 and a target sequence of length 5. In the cross-attention sublayer of each decoder layer, every token in the target sequence attends to every token in the source sequence.

In the T5 model, the relative distance used to compute the position bias is derived from the query length and key length, as in https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_t5.py#L289.

My question is: how is the distance between two tokens computed when one belongs to the source sequence and the other to the target sequence? The relative distance matrix (5 x 7) would look like:

tensor([[ 0,  1,  2,  3,  4,  5,  6],
        [-1,  0,  1,  2,  3,  4,  5],
        [-2, -1,  0,  1,  2,  3,  4],
        [-3, -2, -1,  0,  1,  2,  3],
        [-4, -3, -2, -1,  0,  1,  2]])
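
For concreteness, here is a small PyTorch sketch (my own, based on my reading of the linked code; the variable names query_length/key_length are mine) that reproduces the matrix above by subtracting the decoder (query) positions from the encoder (key) positions:

import torch

query_length, key_length = 5, 7                          # target = 5, source = 7
context_position = torch.arange(query_length)[:, None]   # decoder (query) positions, shape (5, 1)
memory_position = torch.arange(key_length)[None, :]      # encoder (key) positions, shape (1, 7)
relative_position = memory_position - context_position   # broadcast to shape (5, 7)
print(relative_position)
# tensor([[ 0,  1,  2,  3,  4,  5,  6],
#         ...
#         [-4, -3, -2, -1,  0,  1,  2]])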

Once we put the distances into buckets for the cross-attention, it would look like:

tensor([[0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0]])
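
And here is a self-contained sketch of the bucketing step as I understand it from the linked line (not the exact Hugging Face source; the defaults num_buckets=32 and max_distance=128 are assumed). With bidirectional=False it reproduces the bucket matrix above:

import math
import torch

def relative_position_bucket(relative_position, bidirectional=False,
                             num_buckets=32, max_distance=128):
    # Sketch of the bucketing logic for signed relative positions.
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        # Half the buckets for each direction; keep the sign via an offset.
        num_buckets //= 2
        relative_buckets += (relative_position > 0).long() * num_buckets
        relative_position = relative_position.abs()
    else:
        # Unidirectional: positive distances ("future" keys) are clamped to 0
        # before bucketing, so they all land in bucket 0.
        relative_position = -torch.min(relative_position,
                                       torch.zeros_like(relative_position))
    # Small distances get their own bucket; larger ones are binned log-spaced.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact) * (num_buckets - max_exact)
    ).long()
    large = torch.min(large, torch.full_like(large, num_buckets - 1))
    return relative_buckets + torch.where(is_small, relative_position, large)

context_position = torch.arange(5)[:, None]
memory_position = torch.arange(7)[None, :]
print(relative_position_bucket(memory_position - context_position))
# reproduces the 5 x 7 bucket matrix above

Since all the distances in this example are smaller than max_exact, each clamped distance maps directly to a bucket of the same value.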

Given that cross-attention is part of the decoder, the bidirectional flag is set to False. So it means that while decoding at step i, the decoder will treat all the source tokens at positions i, i+1, i+2, ... as having a distance of 0 from the target token at position i. Is this correct?

Stack Overflow post link
