Potential Speedup for model loading: populating sin-cos cache is slow / repeated  #977

@ri938

Description

Why startup times are important for some users

In production, some users rely on autoscaling to add or remove servers in response to demand.

In this scenario, startup times matter a lot: fast startup lets you respond quickly to changes in demand without degrading the user experience while capacity catches up. A build-up of pending servers can also cause technical issues such as network overload.

The issue

When debugging our startup times, I saw that populating the cos-sin cache in PagedAttentionWithRope is repeated work that accounts for a large share of startup time. For a 13B model, it was responsible for about 35s of the 90s total model load time.
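
One possible way to avoid the repeated work is to build the table once and share it across layers. The snippet below is only a rough sketch of that idea, not the actual vLLM code: `build_cos_sin_cache` and `AttentionLayer` are hypothetical names, and it assumes the table depends only on the rotary dimension, the maximum position, and the base, so that every layer with the same settings can reuse the same object. It uses plain Python lists instead of tensors to stay self-contained.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def build_cos_sin_cache(rotary_dim: int, max_position: int, base: float = 10000.0):
    """Build the rotary cos-sin table once per (rotary_dim, max_position, base).

    Hypothetical helper for illustration. Each row holds the cos values
    followed by the sin values for one position.
    """
    # Inverse frequencies: one per pair of rotary dimensions.
    inv_freq = [1.0 / (base ** (i / rotary_dim)) for i in range(0, rotary_dim, 2)]
    table = []
    for pos in range(max_position):
        angles = [pos * f for f in inv_freq]
        row = tuple(math.cos(a) for a in angles) + tuple(math.sin(a) for a in angles)
        table.append(row)
    # Return an immutable structure so sharing it between layers is safe.
    return tuple(table)


class AttentionLayer:
    """Hypothetical stand-in for an attention layer that needs the table."""

    def __init__(self, rotary_dim: int, max_position: int, base: float = 10000.0):
        # Always pass the arguments positionally so lru_cache sees the same key.
        # The first layer pays the build cost; later layers get a cache hit.
        self.cos_sin_cache = build_cos_sin_cache(rotary_dim, max_position, base)
```

With this pattern, constructing 40 layers with the same settings computes the table once instead of 40 times. In a real model the shared object would presumably be a tensor moved to the right device and dtype once, which is an assumption on my part about how it would be wired in.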
