Why startup times are important for some users
In production some use autoscaling in order to increase / decrease servers in response to demand.
In this scenario startup times are super important because then you can respond quickly to changes in demand without users experiencing worse experience during changes. A build up of pending servers can cause technical issues like network overload.
the issue
When debugging our startup times I saw that populating the cos-sin cache in PagedAttentionWithRope is repeated work that is responsible for a lot of the startup times. For a 13B model it was responsible for about 35s out of 90s total model load times.