Increase the probability of balancing new regions #2524

nolouch · 2020-06-09T17:42:31Z

Feature Request

Describe your feature request related problem

There is an issue about the new region may have a low probability to be balanced，such as:

                             8
                         7   7
                     6   6   6  
                 5   5   5   5
             4   4   4   4   4
         3   3   3   3   3   3
     2   2   2   2   2   2   2 
 1   1   1   1   1   1   1   1 
t1  t2  t3  t4  t5  t6  t7  t8

such as the above drawing, the y-axis represents the existing regions at that moment. region 8 only appears at time t8, the probability of R8 being selected in the entire sample space is very small.
this issue will lead to a scenario: import a new table into a big cluster, the region of this table will be unevenly distributed, which will bring some hot store issues.

Describe alternatives you've considered

Empirically, We hope that the distribution of data is relatively random.

Teachability, Documentation, Adoption, Migration Strategy

Go
Stochastic Processes / Probability theory

The text was updated successfully, but these errors were encountered:

lhy1024 · 2020-06-10T04:26:16Z

The first proposal:

We put new regions with TTL to the fresh set and pick them with more priority. Its disadvantage is that it is difficult for us to give the probability between the fresh set and the other one.

lhy1024 · 2020-06-10T04:35:20Z

The second proposal:

Suppose that we expect all regions have the same probability of being scheduled in the past x period. And that the total number of regions at the current time is m.

When placing a new region with a TTL, which will live x period, the probability of being scheduled ( TTL + 1) /m

Its disadvantage is that when the difference between x and m is too large, there is not a big difference in their probability

Of course, we can choose non-linear reduction for TTL weight, such as (F(TTL)+1)/m

dimstars · 2020-06-29T07:03:13Z

I tested a solution:
The total online time of the system t1 minus the survival time of each region t2 is used to represent the scheduling weight w, and the sum of all region weights w is sumt. Each region is selected with a probability of w/sumt.

The figure below shows the comparison of the probability between the three modifications and the original scheme after 1000 cycles (a new region will be generated in each cycle). The strategies they choose for each cycle are as follows:
Original scheme: all regions are in no particular order with equal probability.
w: The w/sumt scheme above.
w^2: In order to further reflect the urgency of choosing a new region, w^2 is used instead of w for calculation.
w^10: Calculate w^10 instead of w.

It can be seen that the "w^10" scheme can greatly reduce the probability gap of being selected in most regions.

dimstars · 2020-06-29T07:26:51Z

Consider the following:

It should be that the probability of all regions being selected is basically consistent within a certain period of time x.
It is not appropriate to use the total online time of the system t as a reference, because old regions are difficult to be selected as time goes by.
Regions may be scheduled multiple times in a short time, so the amount of probability increase needs to be weighed.
The selected region may not be successfully dispatched. How to deal with this situation?

rleungx · 2020-06-29T07:56:10Z

It should be that the probability of all regions being selected is basically consistent within a certain period of time x.
It is not appropriate to use the total online time of the system t as a reference, because old regions are difficult to be selected as time goes by.

I think for a newly created region, we can increase the probability of being selected for that region through a kind of mechanism like TTL. But for the regions which have already existed in the system, we can still use the original way. There is no need to use the time-related way as a reference for all regions.

Regions may be scheduled multiple times in a short time, so the amount of probability increase needs to be weighed.

I am not very clear about what you mean here. Would you like to explain more?

The selected region may not be successfully dispatched. How to deal with this situation?

Maybe one possible way is to reset the TTL? @lhy1024

BTW, would you like to join our slack channel?

dimstars · 2020-06-29T08:16:26Z

I think for a newly created region, we can increase the probability of being selected for that region through a kind of mechanism like TTL. But for the regions which have already existed in the system, we can still use the original way. There is no need to use the time-related way as a reference for all regions.

Looks good. Maybe this is easier to implement. Calculating probabilities based on time is really hard to control. Because there will be many uncertain factors, such as scheduling failure and the irregular generation of new regions.

I am not very clear about what you mean here. Would you like to explain more?

This is the question we raised in our discussion. If the probability of a new region is excessively increased, the new region may be selected multiple times, leading to frequent scheduling. But if you don't use the "calculating probabilities" approach, there should be some mechanisms to avoid this.

BTW, would you like to join our slack channel?

Sure.

rleungx · 2020-06-29T08:31:56Z

This is the question we raised in our discussion. If the probability of a new region is excessively increased, the new region may be selected multiple times, leading to frequent scheduling. But if you don't use the "probability" approach, there should be some mechanisms to avoid this.

The reason why this problem existed is that we want to avoid the hot spot problem. Just as @nolouch said, "Import a new table into a big cluster, the region of this table will be unevenly distributed, which will bring some hot store issues", once we have selected a new region to balance, maybe that region can be removed from the new region set.

lhy1024 · 2020-06-29T08:45:37Z

Maybe one possible way is to reset the TTL?

@rleungx What does it mean, can you explain it？

lhy1024 · 2020-06-29T08:46:15Z

But for the regions which have already existed in the system, we can still use the original way.

I think it may be a good idea!

rleungx · 2020-06-29T08:47:59Z

Maybe one possible way is to reset the TTL?

@rleungx What does it mean, can you explain it？

If we have selected a region and create an operator for it, but the operator is timeout due to some reason. Do we need to reset/refresh the TTL?

dimstars · 2020-07-01T02:58:52Z

If we put a new region with TTL into a new region set and select them first. For the regions that already exist, the selection still follows the original method. When the new region is successfully dispatched, it is removed from the collection.

For different generation rates of regions and the probability of preferential selection of new regions, the following situations may occur:

Region is generated rapidly, and the new region set is getting larger and larger.
Periodically removing older regions from the new set solves this problem. (TTL is needed here)
New Regions remained at a stable number.
This is the ideal state.
The collection of new regions is very small. Whenever a region is generated, it will be selected almost immediately.
I'm not sure that would have a bad effect.

The disadvantage is that it's very difficult to give a probability between a new set and another set. Maybe we can experiment to get the right value, or maybe it's not a fixed value.

lhy1024 · 2020-09-22T09:40:00Z

We may need to consider adding the configuration, effect and cluster size of the strategy and some other information to the telemetry

nolouch added type/enhancement The issue or PR belongs to an enhancement. component/schedule Scheduling logic. epic/co-workload labels Jun 9, 2020

rleungx added the status/TODO The issue will be done in the future. label Jun 10, 2020

HunDunDM closed this as completed Jun 10, 2020

HunDunDM reopened this Jun 10, 2020

rleungx mentioned this issue Jul 2, 2020

region: Add accelerate regions schedule API #2599

Merged

dimstars mentioned this issue Jul 6, 2020

Increase the probability of balancing new regions #2611

Closed

rleungx removed the epic/co-workload label Dec 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase the probability of balancing new regions #2524

Increase the probability of balancing new regions #2524

nolouch commented Jun 9, 2020 •

edited

Loading

lhy1024 commented Jun 10, 2020 •

edited

Loading

lhy1024 commented Jun 10, 2020 •

edited

Loading

dimstars commented Jun 29, 2020 •

edited

Loading

dimstars commented Jun 29, 2020 •

edited

Loading

rleungx commented Jun 29, 2020

dimstars commented Jun 29, 2020 •

edited

Loading

rleungx commented Jun 29, 2020

lhy1024 commented Jun 29, 2020

lhy1024 commented Jun 29, 2020 •

edited

Loading

rleungx commented Jun 29, 2020

dimstars commented Jul 1, 2020

lhy1024 commented Sep 22, 2020

Increase the probability of balancing new regions #2524

Increase the probability of balancing new regions #2524

Comments

nolouch commented Jun 9, 2020 • edited Loading

Feature Request

Describe your feature request related problem

Describe alternatives you've considered

Teachability, Documentation, Adoption, Migration Strategy

lhy1024 commented Jun 10, 2020 • edited Loading

lhy1024 commented Jun 10, 2020 • edited Loading

dimstars commented Jun 29, 2020 • edited Loading

dimstars commented Jun 29, 2020 • edited Loading

rleungx commented Jun 29, 2020

dimstars commented Jun 29, 2020 • edited Loading

rleungx commented Jun 29, 2020

lhy1024 commented Jun 29, 2020

lhy1024 commented Jun 29, 2020 • edited Loading

rleungx commented Jun 29, 2020

dimstars commented Jul 1, 2020

lhy1024 commented Sep 22, 2020

nolouch commented Jun 9, 2020 •

edited

Loading

lhy1024 commented Jun 10, 2020 •

edited

Loading

lhy1024 commented Jun 10, 2020 •

edited

Loading

dimstars commented Jun 29, 2020 •

edited

Loading

dimstars commented Jun 29, 2020 •

edited

Loading

dimstars commented Jun 29, 2020 •

edited

Loading

lhy1024 commented Jun 29, 2020 •

edited

Loading