Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store-tikv: drop invalid cached region #4506

Merged
merged 15 commits into from
Sep 25, 2017

Conversation

atmzhou
Copy link
Contributor

@atmzhou atmzhou commented Sep 12, 2017

Some cached region may be out-of-date for they are long not used.
To avoid use these regions, we allocate each cached region a TTL, which records when the region may be out-of-date.
When an out-of-dated region is identified, we remove the region out of the region cache.

Fix #2511
Fix #4498

@coocood
Copy link
Member

coocood commented Sep 12, 2017

@atmzhou
I think adding an accessTime uint64 field in Region is simpler.
Multiple threads may update the access time concurrently,
we can update it with atomic operations to avoid race.

@coocood
Copy link
Member

coocood commented Sep 12, 2017

@atmzhou
Please update the title of the PR.

@atmzhou atmzhou changed the title Ningnanzhou/dropinvalidregioncache Ningnanzhou/DropInvalidCachedRegion Sep 12, 2017
@atmzhou atmzhou changed the title Ningnanzhou/DropInvalidCachedRegion store-tikv/DropInvalidCachedRegion Sep 12, 2017
@atmzhou atmzhou changed the title store-tikv/DropInvalidCachedRegion store-tikv: drop invalid cached region Sep 12, 2017
@atmzhou atmzhou closed this Sep 12, 2017
@atmzhou atmzhou reopened this Sep 12, 2017
@atmzhou
Copy link
Contributor Author

atmzhou commented Sep 12, 2017

@coocood
Do you think it is necessary to replace the mutex by atomic operation?

@atmzhou
Copy link
Contributor Author

atmzhou commented Sep 12, 2017

@disksing @AndreMouche @tiancaiamao
Could you please review my code at convenience?
Thanks~

@coocood
Copy link
Member

coocood commented Sep 12, 2017

@atmzhou
We can't replace the mutex by atomic operation.
RegionCache uses RWLock which allows concurrent reading.
But reading updates the access time, which is a write operation, so we need atomic operation to avoid race.

@atmzhou
Copy link
Contributor Author

atmzhou commented Sep 12, 2017

@coocood
So, what is your conclusion? Should we use the RWLock or atomic operations?
By the way, I think that the data race is rare and a little data race does not matter here.

@coocood
Copy link
Member

coocood commented Sep 12, 2017

@atmzhou
We use both RWLock and atomic operation, RWLock for the RegionCache, atomic operation for the Region.
Data race is strictly not allowed in TiDB. The CI fails if any data race is detected.

@atmzhou
Copy link
Contributor Author

atmzhou commented Sep 13, 2017

@coocood
OK, I will eliminate any potential data race here.
By the way, in fact, this data race cannot be detected because it does not affect correctness. It just removes some region from the cache falsely or accesses an out-of-dated region, which will also be handled.

@coocood
Copy link
Member

coocood commented Sep 13, 2017

@atmzhou
I will be detected when two goroutines access the same variable concurrently and at least one of the accesses is a write.

See https://golang.org/doc/articles/race_detector.html

sync.RWMutex
regions map[RegionVerID]*Region
sorted *llrb.LLRB
ttls map[RegionVerID]time.Time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about move the ttl field to Region, so no extra map need to be maintained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Region should not have a field ttl.
If we want to save a map structure, we can merge the two maps as
regions map[RegionVerID]<*Region, time.Time>.
However, this may incur updating a lot of codes.
So, I prefer keeping the two maps separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atmzhou
Then how about accessTime, Region can have an accessTime field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coocood @tiancaiamao
I have make a structure named CachedRegion as follows:
type CachedRegion struct {
region Region
ttl int64
}
Accordingly, each RegionCache will have a map map[RegionVerID]*CachedRegion.
In this way, we save a map structure.
I think this should conform to your thought.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atmzhou
ttl is changing over time, we can't store a ttlvalue in field.

@atmzhou atmzhou force-pushed the ningnanzhou/dropinvalidregioncache branch from d01a2d0 to f7305aa Compare September 18, 2017 07:18
// CachedRegion encapsulates {Region, TTL}
type CachedRegion struct {
region *Region
ttl int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ttl is a duration, not a timestamp, why not name it lastAccess directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will change it. Thanks~


// CachedRegion encapsulates {Region, TTL}
type CachedRegion struct {
region *Region
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not rename Region to CachedRegion directly and add the new field there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regions may be used to partition requests. For this kind of tasks, ttl is of no use.
In fact, ttl is only used for cache.
So, I prefer defining a new structure CachedRegion here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Region is a cached region with or without access time.
Use different type just to add the access time makes confusion.

// For example, T1 find ttl valid but before T1 freshes the ttl, T2 find ttl invalid
// In this case, T2 will invoke this function but T1 may have freshed the ttl
// Then, T2 should not drop the cached region
ttl := atomic.LoadInt64(&cachedregion.ttl) // we may not need an atomic load, I think a simple load is enough?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to check the access time again, just delete it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, both rechecking and directly deleting are correct.
Directly deleting may simplify the code.
However, I think it leads to strange cases. Suppose that there are two visits on the same cached region. If the first visit determines to refresh the ttl of the region, when it conducts this action, it may find that the region was removed because of out-of-date. This is strange.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first reader found the region is valid, then he will just refersh the region, he won't lookup the map for the region again, so he never find that region was removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,You are right.
Either deleting or rechecking is correct.
The choice depends on whether we want the execution is equivalent to the actual happening order of the two operations.
By rechecking, I choose that the execution should be equivalent to the actual happening order of the two operations.
This may be more strict than directly deleting.
I think the two possible choices are OK.

func (c *RegionCache) GetCachedRegion(id RegionVerID) *Region {
for {
c.mu.RLock()
cachedregion, ok := c.mu.regions[id]
Copy link
Member

@coocood coocood Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just RUnlock here.
The lock only protects the regions map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will change the code. Thanks~

lastAccessTime := time.Unix(lastAccess, 0)
if time.Since(lastAccessTime) < rcDefaultRegionCacheTTL {
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can extract L94-L98 into a method of CachedRegion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will revise this.

@atmzhou atmzhou force-pushed the ningnanzhou/dropinvalidregioncache branch from 5ed707b to 7f1a6bf Compare September 19, 2017 02:38
@@ -26,12 +28,23 @@ import (
goctx "golang.org/x/net/context"
)

const (
rcDefaultRegionCacheTTL = time.Minute * 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a longer duration is more suitable. 10 minute for instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Thanks~

if time.Since(lastAccessTime) < rcDefaultRegionCacheTTL {
return true
}
return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just return time.Since(lastAccessTime) < rcDefaultRegionCacheTTL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Thanks~

@@ -66,17 +79,61 @@ func (c *RPCContext) GetStoreID() uint64 {
return 0
}

func (c *CachedRegion) isValid() bool {
lastAccess := atomic.LoadInt64(&c.lastAccess) // we may not need an atomic load, I think a simple load is enough?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to use atomic, or the race detector will complain. Please remove the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Thanks~
By the way, in fact, we do not need to use atomic.


// CachedRegion encapsulates {Region, TTL}
type CachedRegion struct {
region *Region
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about region Region ?

// CachedRegion encapsulates {Region, TTL}
type CachedRegion struct {
region *Region
lastAccess int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastAccess time.Time ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use int64 to enable atomic operation

// For example, T1 find ttl valid but before T1 freshes the ttl, T2 find ttl invalid
// In this case, T2 will invoke this function but T1 may have freshed the ttl
// Then, T2 should not drop the cached region
if cachedregion.isValid() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's under lock, why there are concurrent threads may fresh it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because concurrent threads use atomic operations to fresh it.
In this way, we save write lock.

@disksing
Copy link
Contributor

LGTM.

@@ -138,26 +168,24 @@ func (c *RegionCache) LocateKey(bo *Backoffer, key []byte) (*KeyLocation, error)

// LocateRegionByID searches for the region with ID
func (c *RegionCache) LocateRegionByID(bo *Backoffer, regionID uint64) (*KeyLocation, error) {
c.mu.RLock()
if r := c.getRegionByIDFromCache(regionID); r != nil {
r := c.getRegionByIDFromCache(regionID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@disksing Should we use GetCachedRegion here?
This method is only used by HTTP status server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have different parameters. BTW, I think we need lock here. @atmzhou

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@disksing I remember that you said this function is only used for test. So I did not add lock in getRegionByIDFromCache.
Do you think we should add lock? If yes, I will add.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We don't want debug API have data race too, especially it may affect online service.

@atmzhou atmzhou force-pushed the ningnanzhou/dropinvalidregioncache branch from cf8a5b8 to cf482f0 Compare September 22, 2017 09:27
@disksing
Copy link
Contributor

PTAL @coocood

Copy link
Member

@coocood coocood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@atmzhou atmzhou merged commit b9c3bc4 into master Sep 25, 2017
@atmzhou atmzhou deleted the ningnanzhou/dropinvalidregioncache branch September 25, 2017 04:59
@sre-bot sre-bot added the contribution This PR is from a community contributor. label Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution This PR is from a community contributor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

store/tikv: eliminate old region from cache. A better strategy to cache region's address.
5 participants