native build directories hardlinking file fetcher de-duplicates in-flight download requests #206
base: main
Conversation
pkg/cas/hardlinking_file_fetcher.go
Outdated
select {
case <-d.wait:
	if d.err != nil {
		return d.err
This is incorrect. A caller of GetFile() may receive an error that was produced by another invocation. This should not happen, as errors may also include things like context cancelation, etc.
Updated to retry if it failed because of context cancelation.
nit: bb_storage/pkg/auth/remoteAuthorizer.authorizeSingle also retries, but has the same kind of flaw. Consider the case where the timeout is 55s, downloading the file takes 50s, one call comes in every 30s, and the first call is cancelled after 45s. Then all retries for the later requests will also fail, because when each retry starts, its context has less than 50s left:

0s: RPC 0 starts
30s: RPC 1 starts
45s: RPC 0 cancelled, RPC 1 retries
60s: RPC 2 starts
85s: RPC 1 times out (5s left to download), RPC 2 retries
90s: RPC 3 starts
115s: RPC 2 times out (15s left to download), RPC 3 retries
120s: RPC 4 starts
145s: RPC 3 times out (15s left to download), RPC 4 retries
...

One solution is to manually cancel a context.Background() when there are no requests left waiting. I don't know how complex the implementation and tests for this kind of solution would be, or whether it is worth bothering with.
Can't we just eliminate the error propagation logic entirely? Just change the downloads map to:

downloads map[string]<-chan struct{}

Only use that to wait for an existing download to complete.
Thanks for the suggestion. Updated to use the new map.
ff.downloads[key] = d
ff.downloadsLock.Unlock()

downloadErr := ff.base.GetFile(ctx, blobDigest, directory, name, isExecutable)
Am I correct that there is still a slight race here, where files get downloaded unnecessarily/redundantly? Shouldn't we do another tryLinkFromCache() at this specific point?
Good catch. Added.
pkg/cas/hardlinking_file_fetcher.go
Outdated
type download struct {
	wait chan struct{}
	err  error
Suggested change:
-	err error
+	// err is available only when wait has been closed.
+	err error
pkg/cas/hardlinking_file_fetcher.go
Outdated
}

type download struct {
	wait chan struct{}
nit: It is possible to be stricter here with wait <-chan struct{}, and further down use:

wait := make(chan struct{})
d = &download{wait: wait}
close(wait)
Thanks for the suggestion.
pkg/cas/hardlinking_file_fetcher.go
Outdated
d = &download{wait: make(chan struct{})}
ff.downloads[key] = d
ff.downloadsLock.Unlock()
ff.downloadsLock does not necessarily need to be held when setting d.err, as d.err is only read after d.wait has been closed. This means that defer can be used.
Suggested change:
-	d = &download{wait: make(chan struct{})}
-	ff.downloads[key] = d
-	ff.downloadsLock.Unlock()
+	wait := make(chan struct{})
+	d = &download{wait: wait}
+	ff.downloads[key] = d
+	ff.downloadsLock.Unlock()
+	defer func() {
+		ff.downloadsLock.Lock()
+		delete(ff.downloads, key)
+		ff.downloadsLock.Unlock()
+		close(wait)
+	}()
Thanks for the suggestion. Updated to use defer. The code is cleaner.
wasMissing, err = ff.tryLinkFromCache(key, directory, name)
if err == nil {
	// File appeared in cache, no download needed.
	ff.downloadsLock.Lock()
	delete(ff.downloads, key)
	ff.downloadsLock.Unlock()
	close(d.wait)
	return nil
} else if !wasMissing {
	// tryLinkFromCache had a real error (not just missing).
	ff.downloadsLock.Lock()
	d.err = err
	delete(ff.downloads, key)
	ff.downloadsLock.Unlock()
	close(d.wait)
	return err
}
See comment on line 125.
Suggested change:
-	wasMissing, err = ff.tryLinkFromCache(key, directory, name)
-	if err == nil {
-		// File appeared in cache, no download needed.
-		ff.downloadsLock.Lock()
-		delete(ff.downloads, key)
-		ff.downloadsLock.Unlock()
-		close(d.wait)
-		return nil
-	} else if !wasMissing {
-		// tryLinkFromCache had a real error (not just missing).
-		ff.downloadsLock.Lock()
-		d.err = err
-		delete(ff.downloads, key)
-		ff.downloadsLock.Unlock()
-		close(d.wait)
-		return err
-	}
+	wasMissing, d.err = ff.tryLinkFromCache(key, directory, name)
+	if !wasMissing {
+		// Either the file appeared in cache, no download needed,
+		// or tryLinkFromCache had a real error (not just missing).
+		return d.err
+	}
HongboDu-at
left a comment
Thank you for the review.

From internal usage, this reduces the time to fetch inputs (~4 GB) from ~60s to ~20s.