Adjust download size scaling factor (part 1) #567
Conversation
Status of this: I was looking at it today, but I weirdly keep getting timeouts when testing in Qubes (the queue keeps pausing). This is with ~1MB files; I'll give it a deeper look tomorrow.
```python
timeout = math.ceil(size_in_bytes / SMALL_FILE_DOWNLOAD_TIMEOUT_BYTES_PER_SECOND)
return max(MINIMUM_TIMEOUT, timeout)

timeout = math.ceil(size_in_bytes / DOWNLOAD_TIMEOUT_BYTES_PER_SECOND)
```
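For context, a minimal sketch of the two-rate calculation these snippets come from; the constant values are the ones quoted later in this thread, and the function name is illustrative rather than the actual one in the module:

```python
import math

# Assumed values, taken from the discussion below.
SMALL_FILE_DOWNLOAD_TIMEOUT_BYTES_PER_SECOND = 10_000
DOWNLOAD_TIMEOUT_BYTES_PER_SECOND = 100_000
MINIMUM_TIMEOUT = 3
ONE_MB = 1_000_000


def _get_download_timeout(size_in_bytes: int) -> int:
    """Scale the request timeout with the expected file size."""
    if size_in_bytes < ONE_MB:
        # Small files download at a slower effective rate and get a 3-second floor.
        timeout = math.ceil(size_in_bytes / SMALL_FILE_DOWNLOAD_TIMEOUT_BYTES_PER_SECOND)
        return max(MINIMUM_TIMEOUT, timeout)
    return math.ceil(size_in_bytes / DOWNLOAD_TIMEOUT_BYTES_PER_SECOND)
```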
OK tested a bit more carefully today and the timeouts are too small for the >1 MB case - I'm able to download <1 MB files without issue but anything 1 MB or larger was timing out until I increased this. Want to try a bit with 1 MB files and confirm my finding here? If you agree then we can crank these timeouts a bit. We should make sure we have reasonable timeouts for files 1 MB - 500 MB in size as those are the supported large file sizes for SD (obviously 500 MB will be sloooooow so worth doing in the background while working on something else).
So currently, this is what we're doing:
```python
# ceil(file_size / 10000 bytes/sec) is expected for file sizes less than 1MB
# ceil(file_size / 100000 bytes/sec) is expected for file sizes greater than or equal to 1MB
# minimum timeout is 3 seconds for files less than 1MB
assert one_byte_file_timeout == 3
assert one_KB_file_timeout == 3
assert fifty_KB_file_timeout == 5
assert five_hundred_KB_file_timeout == 50
assert one_MB_file_timeout == 10
assert five_MB_file_timeout == 50
assert five_hundred_MB_file_timeout == 5000
assert GB_file_timeout == 10000
```
This means:
- 1 byte file times out after 3 seconds
- 1KB file times out after 3 seconds
- 50KB file times out after 5 seconds
- 500KB file times out after 50 seconds
- 1MB file times out after 10 seconds
- 5MB file times out after 50 seconds
- 500MB file times out after 5000 seconds (~1.39 hours)
- 1GB file times out after 10000 seconds (~2.78 hours)
I think the 1MB timeout is too aggressive, as you pointed out. We don't want to make the timeouts too large, and we'll probably introduce a ramp-up timeout policy on request retries, so how about two small adjustments: (1) use two minimum timeouts, 10s for files smaller than 1MB and 75s otherwise, and (2) multiply the timeout by 1.5.
With the adjustments, we would end up with:
- 1 byte file times out after 10 seconds
- 1KB file times out after 10 seconds
- 50KB file times out after 10 seconds
- 500KB file times out after 75 seconds
- 1MB file times out after 75 seconds
- 5MB file times out after 75 seconds
- 50MB file times out after 750 seconds (12.5 minutes)
- 500MB file times out after 7500 seconds (~2.08 hours)
- 1GB file times out after 15000 seconds (~4.17 hours)
Do these seem like reasonable default timeouts?
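For concreteness, a rough sketch of what those two adjustments would look like (the names and structure here are illustrative, not the actual code):

```python
import math

ONE_MB = 1_000_000
SMALL_FILE_RATE = 10_000     # bytes/sec assumed for files < 1MB
LARGE_FILE_RATE = 100_000    # bytes/sec assumed for files >= 1MB
SMALL_FILE_MINIMUM = 10      # seconds
LARGE_FILE_MINIMUM = 75      # seconds


def proposed_timeout(size_in_bytes: int) -> int:
    if size_in_bytes < ONE_MB:
        rate, minimum = SMALL_FILE_RATE, SMALL_FILE_MINIMUM
    else:
        rate, minimum = LARGE_FILE_RATE, LARGE_FILE_MINIMUM
    # Adjustment (2): pad the estimate by 1.5x; adjustment (1): per-size minimum.
    return max(minimum, math.ceil((size_in_bytes / rate) * 1.5))
```

With these constants, proposed_timeout(500_000_000) comes out to 7500 seconds, matching the table above.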
Those adjusted timeouts do sound reasonable to me - were you able to reliably download 1 MB and 5 MB files in 75 seconds? I had no problem with small <1 MB files, so those small-file timeouts are definitely solid, and 2 hours is about right for 500 MB (that's also the session timeout on the source interface). 500 MB is the largest file we need to consider, since the SecureDrop source interface rejects files larger than that.
yeah ultimately we really do need to implement range requests support... it's gonna be painful when a 500 MB download fails in the first 1 MB and we don't find out until after 2 hours (or when it fails after 499 MB and we restart the entire download).
One other suggestion to simplify this a bit: instead of using two different speeds and minimums, we could use the same speed everywhere and add a flat 25 seconds to account for smaller files that take proportionally longer, i.e. timeout = ceil((file_size/100000) * 1.5) + 25
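Expressed as code, that simplification would look roughly like this (illustrative sketch, not the actual diff):

```python
import math

DOWNLOAD_TIMEOUT_BYTES_PER_SECOND = 100_000  # single rate for all file sizes


def simplified_timeout(size_in_bytes: int) -> int:
    # The flat +25 seconds absorbs the slower effective rate of small files.
    return math.ceil((size_in_bytes / DOWNLOAD_TIMEOUT_BYTES_PER_SECOND) * 1.5) + 25
```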
> were you able to reliably download 1 MB and 5 MB files in 75 seconds

Yup, and to make sure I was exercising the faster rate I downloaded a 1.2MB file, which in my manual tests takes less than 12 seconds. According to Tor stats, it should take about 7.5s to download ~1MB. I ended up pushing the suggestion above, timeout = ceil((file_size/100000) * 1.5) + 25, which increases the timeouts across the board, with 1MB set to a 40s timeout.
lgtm!
```python
assert KB_file_timeout == 26
assert fifty_KB_file_timeout == 26
assert half_MB_file_timeout == 33
assert MB_file_timeout == 40
```
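Extending the same formula to the larger supported sizes gives the following (my arithmetic, using the formula from the thread above; the helper name is illustrative):

```python
import math


def timeout(size_in_bytes: int) -> int:
    # Final formula from the discussion: ceil((file_size/100000) * 1.5) + 25
    return math.ceil((size_in_bytes / 100_000) * 1.5) + 25

print(timeout(5_000_000))    # 100 seconds for a 5MB file
print(timeout(500_000_000))  # 7525 seconds (~2.09 hours) for a 500MB file
```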
I have successfully tested a 1 MB file download with a 40-second timeout and a 5 MB file download with a 100-second timeout in staging.
Description
This is a quick adjustment to our timeout estimation function so that it scales the timeout per file: in general, the timeout increases as the file size increases.
It turns out that we could only download very small files due to a scaling bug that added 7.5 seconds to the timeout for every single byte in the file. With this change, the timeout is calculated by dividing the size of the file by an estimated download rate, which we get from Tor metrics: https://metrics.torproject.org/torperf.html?start=2019-05-02&end=2019-07-31&server=onion
A couple of notes:
This is not a final solution. It is a quick fix so that, most of the time, our download timeouts are reasonable and we can actually download files. It does not account for unexpected network latency issues; that will be addressed in a follow-up PR that ramps up the timeout value each time the download job is retried, up to a cap on the number of retry attempts.
The download speed per byte is slower for small files (~50KiB) due to overhead and other possible factors, based on the metrics linked above. Our scaling function takes this into account by using a slower rate when the file is less than 1MiB. You can see how this is reflected in the test assertions I added.
As you might expect, a timeout needs to be larger than the expected download time, which is why the rates we use are slower than the rates reported by Tor.
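To make that last point concrete, a quick back-of-the-envelope check against the ~7.5 s-per-MB figure quoted from Tor metrics earlier in the thread (my arithmetic, not a measurement from this PR):

```python
import math

expected_download_seconds = 7.5                                 # ~1MB over Tor, per the metrics above
timeout_seconds = math.ceil((1_000_000 / 100_000) * 1.5) + 25   # 40s for a 1MB file
print(timeout_seconds / expected_download_seconds)              # ~5.3x headroom over the expected download time
```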
* I'm leaving #546 open until the ramp-up timeout implementation and/or range requests are added. Actually, this does fix the bug that prevented us from downloading large files, but I want to make sure we add the ramp-up / range request support soonish...
Fixes #546
Test Plan
Checklist
If these changes modify code paths involving cryptography, the opening of files in VMs, or network traffic (via the RPC service), Qubes testing in the staging environment is required. For fine-tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable: