Very slow data extraction through opendap when using twitcher #97
ping @dbyrns @fmigneault, here is the performance issue my colleague observed, for you to assess, thanks
Thanks @tlvu, we will take care of this.
@tlvu @aulemahal I suspect this to be the culprit, but it needs to be investigated.
@fmigneault It's the current head of master in the birdhouse-deploy repo: https://github.com/bird-house/birdhouse-deploy/blob/775c3b392813872cb8045be473d6e4b091d52d88/birdhouse/docker-compose.yml#L310
Follow-up from @fmigneault: From memory, the fix contributed by David Caron was more related to caching for repetitive requests, but that doesn't really impact the one-shot download speed of a Thredds file as presented here. Otherwise, the other issue about streaming the response in Twitcher seems to have been integrated for a long time. The only thing I can see is the following (lines 58 to 63 in dccbe33).
@dbyrns @tlvu No code change to Twitcher, just changed the WSGI app runner. I stepped through the code until going into […], and didn't bother going through how […]. My test results are as follows:
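The results table embedded in the original comment did not survive extraction. A rough sketch of how such a single-file download comparison can be timed (hypothetical URLs, not the original benchmark script):

```python
# Rough timing sketch: stream the same large file through the twitcher proxy
# and directly from THREDDS, and compare elapsed times. URLs are hypothetical.
import time
import requests

def time_download(url, chunk_size=1024 * 1024):
    """Stream `url` and return (elapsed seconds, total bytes read)."""
    start = time.perf_counter()
    total = 0
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            total += len(chunk)
    return time.perf_counter() - start, total

for label, url in [
    ("via twitcher", "https://pavics.example.org/twitcher/ows/proxy/thredds/fileServer/testdata/big.nc"),
    ("direct thredds", "https://pavics.example.org/thredds/fileServer/testdata/big.nc"),
]:
    elapsed, size = time_download(url)
    print(f"{label}: {size / 1e6:.1f} MB in {elapsed:.1f} s")
```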
FYI, benchmark of a few other WSGI servers: https://www.appdynamics.com/blog/engineering/a-performance-analysis-of-python-wsgi-servers-part-2/
I gave a try to the highest-rated ones other than […]. Obviously, we can't really see a difference for a single request of file download. At the very least, anything works better than `waitress`.
Fixes #97

Replaces `waitress` by `gunicorn` by default. `waitress` seems to do some kind of deduplication of contents with extra read/write of chunks during data streaming. Could not find any option/flag to make `waitress` avoid this.

Test results can be found here:
- #97 (comment)
- #97 (comment)

Leaving `waitress` in the dependencies so that pre-existing deployments can use whichever they want between `gunicorn` / `waitress`.

Important note: `gunicorn > 20` needs to be started with `pserve` to have the correct port assigned. There is a change in `PasteDeploy` config parsing where starting directly with `gunicorn` doesn't detect the right port (it doesn't matter much in this case, since the default is 8000, which is also the one employed). Regardless, the Docker image was already using `pserve`, so everything should work just fine.
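A minimal sketch of what the PasteDeploy server section could look like for this switch, assuming the standard `egg:waitress#main` / `egg:gunicorn#main` entry points and a start through `pserve twitcher.ini`; this is not the exact shipped configuration:

```ini
# Previous default: waitress
# [server:main]
# use = egg:waitress#main
# listen = 0.0.0.0:8000

# New default: gunicorn (gunicorn > 20 picks up host/port correctly only
# when launched through `pserve`, as noted above)
[server:main]
use = egg:gunicorn#main
host = 0.0.0.0
port = 8000
workers = 4
```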
I've tried to backport this fix to PAVICS, see commit bird-house/birdhouse-deploy@84df6c9. The fix provided no improvement at all on my test VM, see notebook https://gist.github.com/tlvu/40346480fd17deaa10915bd0010f55c6#file-test-fix-for-slow-twitcher-ipynb. I'd suggest you test this fix in a PAVICS stack itself to be sure. Maybe additional config changes are required for PAVICS.
Tested with bird-house/birdhouse-deploy@84df6c9 and saw no difference: https://gist.github.com/MatProv/e5065c73fddf508aeaf3bf6304231f15
Shouldn't this Twitcher overhead be 21.201374769210815 / 8.687683343887329 * 100 = 244% and not 5%? :D
Great notebook, ready for automation! I looked at the test result calculation (the bad 5%) and it looks good. I think the notebook output simply needs to be refreshed.
@tlvu Oh no, I forgot to specify that only the first two tests were run for this test (twitcher+nginx+thredds and nginx+thredds). The difference was too obvious, so the actual assertions were useless. So yes, the old cell outputs are a bit misleading here. Won't happen once it is part of the automated suite ;)
Thank you @matprov, now we see that for a 60 GB file, twitcher introduces a latency of 13 s, which is not bad at all given that downloading that file over a 1 Gbps link would require at least 8 minutes.
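A quick back-of-the-envelope check of those figures (sketch only):

```python
# 60 GB over a 1 Gbps link: lower bound on transfer time, ignoring protocol overhead.
file_size_gb = 60                        # file size quoted above
link_gbps = 1                            # link speed
seconds = file_size_gb * 8 / link_gbps   # GB * 8 bits/byte / (Gbit/s)
print(f"{seconds:.0f} s = {seconds / 60:.0f} min")  # 480 s = 8 min
```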
@dbyrns What? I have not seen 3000% anywhere. But we still have a 244% overhead, see my comment #97 (comment). It's not acceptable at all for us. Also, as you are aware, we have a Thredds optimization project with SSDs, and it clearly shows on that project. The testing notebook is private since it has internal hostnames (the host where we fake the SSD with a RAM disk), but the Twitcher overhead vs Nginx alone is routinely between 300% and 400%.
My 3000% came from @aulemahal's original issue description.
We can discuss that in our next meeting, but sadly we don't have the budget to achieve nginx's level of performance either. Please consider a contribution from your project if both the security and the performance are required, or maybe bypass twitcher completely in the internal hostname / RAM disk setup, as this is not the intended use case. Even in the test case used by @matprov, 60 GB in 21 s implies at least a 22 Gbps bandwidth, which is not representative of the real world. Understand me well: I agree that the performance isn't stellar, but we're exhausting our resources, and while @ldperron is still trying to improve the performance, we're starting to exhaust our budget on this issue and will need to stop working on it soon.
@tlvu we might be on a bad lead. You said that caching should not improve performance because we are downloading one big file (a bandwidth issue), but we are looking into whether, by using the opendap protocol, we end up with multiple requests, in which case it is more a latency issue. Also, I just found out that the whole 60 GB is never downloaded in the test, so let's forget the bandwidth part of the conversation. Hopefully light will be shed on this question soon.
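Illustrative only (the URL, variable name and index ranges are made up): an OPeNDAP client turns one "open and subset" operation into several small HTTP requests, metadata first and then constrained data requests, so per-request latency through the proxy matters more than raw bandwidth and only the requested slice is ever transferred:

```python
# Sketch of the requests an OPeNDAP (DAP2) client issues for a subset.
import requests

base = "https://pavics.example.org/twitcher/ows/proxy/thredds/dodsC/testdata/big.nc"

# 1. Metadata requests issued when the dataset is opened.
for suffix in (".das", ".dds"):
    r = requests.get(base + suffix)
    print(suffix, r.status_code, len(r.content), "bytes")

# 2. A constrained data request fetching only a small slice of one variable.
r = requests.get(base + ".dods?tas[0:0][0:10][0:10]")
print(".dods", r.status_code, len(r.content), "bytes")
```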
Good news! In fact it was really a caching issue. @fmigneault this is what we have found that is not currently in the cache and could be added to further improve the performance:
@dbyrns This SQL query seemed to be caused by this, with the resulting service calling the cached ACL. It will require a new Magpie version with added caching of this step.
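Purely as an illustration of the kind of caching being discussed (not the actual Magpie code; the function and region names below are made up), a memoized ACL-resolution step could look like this with Beaker cache regions:

```python
# Illustrative sketch: cache an expensive permission-resolution step so that
# repeated requests for the same user/resource skip the SQL round-trips.
from beaker.cache import cache_region, cache_regions

# Short expiry so permission changes are picked up reasonably quickly.
cache_regions.update({"acl": {"type": "memory", "expire": 20}})

@cache_region("acl")
def resolve_resource_permissions(user_name, service_name, resource_path):
    # Placeholder for the costly part: SQL queries resolving the effective
    # permissions of `user_name` on the requested resource.
    print("resolving permissions (hits the database)")
    return {"read": True, "write": False}

# First call hits the database; the second returns the cached result.
resolve_resource_permissions("alice", "thredds", "/testdata/example.nc")
resolve_resource_permissions("alice", "thredds", "/testdata/example.nc")
```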
@dbyrns I don't remember what I said, but I most probably referred to Thredds caching and not Magpie SQL query caching. Since we repeatedly hit the same file on Thredds, this would make the Thredds cache "hot", so the size of the Thredds cache should not impact this performance test. So I am glad we are onto something with Magpie SQL query caching. I did not expect such a tight coupling between Twitcher and Magpie.
Great! Thanks to @ldperron for the investigation and the patch. @fmigneault can this be merged now, and is it compatible with the current Magpie 1.7.3?
Should be available, it was added in 1.6.
I was referring to this: #97 (comment), and indeed the relation between opendap, twitcher, magpie and thredds was not obvious. So to sum it up:
Points 2 and 3 cause latency, and this is what Ouranosinc/Magpie#388 and bird-house/birdhouse-deploy@7737c5a will solve. Thank you all, and I'm really looking forward to seeing Magpie 3.7 integrated in the stack!
## Overview

Applies the same changes as #174 (cache settings) but disables Twitcher caching explicitly (Magpie caching is already disabled). This is in order to validate that the cache settings are applied and that the test suite can run successfully with cache disabled, to isolate whether the sporadic errors seen in #174 are related to the enabled cache.

## Changes

**Non-breaking changes**
- Added cache settings of `MagpieAdapter` through Twitcher (intended to improve response times, but disabled in this PR).
- Update Magpie/Twitcher to version 3.14.0.

**Breaking changes**
- Unique user emails are enforced as of Magpie 3.13.0; see the birdhouse-deploy changelog for how to find and update conflicting values before the upgrade, otherwise the upgrade will break. See the [Magpie 3.13.0 change log](https://pavics-magpie.readthedocs.io/en/3.13.0/changes.html#bug-fixes) for details regarding the introduced unique email restriction.

## Related Issue / Discussion

- Undo resolution of [DAC-296 Twitcher performance issue](https://www.crim.ca/jira/browse/DAC-296) since caching is now disabled
- Undo resolution of bird-house/twitcher#97 since the cache becomes disabled
- Based on top of #174 with cache-settings fixes
- Related new stress test: Ouranosinc/PAVICS-e2e-workflow-tests#74
Bump version: 1.13.14 → 1.14.0, following merge of pull request #182 from bird-house/cache-settings-off.
@ldperron could you let me know how to reproduce your test setup to find the result above? See original comment #97 (comment). The performance and intermittent problems are back again, see #123.
Working on a notebook running on boréas, I noticed that using twitcher really slows down the process.
MWE:
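The original MWE cells were not preserved in this thread; the sketch below is an illustrative reconstruction of the kind of extraction being timed (URLs, variable name and region bounds are all hypothetical):

```python
# Illustrative reconstruction of the MWE (not the original notebook code).
import time
import xarray as xr

def extract_region(opendap_url):
    """Open a dataset over OPeNDAP and load a small lat/lon box from it."""
    ds = xr.open_dataset(opendap_url)   # lazy: only metadata is fetched here
    sub = ds["tas"].sel(lat=slice(45, 50), lon=slice(-75, -70))
    start = time.perf_counter()
    sub.load()                          # triggers the actual DAP data requests
    elapsed = time.perf_counter() - start
    print(f"loaded {sub.nbytes / 1e6:.1f} MB in {elapsed:.1f} s")
    return sub
```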
Test 1, using twitcher (normal URL):
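The original cell is missing; a hypothetical call through the twitcher proxy, reusing the `extract_region` sketch above, might look like:

```python
# Hypothetical twitcher-proxied OPeNDAP URL.
extract_region("https://pavics.example.org/twitcher/ows/proxy/thredds/dodsC/testdata/example.nc")
```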
Output:
And the CPU usage (of the notebook) is super low, around 2-3% for most of the 3 min. I can see that twitcher (on boréas) runs at 30-60% CPU.
Test 2 bypassing twitcher:
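Again with hypothetical values, the same extraction hitting THREDDS directly (e.g. on its internal port, bypassing the twitcher proxy) might look like:

```python
# Hypothetical direct THREDDS OPeNDAP URL (internal port, no twitcher proxy).
extract_region("http://pavics.example.org:8083/thredds/dodsC/testdata/example.nc")
```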
Output:
Also, CPU usage of the notebook stays reasonable (30-40%).
This is a simple example with a small region, but in my notebook I need to extract the same region (potentially large) from 23 different datasets. Going through twitcher means I have to wait around 1 hr in some cases, even when launching multiple (2-3) extractions in parallel...