Large scale considerations #173
Comments
Thanks for bringing these up! A couple of thoughts on python/fastapi perf, distributed filesystem, and database.
BTW we have done some load testing using locust and we can process around 100 rps (requests per second) on a standard laptop using a single quetz worker (for the download endpoint, which generates a redirect to an S3 file).
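For context, a minimal locustfile along these lines might look like the sketch below. The endpoint path is a placeholder, not Quetz's actual route, and following the S3 redirect is disabled so only the Quetz response is measured.

```python
# Hedged sketch of a locust load test against a download endpoint.
from locust import HttpUser, task, between


class DownloadUser(HttpUser):
    # Simulated users pause briefly between requests.
    wait_time = between(0.1, 0.5)

    @task
    def download_package(self):
        # Placeholder channel/package path; the real download route issues a
        # redirect to the S3-hosted file, so we skip following the redirect
        # to measure only the Quetz side.
        self.client.get(
            "/get/my-channel/linux-64/some-package-1.0-0.tar.bz2",
            allow_redirects=False,
        )
```

Running it with something like `locust -f locustfile.py --host <quetz-url>` prints the stats summary table (requests, failures, response-time percentiles) at the end, which is the output worth keeping around.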
@btel Aw, yeah, locust is wonderful! (disclaimer: maintains the conda-forge feedstock 👿). For giggles, can you toss in the stats summary output? While pretty, I find the charts lie, as small error counts, etc. can still look flat. It would be lovely to have this under test... not for absolute numbers, but to catch significant regressions (e.g. the server starts throwing lots of 500s). Basically, CI caches the repo's …

To that point, having this for every route is important, especially with a couple of admins and a horde of users changing lots of stuff (especially permissions!) at a furious rate, as it can reveal nasty things like full-table database locks which don't get caught when routes are tested in isolation.

Another tool in the shed, to both improve the baseline and help debug perf regressions, is the opencensus stack, with a simple example here. It looks like there is some work going on to give finer-grained insights on the fastapi side, while the sqlalchemy integration is already very robust. I've used the jaeger integration (also on conda-forge, might need some maintainer ❤️) for reporting. I've yet to do a FULL full-stack integration with opencensus-web, but this is the real cadillac, as you can trace a button press in the SPA to pixels on the page for a single request, which is a thing of beauty when it works properly.

Having all these hooks built in to the various tiers, ready to be turned on by a site admin, can help them really own an application, beyond simple log mining, and can yield much better issue reports. Trying to get this level of insight from a "hostile" application is... harder.
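To make the tracing idea concrete, here is a hedged sketch of wiring the opencensus SQLAlchemy integration to a Jaeger exporter. The service name, agent address, and span name are illustrative assumptions, not anything Quetz ships today, and it assumes `opencensus`, `opencensus-ext-sqlalchemy`, `opencensus-ext-jaeger`, and a local Jaeger agent are available.

```python
# Hedged sketch: report SQLAlchemy query spans to a local Jaeger agent.
from opencensus.ext.jaeger.trace_exporter import JaegerExporter
from opencensus.trace import config_integration
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

# Patch SQLAlchemy so every query is reported as a span.
config_integration.trace_integrations(["sqlalchemy"])

exporter = JaegerExporter(
    service_name="quetz-api",      # placeholder service name
    agent_host_name="localhost",   # assumes a Jaeger agent on the default port
    agent_port=6831,
)

tracer = Tracer(exporter=exporter, sampler=AlwaysOnSampler())

# Anything executed inside the span (including SQLAlchemy queries) shows up
# in the Jaeger UI as a trace with child spans.
with tracer.span(name="download-endpoint"):
    pass  # handler logic would go here
```

In a real deployment the sampler would typically be probabilistic rather than always-on, so tracing can stay enabled without adding overhead to every request.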
Hi @bollwyvl, thanks for the valuable suggestions. Automating load testing is definitely on our roadmap. I haven't ever used the opencensus stack; it's definitely something I would like to investigate. Thanks again for the pointers!
I forgot about the locust stats, I need to re-generate them, because stupidly I did not keep them. BTW, we benchmarked the download endpoint because it's the one that's going to be hit most frequently by users (and CIs), but I agree we should test other endpoints as well. Bartosz
Minor update: we got the jaeger-feedstock updated to the most recent version (Go has been rethinking its packaging approach, har).
Another update: a go-ipfs-feedstock should exist soon (not up yet, but GH is having a bad day, I guess). @wolfv, @yuvipanda and I have been semi-seriously kicking around ideas on federated stuff for a while, so I guess it's a little more real (to me) now!
What could speed up quetz by a lot is a smart caching system. Most quetz content is static files that don't get updated very often, and there is no need to make a db or even a fastapi request for content that has already been served and hasn't changed since. For an in-memory cache, fastapi-cache is the obvious choice. But the easiest strategy to implement is probably generating proper ETags in fastapi and then putting a really large NGINX content cache on top (see the sketch below for the ETag part). NGINX is really fast at serving static content, and if most of the requests get cached, the raw fastapi performance is less of an issue. The other things that might be worth a look are Backblaze for storage and Cloudflare as a CDN.
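As a rough illustration of the ETag half of that strategy, here is a sketch of conditional responses on a FastAPI route. The route path and the `load_repodata` helper are hypothetical stand-ins, not Quetz's real API; an NGINX cache in front would then revalidate with `If-None-Match` instead of hitting the database.

```python
# Hedged sketch: serve a static-ish JSON file with an ETag and answer 304
# when the client (or a cache in front) already has the current version.
import hashlib

from fastapi import FastAPI, Request, Response

app = FastAPI()


def load_repodata(channel: str) -> bytes:
    # Hypothetical loader; stands in for however the file would really be read.
    return b'{"packages": {}}'


@app.get("/channels/{channel}/repodata.json")
def repodata(channel: str, request: Request) -> Response:
    body = load_repodata(channel)
    etag = '"%s"' % hashlib.sha256(body).hexdigest()

    # A client or reverse-proxy cache that already holds this version gets a
    # 304 and never re-downloads the body.
    if request.headers.get("if-none-match") == etag:
        return Response(status_code=304, headers={"ETag": etag})

    return Response(
        content=body,
        media_type="application/json",
        headers={"ETag": etag},
    )
```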
Personally I wouldn't dare to put a production system right on top of IPFS. But IPFS could be the perfect solution for a long-term package archive and/or package distribution system.
I spent time trying to run IPFS for one of my side projects, but switched back to an S3 API instead. You still need to run pinning nodes, and the well-tested setups pin content to filesystems. So you end up needing to run a cluster that requires you to run file systems, which can get messy. Latency was also highly variable. I think it's getting better, but it's not useful at medium to large scale right now.
I would like to open this issue to list the points that are important to keep in mind in the development of Quetz from the perspective of large scale use. What I have in mind:

Language or dependencies
- the load that FastAPI could handle

Database/storage
- PGSQL: projections of volumetry and ops/s to be able to handle

Others

This is just a draft to be updated with contributions (concerns, solutions, links to PRs, etc.)!