API thumbnails, a quandary: to ASGI the Django monolith or create a new async Thumbnails service #2498
Replies: 9 comments 10 replies
-
Thanks for gathering this all together in one place Sara, and illuminating the differences too. My gut reaction is to go with the micro-service approach, with FastAPI in Python. We don't get the benefits of the fully async API, but then we also don't have to work around DRF not being async compatible. My fear with our team capacity would be that the full async conversion work might take significantly longer than a specific, built-for-purpose service. I also worry that we'd encounter a whole host of new problems on any number of routes that are the result of the async shift. A specific service would isolate the changes so we're not trying to tackle new problems across all routes. We can also work to optimize that service individually (since it consumes so much of our traffic) without unnecessarily optimizing other routes on the API. It would require more monitoring and alerting, but I feel our improved ECS structure makes that piece easier at least! It's definitely a quandary, I'm interested to see what others think! |
-
My initial reaction is also in favor of FastAPI + Python-based microservices. We've taken great strides to be able to spin up new services in our infra repo fairly quickly and have a monorepo to support multiple packages. Fundamentally a microservices approach makes sense for me given the resource-intensive nature of some parts of the API: thumbnails, waveforms, and potentially in the future things like video thumbnailing, PDF preview generation, document OCR, or any other complex media analysis we need to perform. None of those services should be able to disrupt the entire API. An additional point, I'm pretty confident our thumbnail endpoint doesn't actually need to talk to the DB. We could, if it isn't true already, make sure that everything the service needs is already indexed in ElasticSearch. I imagine this thumbnail service could be very small and most of the code could be ported over from the existing app. I do appreciate you being explicit about the scaffolding, tooling, and devex work necessary here. I still fundamentally suspect it would be much less work than converting the entire API. |
-
Thanks Sara, this is a really excellent write up. It’s exciting to see the thumbnail instability theory verified so strikingly. I voted in favor of a microservice. My gut feeling is that isolating the changes will be much less disruptive than converting the entire API and perhaps even easier, although I definitely feel some trepidation about the amount of dev tooling/documentation/infrastructure required.
This makes the speed of the microservice approach relative to ASGI Django particularly attractive 😮 |
-
Thanks for laying out the pros and cons in such detail here! Given the complexity of the first option, it seems to me almost a non-starter, while the microservice fits this case well.
I imagine this will be a private service like the Ingestion Server. Only the API will need to reach it, so this saves a lot of security-related headaches.
-
@dhruvkb and @obulat if you have any thoughts here, please share! I'd like to get a sense of what all Openverse maintainers think here before this month's priorities meeting, so we can make a determination about how rapidly to move on planning this project. |
-
I didn't share my preference in the description and wanted to reserve it until other folks have shared their opinions. I am very obviously in the minority and I'll accept that my view is not shared by the folks who have expressed a preference thus far and that it is likely we could go in a direction I disagree with. I can accept that, and even in the minority, at the very least, I'll play devil's advocate a bit 🙂. Regardless of the approach we take, I hope what I share below urges us to more deeply consider these options and their implications and complexities (particularly that of the public-facing microservice approach).
I think we would be better off pursuing Django ASGI. Even though DRF doesn't officially support async views, they've expressed explicitly in the linked issue that they think it's better suited for an extension rather than native support. [The
Swapping out the Elasticsearch and Redis clients to use the async versions is trivial in my experience. Just swap out the client references and add
That is to say: I think that the cost of (a) learning a new framework, (b) writing new local and deployed infrastructure and support for an entirely separate service (dependent service connections, caching, rate limiting, etc etc etc), (c) documenting all of that, (d) managing an additional service's security, and (e) planning it from the ground up, are roughly equivalent to the work, if not more work, that it would take to turn our Django monolith async. However, the ongoing maintenance of a separate service will never go away, whereas the up front work to convert our Django app would amortise, even if it did exceed the effort of creating an entirely new service. We should be careful not to underestimate the cost of developing an entirely new service, even if some of the code could be copy/pasted (with modifications).
Thinking of planning alone, I would wager that an implementation plan to convert our Django app to ASGI would move more swiftly than a greenfield service in a framework that the team does not currently use or have any conventions for.
A pettier (and admittedly a bit of a joke, but also slightly serious) problem is that of the name for this new service. God forbid we call it the "thumbnails" service and it eventually also handles waveform generation (a similarly time-expensive process that would benefit from asynchrony) or whatever else we'd need. Maybe "artefact generation service" would do, but my point is less that finding a useful name is hard and more that understanding the downstream implications of a microservice and how we would decide to extend that service vs creating a new one in the future for other similar quandaries... all of this worries me 🙂
From a general philosophical point of view, I'm also roughly of the opinion (though, again, quite loosely held) that if something can be a monolith, it's probably better off being one. Microservices, based on my own anecdotal experience and that of others I've read and spoken to, are hard to get right and rarely the necessary solution, despite the unavoidable added complexity of them.
All of this is coupled with the incontrovertible fact that thumbnail generation is not the only part of our application that can benefit from asynchrony. Objectively speaking, our application is ripe for asynchronous programming because of how many outbound requests we make in all our critical paths. It would be sad to deny those improvements to search (our most critical path!) without either inevitably converting the Django app to ASGI (in which case, we've surely spent a great deal of unnecessary time writing this whole new service) or re-writing search (and thus, essentially, our entire app) into the new service (which is an absurd prospect given our team's resources).
So, all of that to say, I am firmly in favour of converting Django to ASGI over the creation of a new service. However, I am also in the minority 🙂. What I would urge is that we not try to move too quickly on this and that we truly understand the implications of whatever our choice is, especially in the long term. Notably, only @zackkrida has shared a concrete reason for why microservices would make sense long term. Other than that, the reasons shared by others appear to prioritise the short term: either how quick or how disruptive it would be to write the new service vs converting the API.
I'm sympathetic to Zack's reasoning that microservices are a nice way to isolate CPU intensive operations that shouldn't disrupt the rest of the service. However, I'd like to address it directly. We offset the cost of thumbnail generation to Photon or imaginary (if we fix #2442 that way), so neither affects the API's CPU right now. The reason thumbnail generation is so tricky has nothing to do with CPU bound work and everything to do with long-running outbound requests.
If (big if!) we want to isolate CPU consumption (or the like) for waveform generation or other future artefact generation we need to service clients, the optimal place to create that separation of concerns and isolation is not at the public-facing API level. Successful microservice architectures generally have an HTTP entrypoint that handles authorisation, rate limiting, etc, in a single place, and then delegates the requests to the "isolated services". If we implement an entirely new client-facing HTTP API for thumbnails in a microservice, it will need to reimplement our authorisation scheme and rate limiting. These are not trivial parts of our Django application. The optimal place to create the isolation would be after the HTTP access layer.
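To illustrate the layering described above, here is a minimal sketch of a single entrypoint that handles authorisation and rate limiting once and then delegates to an isolated internal service. Everything in it is hypothetical and for illustration only: the token set, the in-memory counter, the limit of two requests, and the `waveform_service` stand-in are assumptions, not how our Django app implements these concerns.

```python
import asyncio

# Hypothetical credentials and limits, for illustration only.
VALID_TOKENS = {"token-abc"}
RATE_LIMIT = 2
request_counts: dict[str, int] = {}


async def waveform_service(audio_id: str) -> str:
    """Stand-in for an internal, isolated service with no auth logic of its own."""
    await asyncio.sleep(0.01)
    return f"waveform:{audio_id}"


async def entrypoint(token: str, audio_id: str) -> tuple[int, str]:
    """Single HTTP-facing layer: authorise, rate limit, then delegate."""
    if token not in VALID_TOKENS:
        return 401, "unauthorised"
    request_counts[token] = request_counts.get(token, 0) + 1
    if request_counts[token] > RATE_LIMIT:
        return 429, "rate limited"
    # Only requests that passed both checks reach the isolated service.
    return 200, await waveform_service(audio_id)


async def main() -> list[tuple[int, str]]:
    return [
        await entrypoint("bad-token", "a1"),
        await entrypoint("token-abc", "a1"),
        await entrypoint("token-abc", "a2"),
        await entrypoint("token-abc", "a3"),
    ]


responses = asyncio.run(main())
```

The internal service never sees an unauthenticated or over-limit request, which is the property a separate public-facing thumbnails service would have to rebuild for itself.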
If we wanted to offload waveform generation to a microservice, for example, we would create something like
The critical difference in this approach is, as I said above, that it removes the need to duplicate authentication/authorisation and rate limiting (or offload those to yet another microservice, another solution some folks take). That is to say, it properly isolates the unit that needs to be isolated without requiring reimplementation (or abstraction, if breaking it into another microservice) of units that can (and should) be shared.
To me, these make microservices like the one we've considered so far for the thumbnails service a complete non-starter. We could try to do the other microservices approach, but that doesn't make sense for thumbnails: it's already as isolated as it could be from a CPU perspective. And if we did do that approach, we'd still need an asynchronous public HTTP entrypoint that reimplements our authentication/authorisation and rate limiting. Literally the only thing we need for thumbnails is for the request lifecycle to be asynchronous. The only reason a microservice would facilitate that is to bypass the need to convert our entire API in service of that. However, as I've shown, that introduces a whole host of difficulties.
To speak additionally to documentation difficulties for a public facing HTTP microservice: our current API documentation is generated using a plugin that parses the Django views. If we split the
To summarise and clarify what could be misunderstood: I am not against Openverse using microservices entirely. I just do not think that this is the right place to use them for the following reasons:
Of all of these, the most philosophically important one is that I do not see microservices as the necessary solution to our current problem (because we do not currently have the problem Zack mentioned of CPU intensive operations that could interrupt other parts of the application). If we convert to Django ASGI, then when we do encounter a situation where a microservice fits the bill (to isolate CPU/memory/whatever usage to a separate box) we will already be equipped to make use of that because async Django can delegate requests to internal microservices without us needing to reimplement any of the HTTP entrypoint concerns. Curious to hear what the rest of the team thinks though as I know my opinion diverges significantly from what the rest of y'all have shared so far. |
-
Thank you for such a detailed write-up, Sara! I've been swayed in both directions while reading the replies. The conversion of the DRF app into an async version, especially without official async support from DRF itself, as well as updating all of the dependencies to use async, seems like a really big undertaking. In addition to this, it will touch all of the endpoints and has a risk of breaking many more endpoints. However, the fact that a microservice will have to be public, and will need to handle the rate-limiting and auth separately will probably be more difficult to support down the line. |
-
One important aspect of this which I find myself conflicted on is the necessity of authentication and rate limiting on these
Generating these artifacts on demand as we do currently is time consuming, as the generation is bound by network and/or CPU to varying degrees depending on the type of artifact. Once generated, however, these artifacts can be cached indefinitely and served over a CDN with no traffic to our actual services. To me, then, assuming the risk of disrupting the API is minimized or eliminated, we actually benefit from unrestricted traffic to these endpoints insofar as it initializes artifact generation and "warms" our caches for all Openverse users. This may be an incorrect viewpoint and/or insufficient on its own to justify microservices, but it is something I wanted to mention.
-
This was discussed in today's priorities meeting and everyone felt it would be a good idea to do some timeboxed testing of the ASGI and ADRF changes to help better inform a decision here. I'll create an issue for that and share it here. |
-
Problem
In the last week we have deployed a new dedicated instance of our general Django API service and routed all thumbnail requests to the new instance. We've dubbed this the "API Thumbnails service". We did this in an effort to explore whether API instability was being caused by instability specifically in the thumbnails route. In particular, we wanted to see whether worker timeouts were being caused exclusively by long-running thumbnails requests and workers getting blocked up by them.
The issue for that work is in the private infrastructure repository: https://github.com/WordPress/openverse-infrastructure/issues/541 (it just outlines the process for deploying the new service, nothing else of interest)
Since deploying the thumbnails service to production yesterday, we have seen captured exceptions in Sentry plummet to nearly zero across the board for the general API service:
Specifically, the last of the "SystemExit" exceptions that showed Gunicorn aborting a worker due to timeouts happened at June 27, 2023 21:18:35 UTC. The thumbnails service was deployed at around June 27, 2023 22:10:00 UTC. Since then, we have seen none of the extremely reliable SystemExit exceptions happen on the general API service. We do, however, see those happening on the new API thumbnails service.
This effectively confirms our hypothesis that our API is currently incapable of effectively handling the complexities and load of thumbnails requests. Specifically, we:
Combined with the fact that thumbnail requests account for nearly 90% of Openverse traffic that misses the cache, this means we need to develop a more performant approach to serving thumbnail requests. Two potential approaches are outlined below.
Why not convert the existing Django app to ASGI
Description
There are two distinct approaches that we can take to solve this. I've described them as I see them below. Both warrant deep thought and consideration. Neither is trivial.
Convert the Django monolith to ASGI
The existing API service is built on Django and Django Rest Framework. Django supports ASGI (async version of WSGI) out of the box. DRF does not. There are ways to make DRF work with async views (as outlined in the issue), but they are non-trivial. We would need to re-write essentially every view to prevent new performance problems from being introduced if we converted our Django app to ASGI.
However, both of the issues listed above could be solved within our existing app if we converted it to ASGI. The first would be solved just by nature of changing the HTTP requests in the thumbnails route to be asynchronous so that they do not block the worker (see footnote 1). The second issue could also be solved because we could return responses before the async request to retrieve the thumbnail from Photon is complete and still ensure that Photon cached the response (or cache Photon's error response so we avoid making the same request unnecessarily in the future).
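As a rough sketch of the second point, the snippet below shows a handler that responds immediately while the upstream request keeps running in the background and populates a cache, including on failure. It uses only `asyncio` and is an illustration under assumptions: the `cache` dict stands in for a shared cache such as Redis, `fetch_from_photon` simulates the upstream call, and the placeholder response string is invented. None of these names come from our codebase.

```python
import asyncio

# Hypothetical in-memory stand-in for a shared cache such as Redis.
cache: dict[str, str] = {}
# Hold references so background tasks are not garbage collected mid-flight.
background_tasks: set[asyncio.Task] = set()


async def fetch_from_photon(image_url: str) -> str:
    """Simulated slow upstream request (hypothetical helper)."""
    await asyncio.sleep(0.05)
    return f"thumbnail-bytes-for:{image_url}"


async def warm_cache(image_url: str) -> None:
    try:
        cache[image_url] = await fetch_from_photon(image_url)
    except Exception as exc:
        # Cache the failure too, so the same request is not repeated needlessly.
        cache[image_url] = f"error:{exc}"


async def thumbnail_view(image_url: str) -> str:
    if image_url in cache:
        return cache[image_url]
    task = asyncio.create_task(warm_cache(image_url))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    # Respond right away; the upstream request continues in the background.
    return "202 accepted, thumbnail pending"


async def main() -> tuple[str, str]:
    first = await thumbnail_view("https://example.com/a.jpg")
    await asyncio.gather(*background_tasks)  # wait only so the demo can show the result
    second = await thumbnail_view("https://example.com/a.jpg")
    return first, second


first_response, second_response = asyncio.run(main())
```

The first call returns before the simulated upstream request finishes; the second is served from the cache. What the real view should return in the "pending" case is a design decision this sketch does not settle.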
The most complicated aspect of this approach is that we need to convert our entire service to async. This might include needing to change our Elasticsearch and Redis clients to the async version (both are built into the libraries we already use) right off the bat. We would probably have to make significant changes or tweaks to a large number of unit tests to accommodate that or general ASGI conversion.
Something to keep in mind for this approach, however, that is perhaps its most significant benefit, is that it isn't just the thumbnails endpoint that could benefit from being async. Search, for example, currently makes multiple requests to Elasticsearch, Redis, and makes several requests to filter dead links. All of those can benefit from being asynchronous because it would free up the worker to work on another request while it waits for ES/Redis/Dead link requests to finish. Waveform generation would also easily benefit from this for similar reasons: the worker can service another request whilst the waveform generation happens in the subprocess.
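The benefit described above can be sketched with plain `asyncio`. The three coroutines below are simulated stand-ins (each just sleeps for 0.1 seconds) for the Elasticsearch, Redis, and dead link requests; their names and return values are assumptions for illustration and do not reflect our actual search code.

```python
import asyncio
import time


async def query_elasticsearch(term: str) -> list[str]:
    """Simulated Elasticsearch query (hypothetical stand-in)."""
    await asyncio.sleep(0.1)
    return [f"{term}-1", f"{term}-2"]


async def read_redis(key: str) -> str:
    """Simulated Redis read (hypothetical stand-in)."""
    await asyncio.sleep(0.1)
    return f"value-for-{key}"


async def is_link_alive(result: str) -> bool:
    """Simulated dead link check (hypothetical stand-in)."""
    await asyncio.sleep(0.1)
    return True


async def search(term: str) -> list[str]:
    # Elasticsearch and Redis are awaited together rather than in sequence.
    results, _cached = await asyncio.gather(
        query_elasticsearch(term), read_redis(term)
    )
    # Dead link checks for every result also run concurrently.
    alive = await asyncio.gather(*(is_link_alive(r) for r in results))
    return [r for r, ok in zip(results, alive) if ok]


async def main() -> tuple[list[str], list[str], float]:
    started = time.perf_counter()
    # One event loop (one worker) services both searches at the same time.
    cats, dogs = await asyncio.gather(search("cat"), search("dog"))
    return cats, dogs, time.perf_counter() - started


cat_results, dog_results, total_elapsed = asyncio.run(main())
```

Run back to back, these simulated calls would take roughly 0.8 seconds for the two searches; awaited concurrently they take roughly 0.2 seconds, because the worker is free to make progress on other requests while each one waits.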
Benefits:
Drawbacks:
Create an async micro-service
As an alternative to converting our entire application from WSGI to ASGI Django, we could opt to create a fit-to-purpose service built from the ground up using asynchronous programming. From my perspective, there are only two languages we should consider for this: Node and Python. I believe both of these languages are capable of fulfilling the needs of this new service. We already use and have advanced knowledge throughout the team in these languages. Adding a new language to our stack (Go being perhaps the most obvious choice) is enticing, but I think it is ultimately a good way to make this more complicated than it needs to be. Of these, I lean heavily towards Python only because all existing code dealing with servicing thumbnails requests is already in Python and it's likely that many aspects of this service could be copy/pasted with light refactors for asynchrony and to accommodate a different framework (if necessary).
Regardless, both languages have excellent async support (nowadays) with frameworks galore to choose from. For both, I'd recommend the following:
Python: Falcon or FastAPI. Both are very fast and support async views out of the box. We already use Falcon for the ingestion server. FastAPI is designed for async, but Falcon's support for async is fine. Neither provides a database access layer, but SQLAlchemy supports asyncio.
Node: Express or Koa (the "next gen" from the same folks as Express). Both are fast. Neither provides a database access layer. Knex is probably the best option here (Prisma is awesome but it has a lot of heavy tooling requirements to work as expected and also really wants to manage migrations for you, which we don't need as they'll be handled with Django).
None of these options come with a database access layer built in. That's fine as SQLAlchemy for Python and Knex for Node are both excellent options if we want them. However, we could theoretically significantly simplify things and avoid needing any database access if we relied on the general API's single result view to retrieve information for the thumbnail (specifically, the relevant upstream image URL). This would avoid the complexities of needing to make sure data model changes are reflected in multiple places. It would introduce some latency into the request, but that might be worth it for reduced overall complexity.
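A minimal sketch of that database-free lookup might look like the following. It is an illustration under assumptions: the base URL is a placeholder, the `url` field name in the single result response is assumed, and `fake_fetch_json` is a stand-in for whatever async HTTP client the service would actually use.

```python
import asyncio
from typing import Awaitable, Callable

# Any async callable that takes a URL and returns parsed JSON.
FetchJson = Callable[[str], Awaitable[dict]]

# Placeholder base URL for the general API's single result view.
API_BASE = "https://api.example.org/v1/images"


async def resolve_upstream_url(identifier: str, fetch_json: FetchJson) -> str:
    """Ask the general API for one result and return its upstream image URL."""
    payload = await fetch_json(f"{API_BASE}/{identifier}/")
    # Assumes the single result response exposes the upstream URL as `url`.
    return payload["url"]


async def fake_fetch_json(url: str) -> dict:
    """Stand-in for a real async HTTP client call (hypothetical)."""
    await asyncio.sleep(0)
    identifier = url.rstrip("/").rsplit("/", 1)[-1]
    return {"id": identifier, "url": f"https://upstream.example.com/{identifier}.jpg"}


upstream_url = asyncio.run(resolve_upstream_url("abc123", fake_fetch_json))
```

Passing the fetch function in keeps the thumbnail service free of any data model knowledge; the extra hop to the general API is where the added latency mentioned above would come from.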
Benefits:
Drawbacks:
Summary
In summary, we have two potential approaches:
To begin the discussion, I'd like to ask two questions:
Please share your thoughts below 🙂
Footnotes
1. We can't just use async_to_sync or sync_to_async within the thumbnails view. All that does is make async or sync code compatible with the other. If the app is still a WSGI app, and therefore fundamentally a synchronous app, workers are bound to a single request at a time. Those utilities allow us to have asynchronous aspects to the request, but without converting the entire app to ASGI, the entire lifetime of the request would still be bound to a single worker, even if that request did things "asynchronously".