Database Unification #120
-
It came up in the discussion in the Make Slack (requires a free account here to view), but it should be possible to use the Django ORM outside of Django. Instead of jumping to another web framework like FastAPI (as exciting a technology as it is), we can stay on the stack we all already know well and that is working well, and replace our database connections in Airflow with Django instead of SQLAlchemy. There's a semi-maintained package for making the Django ORM and Airflow interoperate more easily, though I haven't evaluated closely whether it's an effective solution; it might not even be necessary. Unifying the databases seems like a very good idea. There's a ton of complexity that can be removed by doing so. Even if that ends up meaning re-writing the API in a different framework that is compatible with SQLAlchemy, it seems worth doing.
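For reference, using the Django ORM outside a Django project mostly comes down to configuring settings before any models are imported. Below is a minimal sketch of what that might look like from an Airflow task; the app name `openverse_models` and the connection details are hypothetical placeholders, not anything that exists today.

```python
# Minimal sketch (not a tested configuration) of using the Django ORM outside of a
# Django project, e.g. from an Airflow task.
import django
from django.conf import settings

if not settings.configured:
    settings.configure(
        DATABASES={
            "default": {
                "ENGINE": "django.db.backends.postgresql",
                "NAME": "openverse",      # placeholder database name
                "USER": "deploy",         # placeholder credentials/host
                "HOST": "localhost",
                "PORT": 5432,
            }
        },
        # Hypothetical shared app that would hold the models.
        INSTALLED_APPS=["openverse_models"],
    )
    django.setup()

# After setup, ORM queries work as they would inside a Django project, e.g.:
# from openverse_models.models import Image
# Image.objects.filter(provider="flickr").count()
```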
-
This is a phenomenal proposal that makes me very excited for the future of the project. I think everything you've said is spot-on. Anything we can do to reduce the scope of the catalog to "gathering clean, normalized metadata for openly-licensed works" and little else is a huge gain. I also want to re-emphasize that we already have so much redundancy baked in that two DBs feel like overkill even without all of the other benefits.
I'll maybe leave additional comments about implementation details later on. Thank you!
-
I used to work in this codebase quite a bit. While the split into two databases predates even my time, I have some context to add that is missing (or at least that I missed) here. In particular, you seem to be missing why the current process exists at all. The main problem all of this process is solving is that whenever you write into a table in PostgreSQL, every index on that table has to be updated for each insert. At scale, this means that if a table is heavily indexed for quick reading, you cannot write quickly into it. Conversely, if you keep few indices on a table (to enable quick writing), you get slow reads, as well as weaker integrity guarantees. This implies a couple of things.
Suggestion

Way back in 2019, we were planning to move in the direction of ingesting directly to S3 in a less-structured form, and then putting together a proper ETL pipeline that would load data from S3 into the service DB on some schedule, eventually letting us get rid of the Catalog DB. I agree that having the second DB is a problem, but I think the solution is going to be more complicated than you'd like. I don't think you're going to be able to avoid some kind of ETL step at your scale, and I recommend considering something like this.

Alternatively, you can attempt some better data modelling in the service DB to achieve a more normalized schema. However, I'd be pretty surprised if you can get both the read and write performance you need at scale just by normalizing the schema (in fact, I'd expect the read performance to suffer immediately, and the write performance not to improve that much). If you want to experiment with the performance hit from ingesting directly into the service DB, try adding the same indices to the catalog DB that exist on the service DB and see what happens to the ingestions.

Finally, on the popularity data calculation: if you are moving away from the current model, you don't need that view to be materialized. The only reason for materializing it was to deal with some fussiness around how PostgreSQL handles permissions when accessing functions in a foreign database. The functions that do the calculation are plenty fast to run on the fly; the slow part (even when indexed) is the lookup, not the calculation.

If you want more info or context, just @ me so I see it.
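For anyone who wants to run the experiment described above, a rough sketch might look like the following: copy the index definitions from the API (service) database's `image` table onto the catalog's `image` table, then compare ingestion timings before and after. The connection strings, table names, and `exp_` prefix are all illustrative, not the real infrastructure.

```python
# Rough sketch of the suggested experiment: replicate the service DB's indices on the
# catalog table and observe how much ingestion slows down. All names are placeholders.
import psycopg2

API_DSN = "dbname=api_db user=deploy host=api-db"
CATALOG_DSN = "dbname=catalog_db user=deploy host=catalog-db"

with psycopg2.connect(API_DSN) as api_conn, api_conn.cursor() as cur:
    # pg_indexes exposes the full CREATE INDEX statement for every index on a table.
    cur.execute("SELECT indexdef FROM pg_indexes WHERE tablename = 'image'")
    index_defs = [row[0] for row in cur.fetchall()]

with psycopg2.connect(CATALOG_DSN) as catalog_conn, catalog_conn.cursor() as cur:
    for indexdef in index_defs:
        # Index names must be unique per schema, so prefix them for the experiment.
        cur.execute(indexdef.replace("INDEX ", "INDEX IF NOT EXISTS exp_"))
    catalog_conn.commit()
```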
-
Audience: Openverse team & community
Purpose: Describe our current data infrastructure and propose a more unified/simpler solution
Outline:
Current infrastructure
Description
Presently we have two databases: the Catalog database and the API database (henceforth "Catalog" and "API"). These databases reflect each other quite closely, although the Catalog is almost entirely unindexed. When a provider API script is run, it generates a TSV that is then loaded into the Catalog database. Some cleaning is performed at this point (e.g. verifying license information), but there are no constraints to what can be added to the Catalog.
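For reference, the TSV load described above is essentially a PostgreSQL COPY. A minimal sketch, with illustrative column names, connection details, and file path rather than the actual catalog schema:

```python
# Hedged sketch of loading a provider TSV into the catalog with COPY.
# Table, columns, and file name are placeholders.
import psycopg2

with psycopg2.connect("dbname=catalog user=deploy host=catalog-db") as conn:
    with conn.cursor() as cur, open("flickr_images.tsv") as tsv:
        cur.copy_expert(
            "COPY image (foreign_identifier, url, license, license_version) "
            "FROM STDIN WITH (FORMAT text)",  # FORMAT text expects tab-separated values
            tsv,
        )
```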
Every day, the data from a given media table (e.g. `image`) is loaded into an associated materialized view. This view adds standardized popularity information for the media. Once a month, calculations are run across all images for a given provider to determine the standardized popularity constant, which is then used to normalize popularity across all images. It's worth noting that the provider field used to group the media is also unindexed.

Once a week, the ingestion server initiates a "data refresh". During this process, it establishes a connection from the API to the Catalog using a foreign data wrapper. The server creates a temporary table that mirrors the media table on the API but without any indices. It then copies data from the media materialized view in the Catalog into this temporary table. It performs cleaning on the media data, then attempts to re-apply the indices that are present on the production table. Currently these indices do not get renamed to the appropriate names after the go-live swap, and the index re-application has not run successfully since July 2021. Theoretically, once the indices have been successfully re-applied, the temporary table is swapped for the production one.
After all the above steps are completed, the Elasticsearch cluster is reindexed and the API can serve this new data.
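To make the refresh steps above concrete, here is roughly what the process boils down to in SQL, written as a simplified sketch. The foreign schema, table, and index names are illustrative; the real process re-applies many indices and handles constraints as well.

```python
# Simplified, illustrative version of the weekly "data refresh" as a sequence of SQL
# statements. Names (fdw_catalog, temp_import_image, etc.) are placeholders.
DATA_REFRESH_STEPS = [
    # 1. Mirror the production table without any indices, so writes stay fast.
    "CREATE TABLE temp_import_image (LIKE image INCLUDING DEFAULTS);",
    # 2. Copy rows from the catalog's materialized view over the foreign data wrapper.
    "INSERT INTO temp_import_image SELECT * FROM fdw_catalog.image_view;",
    # 3. Re-apply the indices that exist on the production table (the step that fails today).
    "CREATE UNIQUE INDEX temp_import_image_identifier_key ON temp_import_image (identifier);",
    # 4. Swap the refreshed table in for the live one.
    "ALTER TABLE image RENAME TO image_old;",
    "ALTER TABLE temp_import_image RENAME TO image;",
]
```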
Problems with the current infrastructure
Here are a few (non-exhaustive) issues I've identified with the current infrastructure.
Proposed infrastructure
I believe that the best way to improve our infrastructure is to collapse the two databases into a single database and have a single source of truth for the schema and migrations. This would involve increasing the scope of the provider API scripts and removing the ingestion server entirely. I also recommend that we modify our materialized views to make new data accessible to the API as soon as possible.
Single database
As far as I can tell, there is no significant advantage that binds us to maintaining two databases. It would be feasible to have the provider API scripts and the API server interact with the same database. Additional care could be taken on ingestion to ensure that the provider API data is cleaned and meets the constraints of the database (a sketch follows at the end of this section).
This also shortens the time it takes for new data to be made available in the API; as soon as the new data has been loaded into the table, it's ready to be served by the API.
This would cut down our AWS costs considerably, since we would no longer be maintaining two costly databases.
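To illustrate the kind of up-front care mentioned above, here is a sketch of record validation that could run before writing into a single, constrained database. The field names and license list are illustrative only, not the real schema.

```python
# Hedged sketch of pre-insert validation for provider records going straight into the
# shared database. Field names and the allowed-license set are placeholders.
ALLOWED_LICENSES = {"by", "by-sa", "by-nc", "by-nd", "by-nc-sa", "by-nc-nd", "cc0", "pdm"}


def validate_record(record: dict) -> dict:
    """Raise if a provider record would violate the shared table's constraints."""
    if not record.get("foreign_identifier"):
        raise ValueError("foreign_identifier is required")
    if record.get("license") not in ALLOWED_LICENSES:
        raise ValueError(f"unrecognised license: {record.get('license')!r}")
    if not (record.get("url") or "").startswith("https://"):
        raise ValueError("url must be an https URL")
    return record
```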
Single source of truth
Ideally we would be able to refer to a single source of truth for this single database rather than maintain multiple implementations of the models. This could be accomplished by adding a new `openverse-database` repository to our project. This repository would provide some interface to the models (that both the catalog and API can use) and a mechanism for migrating the database. We have a choice between two major libraries here: Flask (SQLAlchemy) or Django.

Since the API is already using Django, this could be developed as a Django plugin. The models could then be used by both the catalog and API, although only one server should be given the "migration" mechanism (or an entirely different process should be carved out for database migration). Airflow uses Flask/SQLAlchemy, which might cause problems/conflicts in the runtime environment if we also attempt to enable the use of Django models in the provider API scripts. Django does not have first-class/plugin support for materialized views and extensions, as far as I can tell. This could make some of our migrations tedious and repetitive.
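To make the Django-plugin option concrete, a shared models app might look roughly like this. Everything here is a hypothetical sketch rather than the actual Openverse schema; both the catalog and the API would install the app, and only one deployment would be allowed to run migrations.

```python
# Hypothetical sketch of a shared "openverse-database" Django app's models module.
from django.db import models


class Image(models.Model):
    identifier = models.UUIDField(unique=True)
    foreign_identifier = models.CharField(max_length=1000)
    provider = models.CharField(max_length=80, db_index=True)
    license = models.CharField(max_length=50)
    url = models.URLField(max_length=1000)
    created_on = models.DateTimeField(auto_now_add=True)

    class Meta:
        app_label = "openverse_models"  # hypothetical shared app
        db_table = "image"
```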
Alternatively, we could define the database models using SQLAlchemy. SQLAlchemy does provide plugin support for materialized views, functions, and extensions[^1]. With it we could leverage other popular and rapidly growing API frameworks on the API server end, like FastAPI[^2]. This wouldn't necessarily require an immediate re-write of the API; we could set the Django models as "unmanaged" and have them reference the tables defined by this database repo. That would present some amount of duplication until the API had been re-written in a framework compatible with SQLAlchemy.
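As a sketch of that arrangement (table and column names illustrative): the table is declared once with SQLAlchemy, and the existing Django API points an unmanaged model at the same table so Django never tries to create or migrate it.

```python
# Hedged sketch: the table defined once with SQLAlchemy (migrated via Alembic), while the
# Django API references it through an unmanaged model. Names are placeholders.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Image(Base):
    __tablename__ = "image"

    id = Column(Integer, primary_key=True)
    identifier = Column(String(36), unique=True, nullable=False)
    provider = Column(String(80), index=True)
    license = Column(String(50), nullable=False)


# In the Django API, the same table would be referenced without being managed:
#
# class Image(models.Model):
#     identifier = models.CharField(max_length=36, unique=True)
#     provider = models.CharField(max_length=80)
#     license = models.CharField(max_length=50)
#
#     class Meta:
#         managed = False   # Django never creates or migrates this table
#         db_table = "image"
```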
Increase responsibility of provider API scripts
One key piece of this infrastructure is pushing as much of the data integrity checks, data cleaning, and data ingestion time into the provider API scripts as possible. Airflow can manage this process quite easily. Performing this piece in Airflow gives us access to all of the Airflow notification & reporting infrastructure as well, so we can be notified immediately of data integrity issues.
I believe we can add complexity at this step without affecting the ease of contribution for community-contributed provider API scripts. It would require expanding our provider API script utilities to encompass this complexity/functionality.
This has the added advantage of centralizing all of the data ingestion steps in the catalog, rather than having them be spread across the catalog and the ingestion server.
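A rough sketch of what an expanded provider-script utility could look like as an Airflow task, so that data integrity failures surface through Airflow's existing retry and alerting machinery. The helper `upsert_records` and the cleaning/validation steps are hypothetical, not existing utilities.

```python
# Hedged sketch: wrap cleaning and loading in an Airflow task so integrity failures raise
# and get reported through Airflow's normal retry/alerting machinery.
from airflow.decorators import task


def upsert_records(records: list[dict]) -> int:
    """Hypothetical shared utility; in practice an ON CONFLICT upsert into the media table."""
    raise NotImplementedError


@task(retries=2)
def load_provider_batch(records: list[dict]) -> int:
    """Validate a batch of provider records, then load them into the shared database."""
    cleaned = []
    for record in records:
        record["url"] = record["url"].strip()  # example of an in-script cleaning step
        if not record.get("license"):
            raise ValueError(f"record {record.get('foreign_identifier')} has no license")
        cleaned.append(record)
    return upsert_records(cleaned)
```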
Separating popularity information
Our current data structure requires refreshing a materialized view in order to make any new data available to the API (among other steps). This is required so that standardized popularity data can be added for each media record. It may be possible to have the API interact directly with the media tables without popularity data, and move that information to a separate table/matview. This table would be joined to the media table when reindexing Elasticsearch, potentially with a sensible default for missing data (see the sketch below). The materialized view refresh could then occur asynchronously without affecting the availability of new data.
This does bring up the question of "what do we do with data that does not yet have popularity data". Should it be indexed in Elasticsearch anyway? At what priority/popularity level? I think it behooves us to show the user every image we might have for something, even if we display images without popularity data last.
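A sketch of the reindex-time join proposed above: media rows are joined to a separate popularity table, with a default of 0 for rows that have no popularity data yet. The table and column names are illustrative.

```python
# Hedged sketch of the Elasticsearch reindex query with popularity kept in its own table.
REINDEX_QUERY = """
SELECT
    img.identifier,
    img.title,
    img.provider,
    COALESCE(pop.standardized_popularity, 0.0) AS standardized_popularity
FROM image AS img
LEFT JOIN image_popularity AS pop
    ON pop.identifier = img.identifier;
"""
```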
Diagram of proposal
Image description
The image is a set of two "swimlane" diagrams describing the current infrastructure and the proposed infrastructure. Both diagrams have steps which use certain shapes to showcase which database they interact with. Runtime duration information is provided where available.
The top diagram is composed of three lanes, "Catalog", "Ingestion Server", "API" from top to bottom. The diagram represents the following flow:
The bottom diagram is only two lanes, "Catalog" and "API" from top to bottom. The diagram represents the following flow (only a single database is referenced for all steps):
Benefits
The following are the projected benefits of the proposed system:
Footnotes
[^1]: https://github.com/olirice/alembic_utils
[^2]: https://fastapi.tiangolo.com/