-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAG for generating and inserting Rekognition labels #4836
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4836 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic, super clear and I really love the way you've handled the variables here, something for me to think about! This tested well for me; I tried all the bonus testing instructions and got to see the tags excluded from the data refresh and then included once the provider was removed from the filtering. I also tested throwing an error midway through and then retriggering the DAG, worked great 🥳 My only blocking request is the comment about accurately reporting the failure count.
|
||
|
||
@task | ||
def run_sql( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I'm also pulling this out in #4833 in the same way, perfect 😄
catalog/dags/data_augmentation/rekognition/add_rekognition_labels_dag.py
Outdated
Show resolved
Hide resolved
# Templated variables for the DAG | ||
TEMPLATE_S3_PREFIX = ( | ||
"{{ var.value.get('REKOGNITION_DATASET_PREFIX', '%s') }}" % S3_FILE_PREFIX # noqa: UP031 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really fantastic, I love how you've handled the variables per-environment and defaults here ✨
catalog/dags/data_augmentation/rekognition/add_rekognition_labels.py
Outdated
Show resolved
Hide resolved
catalog/dags/data_augmentation/rekognition/add_rekognition_labels_dag.py
Show resolved
Hide resolved
@stacimc I've included a total failed count and addressed the other issues you brought up - hopefully everything looks good now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! 🚢
Fixes
Fixes #4645 by @AetherUnbound
Description
This PR adds a DAG which will parse and insert Rekognition labels into the catalog database. See the issue above for motivation and steps, and the generated DAG documentation for a description of the DAG. In short, this DAG iterates over the contents of the labels file in chunks, and inserts the records into a temporary table. The position in the file is tracked, and parsing can be resumed if any failures occur. Once the records are in the temporary table, a
batched_update
run is triggered to upsert the tags from the temporary table into the baseimage
table.In order to make this possible, I've made a few changes to other parts of the project:
data_augmentation
, where I imagine other augmentation DAGs will go in the future (e.g. the dead link pipeline)load_to_s3_entrypoint.sh
to ensure that the test files get loaded in properly once Minio is initializedsmart_open
, to facilitate seamless chunked reading from S3additional_where
parameter to thebatched_update
DAG in order to facilitate reading from a separate table as part of the insertWarning
Due to the addition of a new dependency, the catalog should be deployed immediately after this DAG is merged.
Testing Instructions
ov just down -v
to clear everythingcatalog/.env
file:ov just api/init
to initialize the catalog database with dataov just c
to start the catalog services (bonus: runov just logs load_to_s3
to verify the new file gets loaded into Minio)batched_update
DAGadd_rekognition_labels
and wait for it to completenotify_parse_complete
logs to ensure 200 were loaded and 1 was skippedov just catalog/pgcli
then\x on
(to show full results) thenSELECT identifier, tags FROM image WHERE identifier = 'b840de61-fb9d-4ec5-9572-8d778875869f';
- both Flickr and Rekognition tags should show up for this resultimage_data_refresh
DAG by enabling it. Once that is complete, the Rekognition results should NOT show up in the API due to our wholesale filtering of Rekognition (e.g. http://localhost:50280/v1/images/b840de61-fb9d-4ec5-9572-8d778875869f/ should not have any Rekognition results)openverse/ingestion_server/ingestion_server/cleanup.py
Line 60 in 5ccf7ff
SLACK_MESSAGE_OVERRIDE
Variable totrue
and run theadd_rekognition_labels
DAG again to see the Slack messages that would be generated from a given run.Checklist
Update index.md
).main
) or a parent feature branch.ov just catalog/generate-docs
for catalogPRs) or the media properties generator (
ov just catalog/generate-docs media-props
for the catalog or
ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin