-
Notifications
You must be signed in to change notification settings - Fork 54
Use the default provider categories during ingestion #635
Conversation
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks straightforward! Just one question.
Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
Signed-off-by: Olga Bulat <obulat@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
@@ -121,7 +121,6 @@ class ImageCategory(Enum): | |||
|
|||
# Default image category by source | |||
DEFAULT_IMAGE_CATEGORY = { | |||
"flickr": ImageCategory.PHOTOGRAPH.value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mention why this is being removed? As I understand Flickr is exclusively photos (and some videos) and not illustrations or other media types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flickr returns one of 3 type
values, and we set the category as photograph
only to the photos
, not screenshots
or other
:
def _get_category(image_data):
"""
Flickr has three types:
0 for photos
1 for screenshots
3 for other
Treating everything different from photos as unknown.
"""
if "content_type" in image_data and image_data["content_type"] == "0":
return ImageCategory.PHOTOGRAPH.value
return None
So, even before this PR, Flickr script was using photograph
as the category for type 0. I don't know if we should investigate what types 1 and 3 are, and whether we can use them for categorizing as well.
@@ -6,6 +6,7 @@ | |||
|
|||
from common.extensions import extract_filetype | |||
from common.licenses import is_valid_license_info | |||
from common.loader import provider_details as prov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think I personally prefer the full provider
here instead of prov
because it's clearer, but it doesn't matter much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the full provider
here, but it's prov
in all of the provider scripts, so I just used this for consistency (so that I don't have to change what I copy-pasted :) ) Do you think we should open an issue to replace provider
with prov
, @zackkrida?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to using the full name rather than an alias 😄 I think an issue for that would be excellent! It'd be a solid "good first issue" as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Some small nits (take them or leave them) but otherwise looks great. Successfully saw the digitized_artwork
category on my 6k Cleveland Museum local DAG run.
Co-authored-by: Zack Krida <zackkrida@pm.me>
Fixes
Fixes WordPress/openverse#1683 by @dhruvkb
Description
This PR adds a default
category
to media items in theMediaStore
'sclean_media_metadata
method.With this change, this is how
category
for an item is determined:get_record_data
method (or when callingadd_item
with old scripts).MediaStore
uses thecategory
value a provider script passes.category
isNone
,MediaStore
tries to add the category that is default for provider, and is listed inprovider_details.py
module.Testing Instructions
Run one of the data collection DAGs until you see
Wrote 100 lines to the disc
in thePull data
task log. Stop the DAG. Wait a couple of seconds until the DAG finishes, and then check the data in the database (I use DataGrip connected to the localhost's DB). Thecategory
column should be filled. For example, for "clevelandmuseum", category should be set todigitized_artwork
, and for "sciencemuseum", it should be set tophotograph
.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin