Refactor Wikimedia Commons to use ProviderDataIngester #614
Conversation
Whew, this was a big one! 😄
streams = self.get_value_by_name(metadata, "streams")
try:
    if not streams:
        audio = self.get_value_by_name(metadata, "audio")
        streams = self.get_value_by_name(audio, "streams")
    streams_data = [stream["value"] for stream in streams][0]
    file_data = self.get_value_by_name(streams_data, "header")
    if not file_data:
        file_data = streams_data
except (IndexError, TypeError):
    file_data = []
return file_data
I know this is unaltered from the OG script, but could we alter this so that it checks for keys/indices rather than a try/except block?
This bit of logic is actually somewhat fiddly because of the structure of the data we get back from the API, which involves lists of dicts with "name" and "value" keys (example file). I pushed a refactor to remove the try/catch safely but I can see why they were added 😄
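For a sense of what that kind of key-checking approach looks like, here is a minimal sketch (hypothetical code, not the exact refactor that was pushed; `extract_file_data` is an illustrative name), assuming the metadata is a list of dicts with "name" and "value" keys as described above:

def get_value_by_name(key_value_list, name):
    # Return the "value" of the first dict whose "name" matches, else None.
    if not isinstance(key_value_list, list):
        return None
    for entry in key_value_list:
        if isinstance(entry, dict) and entry.get("name") == name:
            return entry.get("value")
    return None

def extract_file_data(metadata):
    streams = get_value_by_name(metadata, "streams")
    if not streams:
        audio = get_value_by_name(metadata, "audio")
        streams = get_value_by_name(audio, "streams")
    # Explicit guards replace the IndexError/TypeError handling above.
    if not isinstance(streams, list) or not streams or not isinstance(streams[0], dict):
        return []
    streams_data = streams[0].get("value")
    return get_value_by_name(streams_data, "header") or streams_data or []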
@stacimc Is this PR still in progress? Is there anything you need to be able to continue the work on this?
This is in draft pending the completion of the v1.3.0 milestone - once that's complete this can be rebased and undrafted!
Changing this to high priority as it blocks our ability to turn on reingestion workflows.
Added a few more thoughts!
def get_response_json(self, query_params):
    """
    Overrides the parent function to make multiple requests until
    we see "batchcomplete", rather than a single request to the
    endpoint. This ensures that global usage data used for calculating
    popularity is tabulated correctly.
I wonder, should this be an override of `get_batch` instead? The logic within looks similar at least. I'm also thinking we might be able to use `super().get_response_json` within this function instead of accessing `self.delayed_requester` to make the request - that might be more robust to future changes.
I went back and forth on this in the initial implementation and decided to override `get_response_json` because I thought it was more likely for `get_batch` to have future changes (for example, we're looking at addressing error handling there right now). Good point about using `super().get_response_json` here! Done ✔️
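To make the shape of that change concrete, a rough sketch of the override after switching to `super().get_response_json` might look like the following (hedged: `merge_response_jsons` is a stand-in name for the merging step, not necessarily the exact helper in the PR):

def get_response_json(self, query_params, **kwargs):
    batch_json = None
    while True:
        response_json = super().get_response_json(query_params, **kwargs)
        if response_json is None:
            break
        if batch_json is None:
            batch_json = response_json
        else:
            # Fold this page's data into the accumulated response.
            self.merge_response_jsons(batch_json, response_json)
        if "batchcomplete" in response_json:
            break  # the API signals the batch is complete
        # The MediaWiki API returns a "continue" block with the
        # parameters needed for the next request in the batch.
        query_params = {**query_params, **response_json.get("continue", {})}
    return batch_json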
Sounds great, and I love the changes you made there 💯
creator, creator_url = self.extract_creator_info(media_info)
title = self.extract_title(media_info)
filesize = media_info.get("size", 0)  # in bytes
filetype = self.extract_file_type(media_info)
meta_data = self.create_meta_data_dict(record)
Just a note, it seems like we don't have a standardization for these kinds of function names, in that we have `get_*`, `_get_*`, `extract_*`, and `_extract_*` as prefixes for denoting pulling information out of a dictionary. This doesn't need to change here, but maybe we should pick one and stick with it? I like `extract` for these cases, since it seems like we have other utility functions that are `get`-prefixed.
Could you open an issue for this, if you haven't already, @AetherUnbound? It's nice to track it somewhere.
Won't this be partially dependent on the class? `extract` and `get` have slightly different connotations to me. In this class `extract` felt useful because many of the utilities were a little more complex than simply pulling data out of a response.
If we really want to aim for consistency with/without `_` prefixing on the subclasses then I imagine we'd want a style guide or some sort of linting 🤔
This looks great!
@@ -320,13 +320,13 @@ def get_batch(self, query_params: Dict) -> Tuple[Optional[List], bool]:

         return batch, should_continue

-    def get_response_json(self, query_params: Dict):
+    def get_response_json(self, query_params: Dict, **kwargs):
Ooo, this is smart! Much more robust than my approach in #701
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
The ProviderDataIngesters need to include `media_type` in record data so that the ingester class knows which MediaStore to add the record to. To prevent errors when calling `add_item` with the unexpected param, we accept optional kwargs. These are discarded before the record is added to a TSV.
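A toy, self-contained illustration of the pass-through the commit message describes (hypothetical field names, simplified signature):

def add_item(title=None, license_info=None, **kwargs):
    # Unexpected keys like "media_type" are swallowed by **kwargs and
    # ignored here; they were only needed upstream to pick the store.
    return {"title": title, "license_info": license_info}

record = {"title": "Example", "license_info": "by-sa 4.0", "media_type": "audio"}
row = add_item(**record)  # no TypeError despite the extra "media_type" key
print(row)  # {'title': 'Example', 'license_info': 'by-sa 4.0'}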
This looks like a very solid refactor! I ran the DAG and from a brief view, I have only one question. I'm seeing many titles including HTML tags, stuff like `<div class='fn'> Silver penny of Henry I</div>`. I haven't run this DAG before, so I don't know if it was a bug before the refactor or a change in the Wikimedia API or what, though I definitely haven't seen those tags on the frontend.
Do we want to address that bug here?
@krysal Good catch! I just confirmed that the issue appears on `main` as well.
Great! Thanks for confirming @stacimc. LGTM!
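Though out of scope for this PR, a minimal standard-library sketch of how those stray tags could be stripped if a follow-up tackles it (illustrative only, not part of this change):

import html
import re

def strip_html_tags(title):
    # Drop tags like <div class='fn'>…</div> and unescape any entities.
    return html.unescape(re.sub(r"<[^>]+>", "", title)).strip()

print(strip_html_tags("<div class='fn'> Silver penny of Henry I</div>"))
# Silver penny of Henry I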
Fixes
Fixes WordPress/openverse#1509 by @stacimc
Description
Refactors Wikimedia Commons to use the ProviderDataIngester base class
Notes
For Wikimedia Commons, the `batch_limit` applies to the number of `image_pages` to process, which is not necessarily a 1:1 relationship with records ingested. The result is that using a global `ingestion_limit` still works to limit the processing time, but you may see slightly more records ingested, as the toy example below illustrates.
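A toy illustration of the overshoot (hypothetical numbers): each `image_page` can yield several records, so the stop check can only fire after a page has already been processed in full.

ingestion_limit = 5
record_count = 0
for records_in_page in [3, 3, 3]:  # records yielded per image_page
    record_count += records_in_page
    if record_count >= ingestion_limit:
        break
print(record_count)  # 6 — slightly more than the limit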
Testing Instructions
Run the Wikimedia Commons DAG. Also try setting an `ingestion_limit`.