This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Wikimedia Commons to use ProviderDataIngester #614

Merged · 13 commits · Sep 6, 2022

Conversation

@stacimc (Contributor) commented Jul 20, 2022

Fixes

Fixes WordPress/openverse#1509 by @stacimc

Description

Refactors Wikimedia Commons to use the ProviderDataIngester base class

Notes

For Wikimedia Commons, the batch_limit applies to the number of image_pages to process, which does not map 1:1 to records ingested. As a result, a global ingestion_limit still bounds processing time, but you may see slightly more records ingested than the limit.
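A minimal sketch of why the overshoot happens (all names below are illustrative, not the actual ProviderDataIngester internals): the limit is only re-checked between batches, and each image page in a batch can yield more than one record.

# Illustrative sketch only: these names are hypothetical, not the
# real ProviderDataIngester internals.
def get_next_batch():
    # Pretend each batch contains two image pages.
    return ["page_a", "page_b"]

def process_page(page):
    # Pretend each image page yields three records.
    return 3

ingestion_limit = 100
record_count = 0

while record_count < ingestion_limit:
    # The limit is only checked here, between batches, so the final
    # count can land slightly above ingestion_limit.
    for page in get_next_batch():
        record_count += process_page(page)

print(record_count)  # 102: slightly over the limit of 100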

Testing Instructions

Run the Wikimedia Commons DAG. Also try setting an ingestion_limit.
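If you want to exercise the ingestion_limit locally and it is configured as an Airflow Variable, something like the following should work before triggering the DAG. The variable name is an assumption here; check the ingester's configuration for the exact key.

# Assumption: the ingester reads its global limit from an Airflow
# Variable; the key name "INGESTION_LIMIT" is a guess.
from airflow.models import Variable

Variable.set("INGESTION_LIMIT", 100)  # cap the test run at roughly 100 records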

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc added the 🟩 priority: low, ✨ goal: improvement, and 💻 aspect: code labels on Jul 20, 2022
@stacimc self-assigned this on Jul 20, 2022
@AetherUnbound (Contributor) left a comment:

Whew, this was a big one! 😄

Comment on lines 225 to 236
streams = self.get_value_by_name(metadata, "streams")
try:
    if not streams:
        audio = self.get_value_by_name(metadata, "audio")
        streams = self.get_value_by_name(audio, "streams")
    streams_data = [stream["value"] for stream in streams][0]
    file_data = self.get_value_by_name(streams_data, "header")
    if not file_data:
        file_data = streams_data
except (IndexError, TypeError):
    file_data = []
return file_data
Contributor:

I know this is unaltered from the OG script, but could we alter this so that it checks for keys/indices rather than a try/except block?

Contributor Author:

This bit of logic is actually somewhat fiddly because of the structure of the data we get back from the API, which involves lists of dicts with "name" and "value" keys (example file). I pushed a refactor to remove the try/except safely, but I can see why it was added 😄
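For illustration, a key-checking version of the snippet above might look roughly like the sketch below. It shows the idea rather than the exact refactor that was pushed, and it assumes get_value_by_name returns None when the named entry is missing.

# Sketch only: not necessarily the exact refactor in this PR.
# Assumes get_value_by_name returns None for a missing name.
def extract_file_data(self, metadata):
    streams = self.get_value_by_name(metadata, "streams")
    if not streams:
        audio = self.get_value_by_name(metadata, "audio")
        streams = self.get_value_by_name(audio, "streams") if audio else None

    # The API returns lists of {"name": ..., "value": ...} dicts, so
    # guard each access instead of relying on try/except.
    if not streams or not isinstance(streams[0], dict):
        return []
    streams_data = streams[0].get("value")
    if not streams_data:
        return []

    file_data = self.get_value_by_name(streams_data, "header")
    return file_data or streams_data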

@stacimc force-pushed the refactor/wikimedia-commons-to-use-base-class branch from 89f02de to abe67d5 on July 25, 2022 23:59
@sarayourfriend (Contributor) commented:

@stacimc Is this PR still in progress? Is there anything you need to be able to continue the work on this?

@AetherUnbound (Contributor) commented:

This is in draft pending the completion of the v1.3.0 milestone - once that's complete this can be rebased and undrafted!

@stacimc force-pushed the refactor/wikimedia-commons-to-use-base-class branch from bbecded to 1281943 on August 23, 2022 18:02
@stacimc marked this pull request as ready for review on August 23, 2022 21:04
@stacimc requested a review from a team as a code owner on August 23, 2022 21:04
@stacimc added the 🟧 priority: high label and removed the 🟩 priority: low label on Aug 24, 2022
@stacimc (Contributor, Author) commented Aug 24, 2022:

Changing this to high priority as it blocks our ability to turn on reingestion workflows.

@AetherUnbound (Contributor) left a comment:

Added a few more thoughts!

Comment on lines +86 to +90
def get_response_json(self, query_params):
    """
    Overrides the parent function to make multiple requests until
    we see "batchcomplete", rather than a single request to the
    endpoint. This ensures that global usage data used for calculating
    popularity is tabulated correctly.
Contributor:

I wonder, should this be an override of get_batch instead? The logic within looks similar, at least. I'm also thinking we might be able to use super().get_response_json within this function instead of accessing self.delayed_requester to make the request - that might be more robust to future changes.

Contributor Author:

I went back and forth on this in the initial implementation and decided to override get_response_json because I thought it was more likely for get_batch to have future changes (for example we're looking at addressing error handling there right now). Good point about using super().get_response_json here! Done ✔️

Contributor:

Sounds great, and I love the changes you made there 💯
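Putting the thread together, the override under discussion might look roughly like the sketch below; the merge helper and the continuation handling are simplified assumptions, not the PR's exact code.

# Rough sketch of the override discussed above. merge_response_jsons
# and the bound on partial requests are assumptions.
def get_response_json(self, query_params):
    # Keep requesting until the API reports "batchcomplete", merging
    # partial responses so the global usage data used for popularity
    # is tabulated correctly.
    batch_json = None
    for _ in range(100):  # safety bound on the number of partial requests
        response_json = super().get_response_json(query_params)
        if response_json is None:
            break
        if batch_json is None:
            batch_json = response_json
        else:
            self.merge_response_jsons(batch_json, response_json)  # assumed helper
        if "batchcomplete" in response_json:
            break
        # The API's "continue" block carries the params for the next slice.
        query_params = {**query_params, **response_json.get("continue", {})}
    return batch_json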

creator, creator_url = self.extract_creator_info(media_info)
title = self.extract_title(media_info)
filesize = media_info.get("size", 0) # in bytes
filetype = self.extract_file_type(media_info)
meta_data = self.create_meta_data_dict(record)
Contributor:

Just a note: it seems we don't have a standard convention for these kinds of function names, in that we use get_*, _get_*, extract_*, and _extract_* as prefixes for pulling information out of a dictionary. This doesn't need to change here, but maybe we should pick one convention and stick with it? I like extract for these cases, since it seems like we have other utility functions that are get-prefixed.

Contributor:

Could you open an issue for this, if you haven't already, @AetherUnbound ? It's nice to track it somewhere.

Contributor Author:

Won't this be partially dependent on the class? extract and get have slightly different connotations to me. In this class extract felt useful because many of the utilities were a little more complex than simply pulling data out of a response.

If we really want to aim for consistency with/without _ prefixing on the subclasses then I imagine we'd want a style guide or some sort of linting 🤔

@AetherUnbound (Contributor) left a comment:

This looks great!

@@ -320,13 +320,13 @@ def get_batch(self, query_params: Dict) -> Tuple[Optional[List], bool]:

         return batch, should_continue

-    def get_response_json(self, query_params: Dict):
+    def get_response_json(self, query_params: Dict, **kwargs):
Contributor:

Ooo, this is smart! Much more robust than my approach in #701

@openverse-bot (Contributor) commented:

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

[1] Specifically, Saturday and Sunday.

[2] For the purpose of these reminders, we treat Monday through Friday as weekdays. Please note that the job that generates these reminders runs at midnight UTC on Monday through Friday, so depending on your timezone you may be pinged outside of the expected range.

The ProviderDataIngesters need to include `media_type` in the record data so that the ingester class knows which MediaStore to add the record to. To prevent errors when calling add_item with the unexpected param, we accept optional kwargs; these are discarded before the record is added to a TSV.
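A minimal sketch of the pattern that commit message describes, with illustrative names: add_item tolerates extra keyword arguments such as media_type (used upstream to pick the right MediaStore) and filters them out before the record is written.

# Minimal sketch of the pattern described above; names are illustrative,
# not the real MediaStore implementation.
class MediaStore:
    EXPECTED_FIELDS = {"foreign_identifier", "foreign_landing_url", "title"}

    def add_item(self, **kwargs):
        # Accept unexpected params like `media_type` without raising,
        # then discard them so they never reach the TSV.
        record = {k: v for k, v in kwargs.items() if k in self.EXPECTED_FIELDS}
        self._write_to_tsv(record)

    def _write_to_tsv(self, record):
        # Stand-in for the real TSV writer.
        print("\t".join(str(v) for v in record.values()))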
@stacimc force-pushed the refactor/wikimedia-commons-to-use-base-class branch from 88a621b to 6a72b97 on September 2, 2022 20:07
@krysal (Member) left a comment:

This looks like a very solid refactor! I ran the DAG and, from a brief look, I have only one question: I'm seeing many titles that include HTML tags, e.g. <div class='fn'> Silver penny of Henry I</div>. I haven't run this DAG before, so I don't know whether it was a bug before the refactor or a change in the Wikimedia API, though I definitely haven't seen those tags on the frontend.

Do we want to address that bug here?

@stacimc (Contributor, Author) commented Sep 6, 2022:

@krysal Good catch! I just confirmed that the issue appears on main, so it's not a regression. I opened a new issue for this in WordPress/openverse#1441.

@krysal (Member) left a comment:

Great! Thanks for confirming @stacimc. LGTM!

Labels: 💻 aspect: code, ✨ goal: improvement, 🟧 priority: high
Projects
None yet

Development
Successfully merging this pull request may close these issues: Refactor Wikimedia Commons to use ProviderDataIngester

6 participants