This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Wikimedia Commons to use ProviderDataIngester #614

Merged · 13 commits · Sep 6, 2022

Conversation

@stacimc (Contributor) commented Jul 20, 2022

Fixes

Fixes WordPress/openverse#1509 by @stacimc

Description

Refactors Wikimedia Commons to use the ProviderDataIngester base class

Notes

For Wikimedia Commons, the batch_limit applies to the number of image_pages to process, which does not map 1:1 to records ingested. As a result, a global ingestion_limit still bounds processing time, but you may see slightly more records ingested than the limit.
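A minimal sketch of why the overshoot happens (all names below are illustrative, not the actual ProviderDataIngester internals): the limit is only re-checked between batches, and each image page in a batch can yield more than one record.

# Illustrative sketch only: these names are hypothetical, not the
# real ProviderDataIngester internals.
def get_next_batch():
    # Pretend each batch contains two image pages.
    return ["page_a", "page_b"]

def process_page(page):
    # Pretend each image page yields three records.
    return 3

ingestion_limit = 100
record_count = 0

while record_count < ingestion_limit:
    # The limit is only checked here, between batches, so the final
    # count can land slightly above ingestion_limit.
    for page in get_next_batch():
        record_count += process_page(page)

print(record_count)  # 102: slightly over the limit of 100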

Testing Instructions

Run the Wikimedia Commons DAG. Also try setting an ingestion_limit.
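If you want to exercise the ingestion_limit locally and it is configured as an Airflow Variable, something like the following should work before triggering the DAG. The variable name is an assumption here; check the ingester's configuration for the exact key.

# Assumption: the ingester reads its global limit from an Airflow
# Variable; the key name "INGESTION_LIMIT" is a guess.
from airflow.models import Variable

Variable.set("INGESTION_LIMIT", 100)  # cap the test run at roughly 100 records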

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc added the 🟩 priority: low, ✨ goal: improvement, and 💻 aspect: code labels on Jul 20, 2022
@stacimc self-assigned this on Jul 20, 2022
@AetherUnbound (Contributor) left a comment:

Whew, this was a big one! 😄

Comment on lines 225 to 236
streams = self.get_value_by_name(metadata, "streams")
try:
    if not streams:
        audio = self.get_value_by_name(metadata, "audio")
        streams = self.get_value_by_name(audio, "streams")
    streams_data = [stream["value"] for stream in streams][0]
    file_data = self.get_value_by_name(streams_data, "header")
    if not file_data:
        file_data = streams_data
except (IndexError, TypeError):
    file_data = []
return file_data
Contributor:

I know this is unaltered from the OG script, but could we alter this so that it checks for keys/indices rather than a try/except block?

Contributor Author:

This bit of logic is actually somewhat fiddly because of the structure of the data we get back from the API, which involves lists of dicts with "name" and "value" keys (example file). I pushed a refactor to remove the try/except safely, but I can see why it was added 😄
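For illustration, a key-checking version of the snippet above might look roughly like the sketch below. It shows the idea rather than the exact refactor that was pushed, and it assumes get_value_by_name returns None when the named entry is missing.

# Sketch only: not necessarily the exact refactor in this PR.
# Assumes get_value_by_name returns None for a missing name.
def extract_file_data(self, metadata):
    streams = self.get_value_by_name(metadata, "streams")
    if not streams:
        audio = self.get_value_by_name(metadata, "audio")
        streams = self.get_value_by_name(audio, "streams") if audio else None

    # The API returns lists of {"name": ..., "value": ...} dicts, so
    # guard each access instead of relying on try/except.
    if not streams or not isinstance(streams[0], dict):
        return []
    streams_data = streams[0].get("value")
    if not streams_data:
        return []

    file_data = self.get_value_by_name(streams_data, "header")
    return file_data or streams_data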

@stacimc force-pushed the refactor/wikimedia-commons-to-use-base-class branch from 89f02de to abe67d5 on July 25, 2022 23:59
@sarayourfriend (Contributor) commented:

@stacimc Is this PR still in progress? Is there anything you need to be able to continue the work on this?

@AetherUnbound (Contributor) commented:

This is in draft pending the completion of the v1.3.0 milestone - once that's complete this can be rebased and undrafted!

@stacimc force-pushed the refactor/wikimedia-commons-to-use-base-class branch from bbecded to 1281943 on August 23, 2022 18:02
@stacimc marked this pull request as ready for review on August 23, 2022 21:04
@stacimc requested a review from a team as a code owner on August 23, 2022 21:04
@stacimc added the 🟧 priority: high label and removed the 🟩 priority: low label on Aug 24, 2022
@stacimc (Contributor, Author) commented Aug 24, 2022:

Changing this to high priority as it blocks our ability to turn on reingestion workflows.

@AetherUnbound (Contributor) left a comment:

Added a few more thoughts!

Comment on lines +86 to +90
def get_response_json(self, query_params):
    """
    Overrides the parent function to make multiple requests until
    we see "batchcomplete", rather than a single request to the
    endpoint. This ensures that global usage data used for calculating
    popularity is tabulated correctly.
Contributor:

I wonder, should this be an override of get_batch instead? The logic within looks similar, at least. I'm also thinking we might be able to use super().get_response_json within this function instead of accessing self.delayed_requester to make the request - that might be more robust to future changes.

Contributor Author:

I went back and forth on this in the initial implementation and decided to override get_response_json because I thought it was more likely for get_batch to have future changes (for example we're looking at addressing error handling there right now). Good point about using super().get_response_json here! Done ✔️

Contributor:

Sounds great, and I love the changes you made there 💯
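Putting the thread together, the override under discussion might look roughly like the sketch below; the merge helper and the continuation handling are simplified assumptions, not the PR's exact code.

# Rough sketch of the override discussed above. merge_response_jsons
# and the bound on partial requests are assumptions.
def get_response_json(self, query_params):
    # Keep requesting until the API reports "batchcomplete", merging
    # partial responses so the global usage data used for popularity
    # is tabulated correctly.
    batch_json = None
    for _ in range(100):  # safety bound on the number of partial requests
        response_json = super().get_response_json(query_params)
        if response_json is None:
            break
        if batch_json is None:
            batch_json = response_json
        else:
            self.merge_response_jsons(batch_json, response_json)  # assumed helper
        if "batchcomplete" in response_json:
            break
        # The API's "continue" block carries the params for the next slice.
        query_params = {**query_params, **response_json.get("continue", {})}
    return batch_json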

creator, creator_url = self.extract_creator_info(media_info)
title = self.extract_title(media_info)
filesize = media_info.get("size", 0) # in bytes
filetype = self.extract_file_type(media_info)
meta_data = self.create_meta_data_dict(record)
Contributor:

Just a note: it seems we don't have a standard convention for these kinds of function names, in that we use get_*, _get_*, extract_*, and _extract_* as prefixes for pulling information out of a dictionary. This doesn't need to change here, but maybe we should pick one convention and stick with it? I like extract for these cases, since it seems like we have other utility functions that are get-prefixed.

Contributor:

Could you open an issue for this, if you haven't already, @AetherUnbound ? It's nice to track it somewhere.

Contributor Author:

Won't this be partially dependent on the class? extract and get have slightly different connotations to me. In this class extract felt useful because many of the utilities were a little more complex than simply pulling data out of a response.

If we really want to aim for consistency with/without _ prefixing on the subclasses then I imagine we'd want a style guide or some sort of linting 🤔

@AetherUnbound (Contributor) left a comment:

This looks great!

@@ -320,13 +320,13 @@ def get_batch(self, query_params: Dict) -> Tuple[Optional[List], bool]:

         return batch, should_continue

-    def get_response_json(self, query_params: Dict):
+    def get_response_json(self, query_params: Dict, **kwargs):
Contributor:

Ooo, this is smart! Much more robust than my approach in #701

@openverse-bot (Contributor) commented:

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

[1] Specifically, Saturday and Sunday.

[2] For the purpose of these reminders, we treat Monday through Friday as weekdays. Please note that the job that generates these reminders runs at midnight UTC on Monday through Friday, so depending on your timezone you may be pinged outside of the expected range.

The ProviderDataIngesters need to include `media_type` in the record data so that the ingester class knows which MediaStore to add the record to. To prevent errors when calling add_item with the unexpected param, we accept optional kwargs; these are discarded before the record is added to a TSV.
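A minimal sketch of the pattern that commit message describes, with illustrative names: add_item tolerates extra keyword arguments such as media_type (used upstream to pick the right MediaStore) and filters them out before the record is written.

# Minimal sketch of the pattern described above; names are illustrative,
# not the real MediaStore implementation.
class MediaStore:
    EXPECTED_FIELDS = {"foreign_identifier", "foreign_landing_url", "title"}

    def add_item(self, **kwargs):
        # Accept unexpected params like `media_type` without raising,
        # then discard them so they never reach the TSV.
        record = {k: v for k, v in kwargs.items() if k in self.EXPECTED_FIELDS}
        self._write_to_tsv(record)

    def _write_to_tsv(self, record):
        # Stand-in for the real TSV writer.
        print("\t".join(str(v) for v in record.values()))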
@stacimc force-pushed the refactor/wikimedia-commons-to-use-base-class branch from 88a621b to 6a72b97 on September 2, 2022 20:07
@krysal (Member) left a comment:

This looks like a very solid refactor! I ran the DAG and, from a brief look, I have only one question: I'm seeing many titles that include HTML tags, e.g. <div class='fn'> Silver penny of Henry I</div>. I haven't run this DAG before, so I don't know whether it was a bug before the refactor or a change in the Wikimedia API, though I definitely haven't seen those tags on the frontend.

Do we want to address that bug here?

@stacimc (Contributor, Author) commented Sep 6, 2022:

@krysal Good catch! I just confirmed that the issue appears on main, so it's not a regression. I opened a new issue for this in WordPress/openverse#1441.

@krysal (Member) left a comment:

Great! Thanks for confirming @stacimc. LGTM!

Labels: 💻 aspect: code, ✨ goal: improvement, 🟧 priority: high
Projects
None yet

Development
Successfully merging this pull request may close these issues: Refactor Wikimedia Commons to use ProviderDataIngester

6 participants