Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Update PhyloPic DAG to use API v2 #1060

Merged
merged 17 commits into from
Mar 30, 2023
Merged

Update PhyloPic DAG to use API v2 #1060

merged 17 commits into from
Mar 30, 2023

Conversation

krysal
Copy link
Member

@krysal krysal commented Mar 24, 2023

Fixes

Fixes WordPress/openverse#902 🚨
Fixes WordPress/openverse#1288

Description

This PR updates PhyloPic to use v2 of the API, and with this, it's now converted to a non-dated DAG since v2 doesn't support a filter by modified date yet. The good news is that getting the fields is much easier and should run faster since we can get more items per request.

@krysal krysal added 🟥 priority: critical Must be addressed ASAP 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Mar 24, 2023
@AetherUnbound
Copy link
Contributor

Thanks for jumping on this @krysal!! We actually got a response back from Mike Keesey in the upstream ticket I filed, looks like we might have some new endpoints available: keesey/phylopic#5 (comment)

I'll be responding to their comment today 😄

@krysal krysal marked this pull request as ready for review March 24, 2023 20:56
@krysal krysal requested a review from a team as a code owner March 24, 2023 20:56
@krysal krysal requested review from obulat and stacimc March 24, 2023 20:56
@krysal
Copy link
Member Author

krysal commented Mar 24, 2023

I read it! While that gets shipped, this update will enable us to reactivate PhyloPic after a run :)

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick fix of a ❗ critical issue, @krysal!

I will approve after the creator_url and last page are fixed.

if response_json and response_json.get("success") is True:
return response_json.get("result")
def get_should_continue(self, response_json):
return self.current_page < self.total_pages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return self.current_page < self.total_pages
return self.current_page <= self.total_pages

This DAG ran really well until the last page, when the last requested page returned 404:

[2023-03-27, 15:45:24 +03] {requester.py:78} WARNING - Unable to request URL: https://api.phylopic.org/images?build=196&page=146&embed_items=true  Status code: 404
[2023-03-27, 15:45:24 +03] {requester.py:148} WARNING - Bad response_json:  None
[2023-03-27, 15:45:24 +03] {requester.py:149} WARNING - Retrying:
_get_response_json(
    https://api.phylopic.org/images,
    {'build': 196, 'page': 146, 'embed_items': 'true'},
    retries=-1)
[2023-03-27, 15:45:24 +03] {requester.py:134} ERROR - No retries remaining.  Failure.
[2023-03-27, 15:45:24 +03] {media.py:230} INFO - Writing 74 lines from buffer to disk.
[2023-03-27, 15:45:24 +03] {provider_data_ingester.py:501} INFO - Committed 6974 records
[2023-03-27, 15:45:24 +03] {taskinstance.py:1768} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: Retries exceeded
query_params: {"build": 196, "page": 146, "embed_items": "true"}

The initial_request.json(tests/dags/providers/provider_api_scripts/resources/phylopic/initial_request.json) seems to suggest that we should use <= or even == here, because even thought the totalPages is 145, the lastPage is 144.

  "lastPage": {
    "href": "/images?build=194&page=144"
  },
  "totalPages": 145

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I noticed it's trying to request the page 146, which is actually the page 147 because it starts at 0, from a total of 146, so this requires changing the initial page to 1 then we pass current_page - 1 as the query parameter and the condition < is maintained, so we don't try to get a page that doesn't exists and avoid the DAG fails.

"title": title,
"meta_data": meta_data,
# TODO: Evaluate whether to include upstream thumbnails.
# Sizes available: 192x192, 128x128, 64x64.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the other images we get in the sample_record.json: while the SVGs are all crisp on any resolution, the raster image of an acceptable size (https://images.phylopic.org/images/5b1e88b5-159d-495d-b8cb-04f9e28d2f02/raster/512x480.png) looks much worse.

I wonder if we should handle SVGs differently: they can be basically any size, their width and height don't mean much beyond the aspect ratio (which of course is very important for the frontend rendering), and their thumbnails will probably be larger in size and worse in quality than the original image. Not for this PR specifically, of course.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the raster images are an option but look really blurry, not sure if it's worth including those for thumbnails.

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@AetherUnbound
Copy link
Contributor

@krysal it seems the old parameters have been re-added to the new API 😮 keesey/phylopic#5. How would you like to proceed? I'm happy to review this if you'd like to get it in and follow up with making this a dated DAG once more, or we could pivot to just updating the dated portions of the existing DAG.

@krysal
Copy link
Member Author

krysal commented Mar 27, 2023

@AetherUnbound It's great the filters are available already. I'd like to merge this so we can run the DAG to fix all the PhyloPic data and after that we can improve making the DAG dated again along with the other TODO notes I left.

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for jumping on this so fast! Code-wise, I just have a few non-blocking comments.

On the data front though, it looks like this will give us an entirely new set of Phylopic data (rather than updating existing values). I was able to test this by:

  1. Download production phylopic data: \copy (SELECT * FROM image WHERE provider='phylopic') to '/tmp/phylopic_v1_prod.csv' DELIMITER ',' CSV HEADER. This returned 4,156 rows
  2. Start the catalog stack (I had to rebuild the postgres image with a newer version of pgcli, PR coming out for that in a little bit)
  3. Copy the file into the postgres image: docker cp /tmp/phylopic_v1_prod.csv openverse-catalog-postgres-1:/tmp/phylopic_v1_prod.csv
  4. Load the file into the table using j db-shell: \copy image from '/tmp/phylopic_v1_prod.csv' WITH (FORMAT CSV, HEADER)
  5. Run the full DAG, which processed 6,974 rows.
  6. Ran a count using j db-shell:
openledger> select count(*) from image;
+-------+
| count |
|-------|
| 10521 |
+-------+
SELECT 1
Time: 0.040s

All that to say that we will be losing the old data, which I think is fine because it's not working anyway 😅 We could explicitly delete the existing Phylopic data from the catalog prior to running this DAG though, that might be the best bet to ensure we don't have records hanging around that can't be accessed.

"""Get the required `build` param from the API and set the total pages."""
resp = self.get_response_json(query_params={})
if not resp:
raise Exception("No response from Phylopic API.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this might be a more specific exception?

Suggested change
raise Exception("No response from Phylopic API.")
raise ValueError("No response from Phylopic API.")

@AetherUnbound AetherUnbound mentioned this pull request Mar 27, 2023
8 tasks
@krysal
Copy link
Member Author

krysal commented Mar 27, 2023

Good find! Unfortunately, it seems some olders rows have foreign_identifier in the form of the direct URL that no longer works :/ but seems it's possible to save them extracting the UUID from the URL.

-- Sample `foreign_identifier` from PhyloPic image
http://phylopic.org/assets/images/submissions/68e12ccc-dbab-4bd9-98c8-6bf33a85b080.1024.png

I think that will require an extra DAG to run a one time pass through PhyloPic to fix those values. What do you think? I can work on that, so we don't start PhyloPic DAG until that is done.

@AetherUnbound
Copy link
Contributor

Honestly, it's only 4k records 😅 I think it would be okay to move them all, especially since they were changed on the upstream site (and since this precludes our ability to have Phylopic displaying on the API I think). But I'm fine either way!

@krysal krysal merged commit 10857e3 into main Mar 30, 2023
@krysal krysal deleted the phylopic_v2 branch March 30, 2023 17:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟥 priority: critical Must be addressed ASAP 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Phylopic to use v2 API Phylopic images are broken
3 participants