Retrieve Auckland Museum Image Data #3258
Conversation
This is awesome work, @ngken0995!

To add the DAG that would run your script at specific intervals, you'll also need to add it to `catalog/dags/providers/provider_workflows.py`. Then, if you run `just up` and go to 0.0.0.0:9090 (log in with `airflow`/`airflow`), you can see all of the current DAGs, and you can start the Auckland Museum DAG by clicking on the green triangle. Then you can view the logs from the DAG, and you can "mark" the `pull_data` step "successful" after several minutes. The DAG will then go on to save the data to a Postgres database on your local machine!
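A rough sketch of what that registration might look like, assuming hypothetical field names (the actual `ProviderWorkflow` fields in `provider_workflows.py` may differ):

```python
# In catalog/dags/providers/provider_workflows.py; ProviderWorkflow is
# defined in this same module. The ingester class name and the fields
# shown here are assumptions for illustration.
from datetime import datetime

from providers.provider_api_scripts.auckland_museum import (
    AucklandMuseumDataIngester,
)

PROVIDER_WORKFLOWS = [
    # ... existing provider workflows ...
    ProviderWorkflow(
        ingester_class=AucklandMuseumDataIngester,
        start_date=datetime(2023, 9, 1),  # keep in the past so runs get scheduled
    ),
]
```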
`get_record_data` must return all of the required pieces of data; otherwise, it must return `None`.

You can see the pieces that are required, and the other pieces you can return, in the `add_item` method of the `ImageStore`:
`def add_item(`
Currently, when I ran the DAG, I got an error saying `TypeError: ImageStore.add_item() missing 2 required positional arguments: 'foreign_landing_url' and 'foreign_identifier'`. So, you need to add `foreign_landing_url` (the page for the media item on the provider website) and the `foreign_identifier`. For the following item (the first ingested item in the script), I think the `foreign_identifier` would be `7109b24c87dbc582327584848d3ee481b2bf5c6e`.
```python
{'creator': 'Auckland War Memorial Museum',
 'filesize': 167,
 'license_info': LicenseInfo(license='by', version='4.0', url='https://creativecommons.org/licenses/by/4.0/', raw_url='https://creativecommons.org/licenses/by/4.0/'),
 'meta_data': {'department': 'ephemera',
               'geopos': '',
               'type': 'ecrm:E84_Information_Carrier'},
 'thumbnail_url': 'http://api.aucklandmuseum.com/id/media/p/7109b24c87dbc582327584848d3ee481b2bf5c6e?rendering=thumbnail.jpg',
 'title': 'New Zealand Contemporary Furniture',
 'url': 'http://api.aucklandmuseum.com/id/media/p/7109b24c87dbc582327584848d3ee481b2bf5c6e'}
```
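A minimal sketch of that guard, assuming hypothetical response field names; only the rule itself (return `None` when any required piece is missing) comes from this review:

```python
from common.licenses import get_license_info  # assumed helper; adjust to the script


def get_record_data(self, data: dict) -> dict | None:
    # "id", "landingPage", and "primaryRepresentation" are illustrative
    # assumptions about the shape of the API response.
    foreign_identifier = data.get("id")
    foreign_landing_url = data.get("landingPage")
    url = data.get("primaryRepresentation")
    license_info = get_license_info(data.get("copyright"))
    # Every required piece must be present; otherwise skip this record.
    if not all([foreign_identifier, foreign_landing_url, url, license_info]):
        return None
    return {
        "foreign_identifier": foreign_identifier,
        "foreign_landing_url": foreign_landing_url,
        "url": url,
        "license_info": license_info,
        "thumbnail_url": f"{url}?rendering=thumbnail.jpg",
        "title": data.get("title"),
    }
```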
@obulat I was able to find the …
Fantastic @ngken0995, I'm excited to see you jumping in to add a new provider! I suspect you might be having trouble running the DAG locally because of the `start_date` being in the future (see my comment below). Once you update the `start_date`, you'll be able to run it locally and actually get data.

Once you've ingested some data locally, you can see what it looks like by querying your local catalog database. Run `just catalog/pgcli` to open pgcli in your terminal, and then you can run SQL queries (e.g. `select * from image where provider='aucklandmuseum' limit 10;`).
I was not initially able to ingest any data, because the DAG fails on every image when trying to fetch the file size with a 301. I commented this part out for the sake of testing the rest of the code.
On a higher-level note: the `100,000` number comes from the rate limit and max response size, right? Meaning there's actually more data than we can fetch in a day? My concern is that when the DAG is run a second time, it'll start processing from the beginning all over again. As it currently stands, I don't think we'll ever be able to ingest more than those first 100k rows.

If the API supports date range queries, we could consider making this a dated DAG instead. Otherwise, we might need to be a bit creative. The absolute simplest, somewhat silly solution I can think of is to give it a huge timeout and greatly extend the `delay` between requests such that the DAG runs over a week or so, only fetching 100k a day in order to respect the rate limit. I'm not sure what their total dataset size is, so not sure if that's feasible!
```python
# copyright:CC state Creative Commons Attribution 4.0
return {
    "q": "_exists_:primaryRepresentation+copyright:CC",
    "size": "100",
```
Since this is the default `batch_limit` from the parent class, we can use `self.batch_limit` here (and in the increment in the `else` statement).
The max amount of data retrieved from the API is `10,000`; look at `hits -> total -> value` in the API response. `Size` is the amount of data to return in a GET request, and `From` is the index into the total results. We can keep incrementing `From` until it reaches `10,000`, and the `get_should_continue` function should know when it reaches the limit.
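A hedged sketch of that stopping rule, assuming the ingester tracks its offset in a hypothetical `self.from_` attribute:

```python
def get_should_continue(self, response_json: dict) -> bool:
    # The reported total lives at hits -> total -> value, but the search API
    # will not page past 10,000 results, so stop at whichever comes first.
    total = response_json.get("hits", {}).get("total", {}).get("value", 0)
    return self.from_ + self.batch_limit < min(total, 10_000)
```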
I meant that by default `self.batch_limit` is 100, and that you can just say:

```python
"size": self.batch_limit,
```

rather than hard-coding it separately here.
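Expanded into a sketch of `get_next_query_params`, with `self.batch_limit` used in both places (the `self.from_` bookkeeping attribute is an assumption):

```python
def get_next_query_params(self, prev_query_params: dict | None) -> dict:
    if not prev_query_params:
        self.from_ = 0
    else:
        # Advance the offset by the batch size rather than a hard-coded 100.
        self.from_ += self.batch_limit
    return {
        "q": "_exists_:primaryRepresentation+copyright:CC",
        "size": self.batch_limit,
        "from": self.from_,
    }
```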
I will update `"size"` with `batch_limit`. Please take a look at the comment below about the `batch_limit` default value.
```python
url = information.get("primaryRepresentation")

thumbnail_url = f"{url}?rendering=thumbnail.jpg"
```
The thumbnail their API provides is tiny, with a fixed width of 70px. @obulat would know best -- should we use this, or just default to `None` here and use our own thumbnail service? They also have a slightly bigger `preview` rendering with a fixed width of 100px.
We have previously discussed the thumbnail sizes, and decided against using thumbnails smaller than 600px: #675 (comment)
@stacimc I tried to add the url with …

From a quick look, it does look like the API returns image urls with …

Reiterating this point from earlier: it occurred to me as I was looking at this again, is there a reason the …

My apologies, I should have stated why the size was set to a default of 100. I didn't know what the correct …
Thank you for such a great contribution, @ngken0995! I ran the DAG locally, and it works well.

I am concerned with the quality of the data we collect, though. I got around 60 items locally, and a very large proportion of them show either a "Server error" or an "Online image not available" placeholder for the main image file.

I think we should check the main `url` before saving the item to the catalog for this provider; otherwise we risk saving a lot of dead links. What do you think, @openverse-catalog? (See the sketch after the table below.)

Here's a sample of the urls I got locally:
```
+---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| url                                               | foreign_landing_url                                                                                     | title                              |
|---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------|
| https://api.aucklandmuseum.com/id/media/v/2882    | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-18430     | jar                                |
| https://api.aucklandmuseum.com/id/media/v/117250  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-15507     | glass, wine                        |
| https://api.aucklandmuseum.com/id/media/v/3191    | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-9923      | jar, lidded                        |
| https://api.aucklandmuseum.com/id/media/v/861840  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-22679     | briefs, pair                       |
| https://api.aucklandmuseum.com/id/media/v/370276  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-608181    | cartridges                         |
| https://api.aucklandmuseum.com/id/media/v/325015  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-4375      | teabowl                            |
| https://api.aucklandmuseum.com/id/media/v/528116  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-1151      | bowl, lidded                       |
| https://api.aucklandmuseum.com/id/media/v/34541   | https://www.aucklandmuseum.com/collections-research/collections/record/am_naturalsciences-object-368805 | Carex resectans Cheeseman          |
| https://api.aucklandmuseum.com/id/media/v/828322  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-90135     | skirt, wool                        |
| https://api.aucklandmuseum.com/id/media/v/229298  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-61319     | tablecloth, signature              |
| https://api.aucklandmuseum.com/id/media/v/75802   | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-14260     | cup                                |
| https://api.aucklandmuseum.com/id/media/v/117280  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-12592     | goblet                             |
+---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------+
```
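A minimal sketch of the pre-save check suggested above, assuming the ingester's delayed requester exposes a `head()` helper (plausible, since the script already issues HEAD requests for filesize, but unverified):

```python
def _url_is_live(self, url: str) -> bool:
    # HEAD the image URL and treat anything other than a 200 (after
    # following redirects) as a dead link that should not be saved.
    response = self.delayed_requester.head(url, allow_redirects=True)
    return response is not None and response.status_code == 200
```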
Co-authored-by: Olga Bulat <obulat@gmail.com>
@obulat which ones did you see that errored? I chose some random ones from the list you shared, and they all worked for me. I'm curious to see whether they're present in the Wikimedia Commons dataset (perhaps they've already sorted through those ones in the Wikimedia dump).
I think you can get to the main images for the next 2 sections from their landing URLs, but it's not easy to derive their URLs.

"Online image not available" placeholder: https://api.aucklandmuseum.com/id/media/v/3191, https://api.aucklandmuseum.com/id/media/v/2882

Internal server error: https://api.aucklandmuseum.com/id/media/v/861840

No image on the landing page: https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-9923

Placeholder image on the landing page

Maybe there's a geographical access problem? It's a good idea to check for these items in Wikimedia.
I have the same errors and I'm based in the US.
From what my spouse has told me, the museum is quick to remove public access things for cultural sensitivity reasons, in favour of having controlled access with a culturally relevant approach. I wouldn't be surprised if there are public records for those items where the images aren't available.

If we want to check, I'm pretty sure that the museum would respond to an email from us, and if not, I can ask my spouse to get us in touch with someone; they still have connections with folks there. At the very least we could clarify which of these intentionally do not have access, so that we 100% do not index them (knowing that they won't ever be available), and which are technical (potentially temporary) access issues, if such a distinction even exists.
Very interesting, I'm glad you spotted this @obulat! Given the complexity of the data quality questions, and the fact that we already have outstanding work required for de-duplicating these results with Wikimedia, I think it would be reasonable to open a separate issue and PR for addressing this. In the meantime, I think this PR could be merged as-is, although the data quality should be addressed before we actually turn the DAG on in production and begin ingesting. This is in line with how we've managed the addition of some other providers that ended up being very complex (e.g. iNaturalist). We should add a row to the DAG Status page explaining why the DAG is not yet enabled, and prioritize that work separately.
Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat

Excluding weekend days, this PR was ready for review 14 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s).

@ngken0995, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
I like this solution, @stacimc, nice to have such a workaround.
Thank you for adding a new DAG, @ngken0995, and all of your patience during the review! It's interesting that this DAG adds a new POST request to the ingester.
Fixes
Fixes #1771 by @obulat
Description
Adds the script to get all the media from aucklandmuseum.com. Currently, 10,000 images are ingested, because the search API caps results at 10,000 items.

To collect the filesize, this script makes HEAD requests for individual media items, which makes the script slower than expected. That should be okay considering the number of available images.

Image dimensions are not available, so they will need to be collected separately in the future.
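Regarding the HEAD requests for filesize above: a sketch of that lookup, assuming a requester with a `head()` helper; only the `Content-Length` header itself is standard HTTP:

```python
def _get_filesize(self, url: str) -> int | None:
    # A HEAD request returns headers only, so it is much cheaper than a GET,
    # but it still costs one round trip per media item.
    response = self.delayed_requester.head(url)
    if response is not None and (size := response.headers.get("Content-Length")):
        return int(size)
    return None
```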
Testing Instructions
Run `just catalog/test -k auckland_museum`