Description
opened on Sep 10, 2024
Images ingested into Openverse from Flickr are using Flickr tags in a non-optimal way. Observe the following Openverse result's tags:
https://openverse.org/image/ea4dff9b-7337-47ab-9fac-c9c4bd7860a9
As you can plainly see, many of the tags are multi-word phrases that are compressed into single words with spaces removed. For example:
- thegrapesofwrath => the grapes of wrath
- cottondress => cotton dress
When viewing the result on Flickr, the tags look correct:
So, what is going on?
Well, the search endpoint in Flickr, which we use in our Flickr DAG, returns the "cleaned" version of the tags. This is the version used in URLs and as identifiers on Flickr, as documented here:
https://www.flickr.com/services/api/misc.tags.html
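To make the "cleaned" form concrete, here is a rough approximation of the transformation in Python. The function name `approximate_clean` is hypothetical, and the rule (lowercase, then drop everything that is not a letter or digit) is an assumption inferred from the examples in this issue, not Flickr's actual implementation:

```python
def approximate_clean(raw_tag: str) -> str:
    """Approximate Flickr's "cleaned" tag form.

    Assumption: cleaning lowercases the tag and removes spaces and
    punctuation. Flickr's real rules may differ in edge cases.
    """
    return "".join(ch for ch in raw_tag.lower() if ch.isalnum())


# Examples taken from the tags quoted in this issue.
print(approximate_clean("the grapes of wrath"))
print(approximate_clean("Pie Town, New Mexico"))
```

The point is that the transformation is lossy: word boundaries cannot be recovered reliably from the cleaned form alone.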
When querying the single result for an image with Flickr's photos.getInfo endpoint, like so:
http https://api.flickr.com/services/rest method==flickr.photos.getInfo api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1 | jq '.photo.tags.tag[].raw'
You can see that the "raw" human-readable tags are available:
"great depression"
"national archives"
"recession"
"depression"
"cardboard house"
"cotton dress"
"poor"
"financial ruin"
"economic disaster"
"sharecroppers"
"the grapes of wrath"
"Tom Joad"
"the crisis"
"le crise"
"la crisis"
"coca-cola"
"1930"
"Farm Security Administration-Office of War Information Collection"
"FSA-OWI"
"Jack Whinery"
"homesteaders"
"Pie Town, New Mexico"
"Evan Lawrence Bench"
It is these tags we should be using in Openverse.
This presents a technical challenge to us in that these tags are only accessible via single results.
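A minimal sketch of what that limitation implies: one lookup per photo. Both `collect_raw_tags` and the injected `fetch_raw_tags` callable are hypothetical names; the callable stands in for whatever single-result request ends up being used, so nothing here is existing Openverse code.

```python
from typing import Callable, Iterable


def collect_raw_tags(
    photo_ids: Iterable[str],
    fetch_raw_tags: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Look up raw tags one photo at a time.

    `fetch_raw_tags` is a hypothetical callable that performs the
    single-result request for one photo ID and returns its raw tags.
    """
    raw_tags_by_photo: dict[str, list[str]] = {}
    for photo_id in photo_ids:
        # Skip IDs already fetched so each photo costs at most one request.
        if photo_id in raw_tags_by_photo:
            continue
        raw_tags_by_photo[photo_id] = fetch_raw_tags(photo_id)
    return raw_tags_by_photo
```

Injecting the fetcher keeps the request cost visible (one call per unique photo) and lets the loop be tested without network access.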
Here is the payload for a single tag, from the list of tags returned by photos.getInfo:
id "2045382-2750282427-19380346"
author "19762676@N00"
authorname "austinevan"
raw "Pie Town, New Mexico"
_content "pietownnewmexico"
machine_tag 0
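As a sketch of how this payload could be used, the following maps each cleaned tag to its raw form. The function name `cleaned_to_raw` is hypothetical; the response shape (`photo.tags.tag[]` with `raw` and `_content` keys) is taken from the jq path and the payload shown above.

```python
def cleaned_to_raw(response: dict) -> dict[str, str]:
    """Map each cleaned tag (`_content`) to its human-readable `raw` form.

    Expects the JSON body of a photos.getInfo response, already parsed
    into a dict. Tags missing either key are skipped.
    """
    tags = response.get("photo", {}).get("tags", {}).get("tag", [])
    return {
        tag["_content"]: tag["raw"]
        for tag in tags
        if "_content" in tag and "raw" in tag
    }


# Sample built from the single-tag payload quoted in this issue.
sample = {
    "photo": {
        "tags": {
            "tag": [
                {
                    "id": "2045382-2750282427-19380346",
                    "author": "19762676@N00",
                    "authorname": "austinevan",
                    "raw": "Pie Town, New Mexico",
                    "_content": "pietownnewmexico",
                    "machine_tag": 0,
                }
            ]
        }
    }
}
print(cleaned_to_raw(sample))
```

Keying on the cleaned value would make it possible to swap the raw tag in wherever the cleaned one is already stored.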
Edit: I also just noticed that tags.getListPhoto might be a better endpoint to use, as it only returns tags:
http https://api.flickr.com/services/rest method==flickr.tags.getListPhoto api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1
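The same request could be made from Python roughly as follows. This is a sketch using only the standard library; `build_tag_request_url` and `fetch_json` are hypothetical helper names, and the query parameters simply mirror the command above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FLICKR_REST_URL = "https://api.flickr.com/services/rest"


def build_tag_request_url(
    photo_id: str,
    api_key: str,
    method: str = "flickr.tags.getListPhoto",
) -> str:
    """Build the REST URL for a single-photo tag lookup."""
    params = {
        "method": method,
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": "1",
    }
    return f"{FLICKR_REST_URL}?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and parse the body as JSON (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Passing `method="flickr.photos.getInfo"` builds the earlier request instead, so the two endpoints can be compared with the same helper.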