Flickr results do not use "raw" (human readable) tags #4906

Open

Description

Images ingested into Openverse from Flickr are using Flickr tags in a non-optimal way. Observe the following Openverse result's tags:

https://openverse.org/image/ea4dff9b-7337-47ab-9fac-c9c4bd7860a9

[Screenshot from 2024-09-10: tags as shown on the Openverse result page]

As you can plainly see, many of the tags are multi-word phrases that are compressed into single words with spaces removed. For example:

  • thegrapesofwrath => the grapes of wrath
  • cottondress => cotton dress

When viewing the result on Flickr, the tags look correct:

[Image: the same result's tags as shown on Flickr]

So, what is going on?

Well, the search endpoint in Flickr, which we use in our Flickr DAG, returns the "cleaned" version of the tags. These are the versions used in URLs and as identifiers on Flickr, as documented here:

https://www.flickr.com/services/api/misc.tags.html

When querying the single result for an image with Flickr's flickr.photos.getInfo endpoint, like so:

http https://api.flickr.com/services/rest method==flickr.photos.getInfo api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1 | jq '.photo.tags.tag[].raw'

You can see that the "raw" human-readable tags are available:

"great depression"
"national archives"
"recession"
"depression"
"cardboard house"
"cotton dress"
"poor"
"financial ruin"
"economic disaster"
"sharecroppers"
"the grapes of wrath"
"Tom Joad"
"the crisis"
"le crise"
"la crisis"
"coca-cola"
"1930"
"Farm Security Administration-Office of War Information Collection"
"FSA-OWI"
"Jack Whinery"
"homesteaders"
"Pie Town, New Mexico"
"Evan Lawrence Bench"

It is these tags we should be using in Openverse.
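
For illustration, here is a minimal Python sketch of the same extraction the jq filter above performs (`.photo.tags.tag[].raw`). The function name `extract_raw_tags` and the trimmed `sample` payload are hypothetical and not taken from the existing DAG; the response shape is assumed from the getInfo output shown in this issue.

```python
from typing import Any


def extract_raw_tags(response: dict[str, Any]) -> list[str]:
    """Return the human-readable ("raw") tags from a parsed getInfo response.

    Mirrors the jq filter `.photo.tags.tag[].raw`. Missing keys yield an
    empty list instead of raising, and tags without a raw value are skipped.
    """
    tags = response.get("photo", {}).get("tags", {}).get("tag", [])
    return [tag["raw"] for tag in tags if tag.get("raw")]


# Trimmed, hypothetical sample shaped like the response queried above.
sample = {
    "photo": {
        "tags": {
            "tag": [
                {"raw": "the grapes of wrath", "_content": "thegrapesofwrath"},
                {"raw": "cotton dress", "_content": "cottondress"},
            ]
        }
    }
}

print(extract_raw_tags(sample))
```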

This presents a technical challenge to us: these tags are only accessible via per-photo requests, not via the search endpoint we currently ingest from.

Here is the payload for a single tag, from the list of tags returned by flickr.photos.getInfo:

id	"2045382-2750282427-19380346"
author	"19762676@N00"
authorname	"austinevan"
raw	"Pie Town, New Mexico"
_content	"pietownnewmexico"
machine_tag	0
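
Since each tag object carries both forms, one possible approach is to build a lookup from the cleaned `_content` value to the `raw` value and use it to translate the cleaned tags we already have. This is only a sketch: `cleaned_to_raw` and `humanize` are hypothetical helpers, and it assumes the cleaned tags from search have already been split into a list of strings.

```python
def cleaned_to_raw(tags: list[dict]) -> dict[str, str]:
    """Map each cleaned tag (`_content`) to its human-readable `raw` form."""
    return {
        tag["_content"]: tag["raw"]
        for tag in tags
        if tag.get("_content") and tag.get("raw")
    }


def humanize(cleaned_tags: list[str], lookup: dict[str, str]) -> list[str]:
    """Replace cleaned tags with raw ones, keeping unknown tags unchanged."""
    return [lookup.get(tag, tag) for tag in cleaned_tags]


# The single-tag payload shown above, as a Python dict.
pie_town = {
    "id": "2045382-2750282427-19380346",
    "author": "19762676@N00",
    "authorname": "austinevan",
    "raw": "Pie Town, New Mexico",
    "_content": "pietownnewmexico",
    "machine_tag": 0,
}

lookup = cleaned_to_raw([pie_town])
print(humanize(["pietownnewmexico", "poor"], lookup))
```

Falling back to the cleaned tag when no raw form is known keeps existing tags intact for photos where the per-photo lookup has not run or has failed.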

Edit: I also just noticed that tags.getListPhoto might be a better endpoint to use, as it only returns tags:

http https://api.flickr.com/services/rest method==flickr.tags.getListPhoto api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1
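
As a rough sketch, the same request could be assembled in Python with only the standard library. The helper name `build_tags_url` is hypothetical, the query parameters are copied from the httpie command above, and no network call is made here.

```python
from urllib.parse import urlencode

FLICKR_REST_URL = "https://api.flickr.com/services/rest"


def build_tags_url(photo_id: str, api_key: str) -> str:
    """Build the flickr.tags.getListPhoto request URL for one photo."""
    params = {
        "method": "flickr.tags.getListPhoto",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_REST_URL}?{urlencode(params)}"


# "KEY" is a placeholder; a real API key is required to query Flickr.
print(build_tags_url("2750282427", "KEY"))
```

Note that this is still one request per photo, so the cost concern described above applies to this endpoint as well.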
