Description
Problem
Tags encoding/decoding appears to not work correctly. Tags with diacritics are not rendered in an expected way.
See https://api.openverse.engineering/v1/images/017ee061-3413-4c44-a610-077adf0635dd/ as an example where remédios
is rendered remu00e9dios
in the tags.
Description
I don't know if this is an issue during ingestion or with decoding the output during serialization. Either way, it seems inevitably easiest to fix on the API side as fixing an ingestion bug would require backporting the fix. It does not strike me as an expensive operation to do API side. Perhaps there is an Elasticsearch index option that would fix this as well, but it probably depends on what the root cause of this is.
If these tags are stored incorrectly coded in Elasticsearch then queries against them would not work. Openverse does not currently support querying in languages other than English, which has very few diacritics ("naïve" and other double vowel with umlauts being the only examples I can think of regularly other than loan phrases from French like "vis-à-vis" or "à la" etc). That means that we already expect queries against the vast majority of these terms not to work, but if it is indeed the case that these terms are present in ES incorrectly coded, then we'll need to make a note of that for future work that enables querying in other languages.
The outcome of this issue should be (a) a temporary fix to make the returned data from the API correctly formatted and (b) a new issue (if needed) to fix whatever root cause, if that root cause would cause issues with querying tags with diacritics.
Additional context
I marked this as medium priority because while it's a grave issue, Openverse explicitly does not fully support queries in other languages.
Metadata
Assignees
Labels
Type
Projects
Status
📋 Backlog