Upload Wizard cannot find 和泉 (杉並区) #5794

Open
nicolas-raoul opened this issue Aug 28, 2024 · 6 comments

Comments

@nicolas-raoul
Member

nicolas-raoul commented Aug 28, 2024

I took a picture in the 和泉 neighborhood:
https://ja.wikipedia.org/wiki/%E5%92%8C%E6%B3%89_%28%E6%9D%89%E4%B8%A6%E5%8C%BA%29

Because there are many more famous towns and people with the same name, it does not make it to the suggestions:

Screenshot_20240828-105202.png

No surprise so far.

But when I type the exact full Wikipedia article name 和泉 (杉並区), I get nothing:

Screenshot_20240828-110311.png

To select the correct depiction, the user has to navigate to the Wikidata item (https://m.wikidata.org/wiki/Q13495859), which is a pain to do on mobile, copy its QID, and then paste it into our app's depiction search textbox:

Screenshot_20240828-144639~2.png

It is not a Japanese-specific issue, it can happen for any language.

Maybe the Wikidata search API has an option to match also via article titles?

If not, this will be difficult to implement: we might have to call an additional API to get potential Wikidata items via the Wikipedia article titles. Or we could batch-add article titles as aliases if it is OK from an editorial point of view.

@mnalis
Contributor

mnalis commented Aug 28, 2024

But when I type the exact full Wikipedia article name 和泉 (杉並区), I get nothing:

As far as I understand, that is expected, as the Commons app does not search Wikipedia for the Depicts field, only Wikidata:

@GET("/w/api.php?action=wbsearchentities&format=json&type=item&uselang=en")
fun searchForDepicts(
    @Query("search") query: String?,
    @Query("limit") limit: String?,
    @Query("language") language: String?,
    @Query("uselang") uselang: String?,
    @Query("continue") offset: String?
): Single<DepictSearchResponse>

and it only fetches 25 elements:

In your case, it searches for this, which does not include Q13495859. Even if we increased the limit to the server maximum of 50, it still would not be found, because it is somewhere between the 50th and 100th match, i.e. here

Or we could batch-add article titles as aliases if it is OK from an editorial point of view.

Unfortunately I cannot read Japanese script so cannot tell if this specific case would be OK, but if it describes the city by its alternative names, it should be OK. More guidance may be found at: https://www.wikidata.org/wiki/Help:Aliases

However, a mass import of data not verified by a human being is not likely to be OK (and should definitely be discussed with Wikidata admins first, even if it sounds like a good idea). See https://www.wikidata.org/wiki/Wikidata:Data_Import_Guide for general considerations.
And specifically, for importing from Wikipedia I'd foresee license issues (Wikidata being CC0, and Wikipedia mostly CC-BY-SA 4.0, which cannot be imported to CC0)

It is not a Japanese-specific issue, it can happen for any language.

That is correct. For any popular name that has more than 25 matches, if the item you are looking for is not in the top 25, you won't find it 😢

@mnalis
Contributor

mnalis commented Aug 28, 2024

However, as that API supports paginated search, this could be handled similarly to the idea proposed for category search in the second bullet point of #3179 (comment), i.e. add a Load more button at the bottom of the results, so:

  • the first search would request limit=25&continue=0&search=和泉
  • a click on Load more would request limit=25&continue=25&search=和泉, and append that to the list of results
  • the next click on Load more would request limit=25&continue=50&search=和泉, and append that to the list of results
  • the next click on Load more would request limit=25&continue=75&search=和泉, and append that to the list of results

(etc. You get the idea, but the match for this specific issue would already have been found by then.)

That way, you'd be able to find your popular search term in all cases.
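
A minimal Kotlin sketch of the Load more flow above, under stated assumptions: `DepictsPager` and its `fetchPage` parameter are hypothetical names (not from the app), and `fetchPage` stands in for a `wbsearchentities` call with the given `limit` and `continue` values. It only illustrates the offset bookkeeping and the appending of pages, not the UI.

```kotlin
// Hypothetical sketch: DepictsPager and fetchPage are illustrative names,
// not part of the Commons app.
class DepictsPager(
    private val query: String,
    private val pageSize: Int = 25,
    // Stand-in for the network call: (search, limit, continue) -> one page of results.
    private val fetchPage: (String, Int, Int) -> List<String>
) {
    private val loaded = mutableListOf<String>()
    private var nextOffset = 0
    private var exhausted = false

    /** Everything loaded so far, in server order. */
    val results: List<String> get() = loaded

    /** True while a "Load more" button should still be offered. */
    val canLoadMore: Boolean get() = !exhausted

    /** Fetches the next page, appends it, and returns the accumulated list. */
    fun loadMore(): List<String> {
        if (exhausted) return loaded
        val page = fetchPage(query, pageSize, nextOffset)
        loaded += page
        nextOffset += pageSize
        // A short page means the server has nothing further to return.
        if (page.size < pageSize) exhausted = true
        return loaded
    }
}

// Usage example against a fake 80-item index whose wanted item sits at position 60.
fun demoOffsets(): List<Int> {
    val fakeIndex = List(80) { i -> if (i == 60) "target" else "item$i" }
    val requested = mutableListOf<Int>()
    val pager = DepictsPager("和泉") { _, limit, offset ->
        requested += offset
        fakeIndex.drop(offset).take(limit)
    }
    while ("target" !in pager.results && pager.canLoadMore) pager.loadMore()
    return requested // the continue values that were actually requested
}
```

In the fake run, the wanted item turns up on the third request (continue=50). A total that is an exact multiple of the page size costs one extra, empty request before the button disappears.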


Alternatives to Load more...:

  • auto-load on scroll down (perhaps more seamless, but less discoverable I think), or
  • clearing the list completely and always having just 25 results shown, but instead of Load more having Previous and Next buttons (greyed-out if not available, of course) at the bottom

@nicolas-raoul
Member Author

@mnalis Thanks for the link https://www.wikidata.org/wiki/Help:Aliases !
This use case is not described, but not outright banned either... would you mind asking on the talk page?
If implementing this via an additional API call, the number of results will be small (most likely 0 or 1) so just appending it to the existing results is fine even without paging.

An English equivalent could be Paris, Texas: https://www.wikidata.org/wiki/Q830149
Interestingly, this one has "Paris, Texas" as an alias, presumably because people sometimes actually say "Paris, Texas" in normal conversations. That is not true for many other concepts, such as "Spring (hydrology)". Also, due to the opposite grammatical order, nobody would say or write 和泉(杉並区); they would use 杉並区和泉.

@mnalis
Contributor

mnalis commented Aug 29, 2024

This use case is not described, but not outright banned either...

If you are talking about "we could batch-add article titles as aliases" as the idea here, it looks like it is prohibited by step 1 of the import guidelines that I linked to.

If, however, you are talking about fixing this one specific example only, it would be best if you asked about it (I can't even read the script, much less translate it or weigh its nuances)

If implementing this via an additional API call, the number of results will be small (most likely 0 or 1) so just appending it to the existing results is fine even without paging.

Perhaps, if we use some third API for searching Wikipedia articles by exact title. But note that it would likely rarely help, as Wikipedia titles are finicky, and IIRC the user would somehow have to specify in advance which Wikipedia language to search.

E.g. I don't think that searching for "Thành phố Hồ Chí Minh" in the titles of English Wikipedia would work, and searching for "Ho Chi Minh" on English Wikipedia won't work either if you only match on the exact title, as the article is named "Ho Chi Minh City". And if you go after partial title matches, there will be many more than 0 or 1 results (there are likely many articles starting with "Ho"), and you'd still need paging (which is more complex when you need to page two different APIs at the same time!)
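
For reference, a hedged sketch of what such an exact-title lookup could look like. Wikidata's `wbgetentities` module accepts `sites` and `titles` parameters that resolve a wiki page title to its item. The helper name `sitelinkLookupUrl` is hypothetical, and deriving the site id as language code plus "wiki" is a simplifying assumption. As noted above, the caller still has to pick the Wikipedia language up front.

```kotlin
import java.net.URLEncoder

// Hypothetical helper, not part of the Commons app. It builds a wbgetentities
// request that resolves an exact Wikipedia article title to its Wikidata item.
fun sitelinkLookupUrl(articleTitle: String, wikiLanguage: String): String {
    // Assumption: the site id is the language code plus "wiki" (ja -> jawiki),
    // with hyphens replaced by underscores.
    val site = wikiLanguage.replace('-', '_') + "wiki"
    // Form-style encoding: spaces become '+', other reserved characters are percent-encoded.
    val title = URLEncoder.encode(articleTitle, "UTF-8")
    return "https://www.wikidata.org/w/api.php" +
        "?action=wbgetentities&format=json&props=labels" +
        "&sites=$site&titles=$title"
}
```

For example, `sitelinkLookupUrl("Paris, Texas", "en")` ends in `&sites=enwiki&titles=Paris%2C+Texas`. Because the match is on the exact title, "Ho Chi Minh" would not resolve to the "Ho Chi Minh City" article.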

Given that just paging on your original query would've solved the issue, I think that should be the first step anyway (as you'd likely have to implement it for the more complex solutions too)

@whym
Collaborator

whym commented Aug 29, 2024

In the Wikidata website's search results, the town is in 3rd place for 杉並 和泉 (or Suginami[space]Izumi). So just adding the right terms for disambiguation seems to help, and that's what I would do manually if I didn't find it with just 和泉 (Izumi).

More broadly, perhaps we could filter and rerank the raw search results to prominently show items that are more likely to be the target of a depiction. In this case, the same term (in written form) 和泉 can refer to family names, but names are not very likely to be depicted in a photo:
https://www.wikidata.org/wiki/Q26216237
https://www.wikidata.org/wiki/Q26216228

Izumi (Q13495859) has a location, so we could theoretically check how close it is to the user's location and use that to boost it in reranking.
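
A rough sketch of the proximity boost, assuming each search hit has already been enriched with optional coordinates. The `Candidate` type, the function names, and the 5 km default radius are all hypothetical, and fetching the coordinates is not shown. Hits within the radius move to the front, nearest first, and everything else keeps its original search order.

```kotlin
import kotlin.math.asin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin
import kotlin.math.sqrt

// Hypothetical model of one search hit; lat/lon stay null for items without coordinates.
data class Candidate(val id: String, val lat: Double? = null, val lon: Double? = null)

/** Great-circle distance in kilometres (haversine formula, mean Earth radius 6371 km). */
fun distanceKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double {
    val dLat = Math.toRadians(lat2 - lat1)
    val dLon = Math.toRadians(lon2 - lon1)
    val a = sin(dLat / 2).pow(2) +
        cos(Math.toRadians(lat1)) * cos(Math.toRadians(lat2)) * sin(dLon / 2).pow(2)
    return 2 * 6371.0 * asin(sqrt(a))
}

/** Moves hits within radiusKm of the user to the front, nearest first. */
fun rerankByProximity(
    hits: List<Candidate>,
    userLat: Double,
    userLon: Double,
    radiusKm: Double = 5.0 // arbitrary illustrative threshold
): List<Candidate> {
    // partition preserves the original order inside both groups.
    val (near, rest) = hits.partition { c ->
        c.lat != null && c.lon != null &&
            distanceKm(userLat, userLon, c.lat, c.lon) <= radiusKm
    }
    // sortedBy is stable, so equally distant hits keep their search order.
    return near.sortedBy { distanceKm(userLat, userLon, it.lat!!, it.lon!!) } + rest
}
```

Demoting items such as family names would need extra data per hit (for instance what the item is an instance of), which this sketch leaves out.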

@nicolas-raoul
Member Author

Great finding!

We should use the same search API URL as the desktop website; it gets results where we get nothing.

This sounds much easier to implement than the solutions we had considered above.

The proximity idea is great for a subsequent phase.
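
A hedged sketch of what switching search endpoints might look like. Which endpoint the desktop website actually calls is not established in this thread. The code below assumes MediaWiki's full-text search module (`action=query&list=search`) is a reasonable candidate, and the helper name `fullTextSearchUrl` is hypothetical.

```kotlin
import java.net.URLEncoder

// Hypothetical helper, not part of the Commons app. It assumes that MediaWiki's
// full-text search module (list=search) is a suitable replacement for
// wbsearchentities; that has not been verified here.
fun fullTextSearchUrl(terms: String, limit: Int = 25, offset: Int = 0): String {
    val encoded = URLEncoder.encode(terms, "UTF-8")
    return "https://www.wikidata.org/w/api.php" +
        "?action=query&list=search&format=json" +
        "&srnamespace=0" + // namespace 0 holds Wikidata items
        "&srsearch=$encoded&srlimit=$limit&sroffset=$offset"
}
```

`sroffset` plays the role that `continue` plays in the current query, so the same paging approach would carry over. The response shape differs from `wbsearchentities`, so the parsing code would also need to change.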

Search results from the desktop website:

Screenshot_20240829-221016.png

Screenshot_20240829-220701.png

Screenshot_20240829-220622.png

Surprisingly the mobile website's search is not good:

Screenshot_20240829-220409~2.png

Screenshot_20240829-220430~2.png
