Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: categories sometimes missing in search (while visible in manual API calls!) #5749

Open
mnalis opened this issue Jun 3, 2024 · 12 comments

Comments

@mnalis
Copy link
Contributor

mnalis commented Jun 3, 2024

Summary

While doing the research for #3179 (comment) I've noticed that manually run API calls do not always return same as I see in the commons app.

According to @sivaraam for searching this API should be used in such case:

@GET("w/api.php?action=query&format=json&formatversion=2&generator=allcategories&prop=categoryinfo|description|pageimages&piprop=thumbnail&pithumbsize=70")
fun searchCategoriesForPrefix(
@Query("gacprefix") prefix: String?,
@Query("gaclimit") itemLimit: Int,
@Query("gacoffset") offset: Int
): Single<MwQueryResponse>

but when I call that API manually (see "Expected behaviour" section), I get some elements missing (i.e. 2 results instead of 3). Could someone help debug this?


Probably unrelated, but API also returns warning "Unrecognized parameter: gacoffset." so it seems like we shouldn't be using that anymore; but does not seem like it should be the cause of the issue.

Steps to reproduce

  1. start uploading image and proceed to "Step 3" (category selection)
  2. type Olea (s
  3. wait for category results to display

Expected behaviour

3 categories displayed:

  • "Category:Olea (seedlings)"
  • "Category:Olea (surname)"
  • "Category:Olea (ship, 1981)"

as that is what API call seems to return when run manually:
https://commons.wikimedia.org/w/api.php?format=json&action=query&formatversion=2&generator=allcategories&prop=categoryinfo|description|pageimages&piprop=thumbnail&pithumbsize=70&gaclimit=25&gacoffset=0&gacprefix=Olea+(s

Actual behaviour

Only 2 categories displayed (See screenshot):

  • "Category:Olea (seedlings)"
  • "Category:Olea (surname)"

Device name

Samsung Galaxy S23+

Android version

Android 14 (OneUI 6.1)

Commons app version

5.0.1~af028cbdd (latest F-droid)

Device logs

nothing worthwhile (i.e. nothing mentioning the search or API calls or "Olea"...)

Screen-shots

small_Screenshot_20240603_205048_Commons

Would you like to work on the issue?

None

@mnalis mnalis added the bug label Jun 3, 2024
@sivaraam
Copy link
Member

sivaraam commented Jun 4, 2024

Nice observation. Thanks for opening this issue, @mnalis !

As you note, the warning about gacoffset is not a problem. The problem is this code-snippet in our codebase where we intentionally filter out categories that have a year in them. I don't think we aimed to filter out valid categories like this one. The intention was only to filter out the invalid ones. Looks like this one got filtered out as a side effect.

We should explore tweaking the logic to avoid this.

@mnalis
Copy link
Contributor Author

mnalis commented Jun 4, 2024

That API call seems to return boolean field named hidden, so I first wondered wouldn't just skipping categories marked as hidden instead be enough? Quick check seems to indicate that Media needing categories mentioned in code snippet has hidden: true. Although Photographs taken on does not (for whatever reason, it definitely looks like it should be hidden category)... So that is dead end.


However #1029 is another problem; those categories should not be hidden, but they might not be useful for upload of new pictures. However, I find such overreaching regex as .*(19|20)\\d{2}.* (which filters out anything containing 4digit year in this and previous century) is waaay too aggressive.

I do not find in that ticket the reasoning to exclude everything resembling a year (closest is to filter out things "in the 1960s", but that has separate regex .*0s.*, and always has s after 0 at the end to indicate decade).

I would suggest just removing regex .*(19|20)\\d{2}.* unless someone (@nicolas-raoul, you opened that issue initially, do you remember any such cases?) can provide examples when category search is producing irrelevant categories containing years for the users, in which case I'd write separate (much more specific / with less false positives) regex.

While there I'd also update .*(200|201)0s.* regex to cover 2020s too (i.e. .*20[0-2]0s.*), as that has become a thing lately.

(unless someone else would like to do their first code contribution, I could also make a PR for that -- if people agree with the reasoning above)

@nicolas-raoul
Copy link
Member

Categories with years were really polluting category suggestions before that regexp. Whatever you typed, you would get 20 times the same result with different year, burying all but one useful result.

The ship being unduly filtered is a sad side-effect, but a quite rare occurrence.

App-side filtering could be a solution.

@HrithikPadavala
Copy link

@mnalis could you please assign me for this task

@nicolas-raoul
Copy link
Member

@HrithikPadavala We have not yet reached a conclusion about what strategy to use. Would you mind trying another bug? Thanks! :-)

@HrithikPadavala
Copy link

@nicolas-raoul sure, thanks

@sivaraam
Copy link
Member

sivaraam commented Jun 9, 2024

@HrithikPadavala There are some issues which we explicitly need for the upcoming 5.0.2 release. You could check them here: https://github.com/commons-app/apps-android-commons/milestone/11

Feel free to work on something that you may find interesting there 🙂

@sivaraam
Copy link
Member

sivaraam commented Jun 9, 2024

That API call seems to return boolean field named hidden, so I first wondered wouldn't just skipping categories marked as hidden instead be enough? Quick check seems to indicate that Media needing categories mentioned in code snippet has hidden: true. Although Photographs taken on does not (for whatever reason, it definitely looks like it should be hidden category)... So that is dead end.

Yeah. That is unfortunate. Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?

However #1029 is another problem; those categories should not be hidden, but they might not be useful for upload of new pictures. However, I find such overreaching regex as .*(19|20)\\d{2}.* (which filters out anything containing 4digit year in this and previous century) is waaay too aggressive.

I could understand this.

I do not find in that ticket the reasoning to exclude everything resembling a year

I think that's because the year filtering has been in place even before that ticket. The relevant discussion could be seen in #47.

(closest is to filter out things "in the 1960s", but that has separate regex .0s., and always has s after 0 at the end to indicate decade).

It seems to me like #47 only focused on categories other than "in the 1990s" etc. So, its tough to conclude if removing the year filter altogether would be an ideal solution. Let's discuss to see if we could come up with a better solution to this problem.

While there I'd also update .*(200|201)0s.* regex to cover 2020s too (i.e. .*20[0-2]0s.*), as that has become a thing lately.

That's a good suggestion.

(unless someone else would like to do their first code contribution, I could also make a PR for that -- if people agree with the reasoning above)

I think for now you could feel free to raise a separate PR for updating the .*(200|201)0s.* regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂

@mnalis
Copy link
Contributor Author

mnalis commented Jun 24, 2024

Yeah. That is unfortunate. Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?

Perhaps. I'm not sure where to request that though, and it would not really remove the need to filter on our side (as I noted above not all problematic categories should be hidden) or simplify code - it would just reduce a number of regexes slightly. Given that regexes are something I can write well at least, it's probably not a priority (unless I'm missing something).

The ship being unduly filtered is a sad side-effect, but a quite rare occurrence.

I would disagree that is rare. For example it includes at least following categories:

  • ships (most if not all of them -- like the Olea (ship, 1981) above)
  • airplanes, cars, tanks, trucks, trams and basically all other vehicles - e.g. 1954 Maserati* series
  • railway infrastructure - e.g. Okehampton railway station (1997 to 2019)
  • bridges - e.g. North Avenue Bridge (Chicago, 1907 bascule)
  • buildings - e.g. Deutsche Bibliothek (1959 Frankfurt building)
  • weaponry - e.g. 102 35 gun 1914 model
  • sports - e.g. 1980 Chicago White Sox season
  • events - e.g. 1981 European Council in Maastricht
  • movies - e.g. 101 Dalmatians (1996 film)
  • books - e.g. "We" (1927 book)
  • most of the other art too (pictures, scultpures, etc) - e.g. Sacred Heart of Jesus (1960 sculpture)
  • documents - e.g. 1981 stock certificates from Germany
  • roadworks - e.g. 1981 tram track reconstructions in Prague
  • numismatics - e.g. Reichsmark banknotes in 1948 of the Soviet occupation zone
  • philatelists - e.g. Russian Revolution of 1905 on stamps

and probably many more (i.e. any human-built thing that was built to survive more than few years).

It seems to me like #47 only focused on categories other than "in the 1990s" etc. So, its tough to conclude if removing the year filter altogether would be an ideal solution. Let's discuss to see if we could come up with a better solution to this problem.

Ah, thanks for the pointer, that reference helps a lot! While I can see the reasoning, the issue is that such filter then leaves out way too much.

  • Perhaps the solution might be not to filter out everything that contain a year in a category name, but to instead sort it down to the end of the list?

    That way, people who are finding their result earlier in the list won't be bothered by them, and those who are desperately trying to find a category they are looking for will have to scroll to the bottom of the list, but would at least find their match there.

  • Alternative idea (building on idea in Make category search non case-sensitive and more user friendly #3179 (comment)) is to filter out categories with year initially, but it user does not find results and presses manual Load more... button, than we not only do extra API call, but also un-filter results we initially filtered (as it is obvious user is looking for any match to their search at that point).

I think for now you could feel free to raise a separate PR for updating the .(200|201)0s. regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂

Done that part in #5761

@sivaraam
Copy link
Member

sivaraam commented Jul 2, 2024

Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?

Perhaps. I'm not sure where to request that though, and it would not really remove the need to filter on our side (as I noted above not all problematic categories should be hidden) or simplify code - it would just reduce a number of regexes slightly. Given that regexes are something I can write well at least, it's probably not a priority (unless I'm missing something).

It's not a blocker certainly. I was mostly suggesting that as a way to start discussions on this so that we could at some future point of time reduce the amount of regex we need to filter out categories. It does take time to arrive at a consensus with the community 🙂

I would disagree that is rare. For example it includes at least following categories:

[ ... snip ... ]

and probably many more (i.e. any human-built thing that was built to survive more than few years).

Ah. That's a big list. Thanks for taking the time to mention the potential categories that we're filtering out accidentally as a consequence of that regex. It does seem like we should do something to improve the situation.

  • Perhaps the solution might be not to filter out everything that contain a year in a category name, but to instead sort it down to the end of the list?
    That way, people who are finding their result earlier in the list won't be bothered by them, and those who are desperately trying to find a category they are looking for will have to scroll to the bottom of the list, but would at least find their match there.

That's a good idea. We could certainly try this!

  • Alternative idea (building on idea in Make category search non case-sensitive and more user friendly #3179 (comment)) is to filter out categories with year initially, but it user does not find results and presses manual Load more... button, than we not only do extra API call, but also un-filter results we initially filtered (as it is obvious user is looking for any match to their search at that point).

This is also a fine idea but I believe the sorting down approach would be a bit more tangible as a quick remedy to the situation. If anyone is willing to achieve it this way, feel free to chime in.

I think for now you could feel free to raise a separate PR for updating the .(200|201)0s. regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂

Done that part in #5761

Great. Thanks for that! I've left a comment there. Kindly check the same. Once that's resolved, it should be good to go.

@sivaraam
Copy link
Member

@mnalis I just discovered that there's this long standing PR of mine #4902 which actually proposes to stop filtering out all categories that have an year in it. There's already some related discussion in issue #4901 too. Specifically, there's some consensus on what we should and should not filter out [ref].

So, I'm starting to do you think if we could just take forward a combination of your changes (#5761) and the one proposed in #4902 in the next release and see how it goes 🤔

@mnalis
Copy link
Contributor Author

mnalis commented Jul 14, 2024

@sivaraam sounds great to me; I'm all for it. 🚀

If we find out that few more unneeded categories start to pop out, we can tune the regexes to filter out only those problematic/spammy categories, instead of everything containing a year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants