This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Japanese, Chinese and other non-ASCII searching not quite working #901

Open
RickCogley opened this issue Jun 29, 2016 · 20 comments
Labels
A-I18n
A-Message-Search: Searching messages
O-Frequent: Affects or can be seen by most users regularly or impacts most users' first experience
S-Minor: Blocks non-critical functionality, workarounds exist.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@RickCogley
Contributor

RickCogley commented Jun 29, 2016

Hi - on a synapse homeserver running on postgresql 9.4, and connecting via vector, I'm having trouble searching Japanese.

If I enter:

バッタと鈴虫

(grasshopper and cricket)
I get a search hit when I search for バッタ, but not when I search for 鈴虫.

It appears that text at the beginning of a post is searchable, but text anywhere in the middle of the post is not. I tried this on my own homeserver and on the main matrix homeserver, with the same result. Kent on the main forum also reproduced it.

Is there a possibility to search using regex?
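
The symptom described above (a hit for the start of the message but not for the middle) is consistent with the whole unspaced string being indexed as a single token and queries being matched as token prefixes. The following is a simplified Python model of that behaviour, not PostgreSQL's actual parser; the whitespace-only tokenizer and the prefix-match rule are assumptions made for illustration.

```python
from __future__ import annotations


def tokenize(text: str) -> list[str]:
    """Split on whitespace only (assumed stand-in for a parser with no CJK word breaking)."""
    return text.lower().split()


def prefix_search(query: str, document: str) -> bool:
    """Return True if any token of the document starts with the query."""
    q = query.lower()
    return any(token.startswith(q) for token in tokenize(document))


message = "バッタと鈴虫"
print(prefix_search("バッタ", message))  # True: the query is a prefix of the single token
print(prefix_search("鈴虫", message))    # False: the query sits in the middle of the token
```

Under this model, the English gloss "grasshopper and cricket" would be searchable word by word, because spaces split it into three tokens.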

@RickCogley
Contributor Author

I don't know much about this, but this seems to be an extension people use to get full-text search on Japanese: http://pgbigm.osdn.jp/pg_bigm_en-1-1.html

@RickCogley
Contributor Author

Someone on the mattermost forum mentions how to do it for postgresql.
mattermost/mattermost#2159

@erikjohnston
Member

The idea is very much that the search API will accept a "locale" option in the future that handles searching more intelligently in different languages. Unfortunately, this really requires the server knowing which languages are going to be used up front so it can apply the correct indices.

> Is there a possibility to search using regex?

Alas not, as it wouldn't be possible to create any indices that would allow it.

@RickCogley
Contributor Author

hi @erikjohnston meanwhile, any way I can make it work for my situation - needing to index Japanese and English? --Rick

@erikjohnston
Member

I can't think of anything quick that will allow both Japanese and English; supporting more than one language at a time will require a bit of dev work.

@RickCogley
Contributor Author

Japanese has English interspersed in many cases, so I wonder if just supporting Japanese would also get us the English support. Any good interim ideas on how to do just Japanese, then? --Rick

@erikjohnston
Member

Installing Japanese into postgres and then changing all instances of to_tsvector('english', ...) and to_tsquery('english', ...) in synapse/storage/{search, room}.py to point to the Japanese configuration should do it. (Though you may need to change any existing data in the event_search.vector column from 'english' to 'japanese' somehow)
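
As a rough illustration of the change suggested above, the sketch below builds a match clause in which the text search configuration is a parameter instead of a hard-coded 'english'. The helper `search_sql`, the selected `event_id` column, and the 'japanese' configuration name are assumptions for illustration: this is not Synapse's actual query, and a Japanese configuration would first have to be installed in PostgreSQL.

```python
import re

# Hypothetical helper: neither this function nor the 'japanese' configuration
# name comes from Synapse or from a stock PostgreSQL install.
_CONFIG_NAME = re.compile(r"[a-z_][a-z0-9_]*")


def search_sql(config: str = "english") -> str:
    """Build a full-text match clause for the given text search configuration."""
    # Only allow simple identifiers, since the name is interpolated into SQL.
    if not _CONFIG_NAME.fullmatch(config):
        raise ValueError(f"unsafe text search configuration name: {config!r}")
    # %s is left as a driver placeholder for the user's search term.
    return (
        "SELECT event_id FROM event_search "
        f"WHERE vector @@ to_tsquery('{config}', %s)"
    )


print(search_sql("japanese"))
```

Rows already stored in event_search.vector would still carry vectors built with the old configuration, so they would need to be regenerated separately; that step is not shown here.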

@RickCogley
Contributor Author

Ok, so this would basically involve making a customized synapse, correct?

Also - Are there any advanced search operators?

@RickCogley
Contributor Author

Additionally, does sqlite have any inherent advantages in terms of this sort of problem?

@RickCogley
Contributor Author

@erikjohnston a couple more comments and points -

Thinking of matrix.org as a distributed system, where rooms exist across multiple servers, getting people who use CJK languages (which have no spaces to delimit search tokens) to adopt it will be a challenge if search does not "just work". I am wondering whether this is practical, or whether it will mess up other servers if rooms with various languages in them are federated to servers that lack the special synapse or postgresql setup.

My use case is largely inviting my employees or clients to rooms on my homeserver and having them connect via vector or other clients to that homeserver, which would be set up with the appropriate settings and indexes.

For others doing more federation, it could be a challenge.

Just a thought.

@RickCogley
Contributor Author

Hi @erikjohnston, when you say "install Japanese into postgres" are you talking about the steps I linked above to do that?

Sincerely,
Rick

@erikjohnston
Member

> Hi @erikjohnston, when you say "install Japanese into postgres" are you talking about the steps I linked above to do that?

Those instructions probably work, but I haven't actually ever tried installing additional languages myself; I just know it's possible :) If you do manage it, it would be great if you could share some of the details!

@RickCogley
Contributor Author

Ok, of course.

Just got a server to run this locally, since VPSs are generally a bit space constrained for an organization expecting a lot of usage / busy rooms. Most of the entries will be in Japanese so I need to make the search work.

@dkastl

dkastl commented Nov 29, 2016

I have the same issue with Japanese full text search, and it's a serious problem. I have learned about PGroonga (http://pgroonga.github.io/), which looks like a good solution.

However, I'm not sure whether this would be a practical solution here, or where I would have to make changes to make it work. Ideally, PGroonga would be used when it is available.

@ara4n changed the title from "Japanese searching not quite working" to "Japanese, Chinese and other non-unicode searching not quite working" on Dec 23, 2016
@proletarius101

proletarius101 commented Jan 24, 2021

It shouldn't be very complex (you don't need solutions for C, J, and K respectively). A possible solution could be

And here is what Mattermost does, in a more complicated way: https://docs.mattermost.com/install/i18n.html

Another way is to go with database-wise solutions:

BTW, I use CJK languages, so I'm able to help.
P.P.S. it's still in the scope of Unicode for sure. It's just a non-Latin search problem

@squahtx added the A-Message-Search (Searching messages) and A-I18n labels on Sep 6, 2022
@DMRobertson added the S-Minor (Blocks non-critical functionality, workarounds exist.), T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues.) and O-Occasional (Affects or can be seen by some users regularly or most users rarely) labels on Sep 7, 2022
@BLumia

BLumia commented Dec 24, 2022

Can confirm we still have an issue searching CJK messages. I think we can remove "non-unicode" from the issue title: the text is Unicode, so this is really a non-Latin search problem.

@DMRobertson changed the title from "Japanese, Chinese and other non-unicode searching not quite working" to "Japanese, Chinese and other non-ASCII searching not quite working" on Jan 9, 2023
@luixxiul

luixxiul commented Mar 8, 2023

The issue can be reproduced with Arabic, Hebrew (with symbols), and Hindi too.

@panda2134

panda2134 commented Mar 31, 2023

This issue is still relevant. By the way, I don't think this is a minor issue, since the chat history searching functionality is essentially broken in every room using CJK languages. Many matrix rooms using CJK languages belong to the open-source community. Maybe I'm exaggerating, but having no way to search for historical messages makes matrix no better than mailing lists for CJK users, because even with a mailing list you can search with CJK characters (using grep).

@panda2134

panda2134 commented Mar 31, 2023

I think implementing the full-text search with Postgres is not suitable for non-Latin languages. We should use things like Apache Solr instead.

However, if we insist on using Postgres for full-text search, then at least we can replace calls to to_tsvector with vectors pre-calculated in Python, using multilingual tokenizing libraries. Database migration might be required in this case.
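
A minimal sketch of the pre-calculation idea above, assuming character bigrams as a stand-in for a real multilingual tokenizing library (the comment does not name one). The helpers `bigrams` and `to_tsvector_literal` are hypothetical; the output follows PostgreSQL's tsvector text form of quoted lexemes with positions, and how Synapse would store or query it is not covered.

```python
from __future__ import annotations

from collections import defaultdict


def bigrams(text: str) -> list[str]:
    """Overlapping two-character tokens, with whitespace removed first."""
    chars = "".join(text.split())
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


def to_tsvector_literal(text: str) -> str:
    """Render the bigrams as tsvector text: 'token':pos,pos ..."""
    positions: dict[str, list[int]] = defaultdict(list)
    for pos, token in enumerate(bigrams(text), start=1):
        positions[token].append(pos)
    parts = []
    for token in sorted(positions):
        # Inside a quoted lexeme, backslashes and single quotes are doubled.
        quoted = token.replace("\\", "\\\\").replace("'", "''")
        where = ",".join(str(p) for p in positions[token])
        parts.append(f"'{quoted}':{where}")
    return " ".join(parts)


message = "バッタと鈴虫"
indexed = set(bigrams(message))
print(to_tsvector_literal(message))
print(set(bigrams("鈴虫")) <= indexed)  # True: every query bigram occurs in the message
```

With this tokenization a query such as 鈴虫 matches regardless of where it appears in the message. Single-character queries would not match a pure bigram index, which is one reason a proper tokenizer may still be preferable.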

@bkil

bkil commented May 16, 2023

Could we elevate the occurrence from O-Occasional to a higher level, considering that it should be impacting the majority of the world population?
