Japanese, Chinese and other non-ASCII searching not quite working #901
Comments
I don't know much about this, but this seems to be an extension people use to get full-text search for Japanese: http://pgbigm.osdn.jp/pg_bigm_en-1-1.html |
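For readers unfamiliar with the idea behind a bigram index, here is a minimal Python sketch. It is not pg_bigm's actual implementation, and the function names (`bigrams`, `may_contain`, `search`) are made up for illustration. It only shows why indexing overlapping 2-character substrings lets a query match anywhere in text that has no spaces.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of overlapping 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def may_contain(document: str, query: str) -> bool:
    """Index-style candidate check: every bigram of the query occurs in the document."""
    return bigrams(query) <= bigrams(document)


def search(documents: list[str], query: str) -> list[str]:
    """Filter candidates by bigrams, then recheck with a real substring test."""
    return [d for d in documents if may_contain(d, query) and query in d]


if __name__ == "__main__":
    messages = ["バッタと鈴虫", "grasshopper and cricket"]
    print(search(messages, "鈴虫"))
    print(search(messages, "cricket"))
```

The substring recheck is there because sharing all bigrams does not guarantee the query appears contiguously; the bigram test only narrows the candidates.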
Someone on the mattermost forum mentions how to do it for postgresql. |
The idea is very much that the search API will accept a "locale" option in the future that handles searching more intelligently in different languages. Unfortunately, this really requires the server knowing which languages are going to be used up front so it can apply the correct indices.
Alas not, as it wouldn't be possible to create any indices that would allow it. |
hi @erikjohnston meanwhile, any way I can make it work for my situation - needing to index Japanese and English? --Rick |
I can't think of anything quick that will allow both Japanese and English; supporting more than one at a time will require a bit of dev work |
Japanese has English interspersed in many cases, so I wonder if just supporting Japanese would also get us the English support. Any good interim ideas on how to do just Japanese, then? --Rick |
Installing Japanese into postgres and then changing all instances of |
Ok, so this would basically involve making a customized synapse, correct? Also - Are there any advanced search operators? |
Additionally, does sqlite have any inherent advantages in terms of this sort of problem? |
@erikjohnston a couple more comments and points. Thinking of matrix.org as a distributed system, where rooms exist across multiple servers, getting people who use CJK languages (which have no spaces between words to delimit search tokens) to adopt it will be a challenge if search does not "just work". I am wondering whether this is practical, or whether it will mess up other servers if rooms with various languages in them are federated to servers that don't have the special synapse or postgresql setup. My use case is largely inviting my employees or clients to the rooms on my homeserver and getting them to connect via vector or other clients to that homeserver, which would be set up with the appropriate settings and indexes. For others doing more federation, it could be a challenge. Just a thought. |
Hi @erikjohnston, when you say "install Japanese into postgres" are you talking about the steps I linked above to do that? Sincerely, |
Those instructions probably work, but I haven't actually ever tried installing additional languages myself; I just know it's possible :) If you do manage it, it would be great if you shared some of the details! |
Ok, of course. Just got a server to run this locally, since VPSs are generally a bit space constrained for an organization expecting a lot of usage / busy rooms. Most of the entries will be in Japanese so I need to make the search work. |
I have the same issue with Japanese full-text search, and it's a serious problem. I have learned about PGroonga (http://pgroonga.github.io/), which looks like a good solution. However, I'm not sure whether it would be practical here, or where I would have to make changes to make it work. It would be best if PGroonga could be used when available. |
It shouldn't be very complex (you don't need solutions for C, J, and K respectively). A possible solution could be
Mattermost handles this in a more complicated way: https://docs.mattermost.com/install/i18n.html Another way is to go with database-level solutions:
BTW, I use CJK languages, so I'm able to help. |
Can confirm we still have an issue searching CJK messages. I think we can remove "non-unicode" from the issue title, since the text is Unicode; it's a non-Latin search problem. |
The issue can be reproduced on Arabic, Hebrew (with symbols), and Hindi too. |
This issue is still relevant. By the way, I don't think this is a minor issue, since the chat history searching functionality is essentially broken in every room using CJK languages. Many matrix rooms using CJK languages belong to the open-source community. Maybe I'm exaggerating, but having no way to search for historical messages makes matrix no better than mailing lists for CJK users, because even with a mailing list you can search with CJK characters (using |
I think implementing the full-text search with Postgres is not suitable for non-Latin languages. We should use things like Apache Solr instead. However, if we insist on using Postgres for full-text search, then at least we can replace calls to |
Could we elevate the occurrence from |
Hi - on a synapse homeserver running on postgresql 9.4, and connecting via vector, I'm having trouble searching Japanese.
If I enter:
バッタと鈴虫
(grasshopper and cricket)
I get a search hit when I search for バッタ, but not when I search for 鈴虫.
It appears that the beginning of a post is OK, but anything in the middle of the post is not searchable. I tried this on my own homeserver and on the main matrix homeserver, with the same result. Kent on the main forum also reproduced it.
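The behaviour above is consistent with a search that splits text on whitespace and matches queries against the start of each token. That is an assumption for illustration, not something confirmed in this thread, and the names `tokenize` and `prefix_search` are hypothetical. A minimal Python sketch under that assumption:

```python
def tokenize(text: str) -> list[str]:
    """Split on whitespace only, as a space-delimited-language tokenizer would."""
    return text.lower().split()


def prefix_search(message: str, query: str) -> bool:
    """Return True if any whitespace-delimited token starts with the query."""
    q = query.lower()
    return any(token.startswith(q) for token in tokenize(message))


if __name__ == "__main__":
    # The Japanese post contains no spaces, so it becomes a single token.
    print(tokenize("バッタと鈴虫"))
    print(prefix_search("バッタと鈴虫", "バッタ"))  # query is a prefix of the only token
    print(prefix_search("バッタと鈴虫", "鈴虫"))    # query sits in the middle of the token
    print(prefix_search("grasshopper and cricket", "cricket"))
```

Under this assumption, the English translation is searchable word by word because spaces delimit its tokens, while the Japanese post can only be found by its opening characters.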
Is there a possibility to search using regex?