Multilingual Search with Subword TF-IDF #38

Open
artitw opened this issue Oct 1, 2022 · 1 comment

artitw commented Oct 1, 2022

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support is incorporated inherently, as part of training the subword tokenization model. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing.
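
To make the idea concrete, here is a minimal sketch of TF-IDF retrieval over subword units. It is not the implementation from the paper: it assumes overlapping character trigrams as a stand-in for a trained subword tokenization model, uses a smoothed IDF, and ranks by cosine similarity. The names `subword_tokens`, `build_index`, and `search` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in subword tokenizer (assumption): overlapping character n-grams.

    A trained subword model would replace this function in practice.
    """
    text = " ".join(text.lower().split())
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs, n=3):
    """Return (idf, vectors): IDF weights and unit-length TF-IDF vectors."""
    counts_per_doc = [Counter(subword_tokens(d, n)) for d in docs]
    df = Counter()
    for counts in counts_per_doc:
        df.update(counts.keys())  # document frequency: one count per document
    num_docs = len(docs)
    # Smoothed IDF (an assumption of this sketch, not taken from the paper).
    idf = {t: math.log((1 + num_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        normalize({t: tf * idf[t] for t, tf in counts.items()})
        for counts in counts_per_doc
    ]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Rank documents by cosine similarity to the query; best match first."""
    counts = Counter(subword_tokens(query, n))
    # Subwords never seen in the corpus carry no weight.
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [sum(w * dv.get(t, 0.0) for t, w in q.items()) for dv in vectors]
    ranking = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


docs = [
    "The cat sat on the mat.",
    "El gato duerme en la cama.",
    "Die Katze schläft auf dem Sofa.",
]
idf, vectors = build_index(docs)
ranking, scores = search("gato durmiendo", idf, vectors)
print(ranking[0], docs[ranking[0]])
```

Note that no stop-word list or stemmer appears anywhere: the query "gato durmiendo" matches the Spanish document only through shared subword fragments, which is the property the issue describes. Character trigrams are a crude approximation, so a learned subword vocabulary would likely behave differently.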


artitw commented Oct 1, 2022

Would be great to get feedback on the work so far.

Paper: https://arxiv.org/abs/2209.14281
Demo: https://colab.research.google.com/drive/1RaWj5SqWvyC2SsCTGg8IAVcl9G5hOB50?usp=sharing

@artitw artitw self-assigned this Dec 25, 2022