Multilingual Search with Subword TF-IDF #38

artitw · 2022-10-01T20:37:07Z

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be built in as part of training the subword tokenization model. An evaluation on XQuAD demonstrates the advantages of STF-IDF: information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing.
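
To make the idea concrete, here is a minimal sketch of TF-IDF retrieval over subword units. It is not the implementation from the paper: overlapping character trigrams stand in for a trained subword tokenization model, and the smoothed IDF formula, the helper names (`subword_tokens`, `build_index`, `search`), and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Assumed stand-in tokenizer: overlapping character n-grams.

    A trained subword model would normally be used here instead.
    """
    s = "_" + "_".join(text.lower().split()) + "_"
    if len(s) < n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(docs, n=3):
    """Return (idf, vectors): IDF weights and unit-length TF-IDF doc vectors."""
    counts_per_doc = [Counter(subword_tokens(d, n)) for d in docs]
    df = Counter()
    for counts in counts_per_doc:
        df.update(counts.keys())  # document frequency: one count per doc
    num_docs = len(docs)
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {t: math.log((1 + num_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        normalize({t: tf * idf[t] for t, tf in counts.items()})
        for counts in counts_per_doc
    ]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Rank documents by cosine similarity to the query; return (ranking, scores)."""
    counts = Counter(subword_tokens(query, n))
    # Subwords never seen in the corpus are ignored.
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [sum(w * doc.get(t, 0.0) for t, w in q.items()) for doc in vectors]
    ranking = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Illustrative documents in three languages (not taken from XQuAD).
docs = [
    "The cat sleeps on the warm sofa.",
    "El mercado de valores cayó bruscamente hoy.",
    "Der Zug nach Berlin fährt um acht Uhr ab.",
]
idf, vectors = build_index(docs)
ranking, scores = search("cats sleeping", idf, vectors)
print(docs[ranking[0]])
```

Note that the query "cats sleeping" shares no whole word with the first document, yet it still matches through shared subwords such as `cat` and `sle`, with no stemming rules or stop-word list involved. The same index handles all three languages without any language-specific preprocessing.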

artitw · 2022-10-01T20:59:54Z

Would be great to get feedback on the work so far.

Paper: https://arxiv.org/abs/2209.14281
Demo: https://colab.research.google.com/drive/1RaWj5SqWvyC2SsCTGg8IAVcl9G5hOB50?usp=sharing

artitw self-assigned this Dec 25, 2022
