SALTIK: An Indonesian Non-Word Error Spelling Correction Dataset

ir-nlp-csui/saltik

Summary

SALTIK is a dataset for benchmarking the accuracy of non-word error correction methods on Indonesian words. It consists of 58,532 non-word errors generated from 3,000 of the most popular Indonesian words.
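As a rough illustration of what benchmarking accuracy on such a dataset can look like, the sketch below computes top-1 accuracy for a spelling corrector over (non-word error, intended word) pairs. This README does not describe the dataset's file format, so the pair layout, the function names `top1_accuracy` and `toy_corrector`, and the three example pairs are all assumptions made for illustration. They are not taken from SALTIK or from the authors' evaluation code.

```python
from typing import Callable, Iterable, Tuple


def top1_accuracy(
    pairs: Iterable[Tuple[str, str]],
    correct: Callable[[str], str],
) -> float:
    """Return the fraction of errors whose top suggestion equals the target.

    `pairs` holds (non_word_error, intended_word) tuples. This layout is an
    assumption, not the documented SALTIK format.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no evaluation pairs supplied")
    hits = sum(1 for error, target in pairs if correct(error) == target)
    return hits / len(pairs)


# Hypothetical pairs, invented for this example (not drawn from the dataset).
toy_pairs = [("mkan", "makan"), ("rumha", "rumah"), ("bukuu", "buku")]

# Stand-in corrector: a fixed lookup that leaves unknown inputs unchanged.
toy_lookup = {"mkan": "makan", "rumha": "rumah"}


def toy_corrector(word: str) -> str:
    return toy_lookup.get(word, word)


accuracy = top1_accuracy(toy_pairs, toy_corrector)
print(f"top-1 accuracy: {accuracy:.3f}")
```

In practice, `toy_corrector` would be replaced by the method under test, and the pairs would be loaded from the dataset files in whatever format they actually use.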

Dataset Split

No split.

Changelog

  • 2023-09-01 v1.0
    • Initial dataset

Acknowledgments

  • SALTIK v1.0 was built by Hanif Arkan Audah for his undergraduate thesis at the Faculty of Computer Science, Universitas Indonesia, in 2023.

References

Please cite the following paper if you use this dataset in your project or publication (status: accepted):

@inproceedings{audah2023,
author = {Audah, Hanif Arkan and Yuliawati, Arlisa and Alfina, Ika},
title = {{A Comparison Between SymSpell and a Combination of Damerau-Levenshtein Distance With the Trie Data Structure}},
booktitle = {Proceedings of the ICAICTA 2023},
month = {October},
year = {2023},
address = {Lombok, Indonesia},
publisher = {IEEE},
keywords = {spell checker,non-word error,isolated-word error correction,symspell,edit distance,damerau-levenshtein}
}

Licence

You can use this dataset for free. You don't need our permission to use it. Please cite our paper if your work uses our data in your publication. Please note that you are not allowed to create a copy of this dataset and share it publicly in your own repository without our permission.

Contact

ika.alfina [at] cs.ui.ac.id
