Text Normalization / Preprocessing Module

This module normalizes (preprocesses) text from social media. The sample data in this project are tweets in Indonesian, but the module can be adapted to other languages. The project is built with Python 3.4.

What is inside

Line break (newline, '\n') normalization
Lowercase normalization ('Makan' -> 'makan')
Repeated dot normalization (common in social media data, e.g. 'lets eat yeah.....')
Link or URL normalization (removes 'http://blabla' or 'https://blabla')
Repeated character normalization ('lets eat yeah ::)' -> 'lets eat yeah :)')
Ellipsis normalization (removes '…'; if you are using a JSON dataset that contains Unicode characters, use: 'text = "".join([x for x in text if ord(x)<128])', which drops every non-ASCII character)
Tokenization (sentence and word)
Spelling check (the correct spellings are in /resources/spellcheck.txt; change the data in it for a better checker or for another language)
Repeated words that carry meaning ('malam malam' -> 'malam-malam'; this step is optional, remove it if your case does not need it)
Emoticon normalization (': - )' -> ':-)')
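
The steps listed above can be sketched with a few regular expressions. This is a minimal illustration, not the code in normalisasitext.py: the function name `normalize`, its parameters, the order of the steps, and the choice to collapse repeated dots into a single dot are all assumptions, and the `spelling` argument is a hypothetical in-memory stand-in for the data in /resources/spellcheck.txt, whose file format is not described here.

```python
import re

# All patterns below are illustrative assumptions, not the project's own rules.
URL_RE = re.compile(r"https?://\S+")                   # http:// and https:// links
DOTS_RE = re.compile(r"\.{2,}")                        # two or more dots in a row
REPEATED_PUNCT_RE = re.compile(r"([^\w\s.])\1+")       # e.g. '::' or '))'
EMOTICON_RE = re.compile(r"([:;=])\s*-\s*([()DPdp])")  # e.g. ': - )'
REPEATED_WORD_RE = re.compile(r"\b(\w+)\s+\1\b")       # e.g. 'malam malam'
WHITESPACE_RE = re.compile(r"\s+")                     # spaces, tabs, line breaks


def normalize(text, spelling=None, join_repeated_words=False):
    """Apply the normalization steps in one pass (hypothetical helper)."""
    text = URL_RE.sub("", text)                # remove links
    text = text.replace("\u2026", "")          # remove the ellipsis character
    text = text.lower()                        # 'Makan' -> 'makan'
    text = DOTS_RE.sub(".", text)              # 'yeah.....' -> 'yeah.'
    text = REPEATED_PUNCT_RE.sub(r"\1", text)  # '::)' -> ':)'
    text = EMOTICON_RE.sub(r"\1-\2", text)     # ': - )' -> ':-)'
    text = WHITESPACE_RE.sub(" ", text).strip()  # line breaks become one space
    if spelling:
        # Replace each whitespace-separated token found in the mapping.
        text = " ".join(spelling.get(tok, tok) for tok in text.split(" "))
    if join_repeated_words:
        text = REPEATED_WORD_RE.sub(r"\1-\1", text)  # 'malam malam' -> 'malam-malam'
    return text


if __name__ == "__main__":
    print(normalize("Lets eat yeah ::)\nhttp://blabla"))
    print(normalize("malam malam", join_repeated_words=True))
    print(normalize("gk mau", spelling={"gk": "tidak"}))  # hypothetical entry
```

Links are removed before repeated punctuation is collapsed so that the '://' inside a URL is never touched by the later steps. Lowercasing also affects emoticons such as ':D', so move that step if case matters for your data.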

In addition, stemming for Indonesian can be done with the Sastrawi package (https://pypi.python.org/pypi/Sastrawi/1.0.1).
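
A stemming step could be added along these lines. This is a sketch that assumes Sastrawi is installed (for example with `pip install Sastrawi`); the helper name `stem_text` and the fallback of returning the text unchanged when the package is missing are assumptions made for illustration, not part of this project.

```python
try:
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
except ImportError:  # Sastrawi is optional in this sketch
    StemmerFactory = None

_stemmer = None  # created once, because building the stemmer is slow


def stem_text(text):
    """Stem Indonesian text with Sastrawi (hypothetical helper).

    Returns the text unchanged when Sastrawi is not installed.
    """
    global _stemmer
    if StemmerFactory is None:
        return text
    if _stemmer is None:
        _stemmer = StemmerFactory().create_stemmer()
    return _stemmer.stem(text)


if __name__ == "__main__":
    print(stem_text("makanan"))
```

Stemming would normally run after the normalization steps, so the stemmer sees lowercase text without links or stray punctuation.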

Have fun.
