
Text normalization or preprocessing for social media data


yasirutomo/text-normalization


Text Normalization / Preprocessing Module

This is a text normalization (preprocessing) module for social media data. The sample texts in this project are tweets in Indonesian; however, it can also be used to process other languages. The project was built with Python 3.4.

What is inside

  • Enter / new line ('\n') normalization
  • Lowercase normalization ('Makan' -> 'makan')
  • Repeated dot normalization (common in social media data, e.g. 'lets eat yeah.....')
  • Link or URL normalization (erase the 'http://blabla' or 'https://blabla')
  • Repeated character normalization ('lets eat yeah ::)' -> 'lets eat yeah :)')
  • Ellipsis normalization (erase the '…'; if you are using a JSON dataset that contains Unicode characters, use: 'text = "".join([x for x in text if ord(x)<128])')
  • Tokenization (sentence and word)
  • Spelling check (the correct spellings are in /resources/spellcheck.txt; edit the data in it for a better checker or for other languages)
  • Repeated words that carry meaning ('malam malam' -> 'malam-malam'; this step is optional, remove it if your use case does not need it)
  • Emoticon normalization (': - )' -> ':-)')
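
The steps listed above could be chained roughly as in the sketch below. This is not the project's actual code: the function names (`normalize`, `tokenize`), the regular expressions, the choice to collapse repeated dots to a single dot, and the in-memory `spelling` dictionary (standing in for /resources/spellcheck.txt, whose file format is not described here) are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; the project's real rules may differ.
URL_RE = re.compile(r"https?://\S+")
REPEATED_PUNCT_RE = re.compile(r"([^\w\s])\1+")           # '.....' or '::'
SPACED_EMOTICON_RE = re.compile(r"([:;])\s*-\s*([()])")   # ': - )'
REPEATED_WORD_RE = re.compile(r"\b(\w+)\s+\1\b")          # 'malam malam'
TOKEN_RE = re.compile(r"[:;]-?[()]|\w+(?:-\w+)*|[^\w\s]")


def normalize(text, join_repeated_words=False):
    """Apply the normalization steps in the order they are listed above."""
    text = text.replace("\n", " ")                  # newline normalization
    text = text.lower()                             # 'Makan' -> 'makan'
    text = URL_RE.sub("", text)                     # drop http:// and https:// links
    text = text.replace("\u2026", "")               # drop the ellipsis character
    text = REPEATED_PUNCT_RE.sub(r"\1", text)       # '.....' -> '.', '::)' -> ':)'
    text = SPACED_EMOTICON_RE.sub(r"\1-\2", text)   # ': - )' -> ':-)'
    if join_repeated_words:                         # optional step
        text = REPEATED_WORD_RE.sub(r"\1-\1", text)  # 'malam malam' -> 'malam-malam'
    return " ".join(text.split())                   # tidy leftover whitespace


def tokenize(text, spelling=None):
    """Split into word, emoticon and punctuation tokens, then fix spelling.

    `spelling` is a hypothetical {misspelling: correction} dictionary.
    """
    tokens = TOKEN_RE.findall(text)
    if spelling:
        tokens = [spelling.get(tok, tok) for tok in tokens]
    return tokens


sample = "Lets eat yeah..... ::)\nMalam malam : - ) http://blabla"
print(normalize(sample, join_repeated_words=True))
# prints: lets eat yeah. :) malam-malam :-)
```

The repeated-word step is kept behind a flag because it also joins accidental duplicates (for example 'the the'), which matches the note above that it should be dropped when the use case does not need it.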

In addition, stemming for Indonesian can be done using the Sastrawi package (https://pypi.python.org/pypi/Sastrawi/1.0.1).

Have fun.
