Usage Guide
Neattext can be used through either an object-oriented approach or a functional (method-oriented) approach.
Neattext comes with 4 main classes for cleaning and preprocessing text. These classes are:
TextCleaner: For cleaning text by removing or replacing specific noise, e.g. emails, special characters, numbers, URLs, emojis
TextFrame: A frame-like object that offers a simple API for text preprocessing; it inherits all the features of TextCleaner and adds more
TextExtractor: For extracting certain terms from a text or document
TextMetrics: For checking basic word statistics or metrics such as the counts of vowels, consonants, stopwords, etc.
>>> from neattext import TextCleaner,TextExtractor,TextMetrics
>>> docx = TextCleaner()
>>> docx.text = "your text goes here"
>>> docx.clean_text()
If you are a fan of functions, you can also use neattext through the functions sub-package.
In that case, import what you need like this:
>>> from neattext.functions import remove_emails,remove_emojis,clean_text
You can also import the sub-package under an alias.
>>> import neattext.functions as ntf
>>> ntf.remove_emails(your_text)
>>>
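The functional style boils down to plain functions that take a string and return a new string, so calls can be nested or chained freely. The following is a minimal standard-library sketch of that idea, not neattext's implementation: the `_sketch` function names and both regular expressions are assumptions made for illustration.

```python
import re

# Assumption: deliberately simple patterns, not the ones neattext uses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_PATTERN = re.compile(r"https?://\S+")


def remove_emails_sketch(text: str) -> str:
    """Return `text` with every substring matching EMAIL_PATTERN removed."""
    return EMAIL_PATTERN.sub("", text)


def remove_urls_sketch(text: str) -> str:
    """Return `text` with every substring matching URL_PATTERN removed."""
    return URL_PATTERN.sub("", text)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."

# Each function returns a new string, so the calls compose by nesting.
cleaned = remove_urls_sketch(remove_emails_sketch(sample))
print(cleaned)
```

Removing a match leaves the surrounding spaces in place, so a separate whitespace-collapsing step is usually applied afterwards.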
- Preprocess texts and clean text
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>> docx = nt.TextFrame(mytext)
>>> docx.describe()
Key Value
Length : 73
vowels : 21
consonants: 34
stopwords: 4
punctuations: 8
special_char: 8
tokens(whitespace): 10
tokens(words): 14
>>>
>>> docx.head(16)
'This is the mail'
>>> docx.tail(16)
'//example.com π.'
>>>
>>> docx.normalize()
'this is the mail example@gmail.com ,our website is https://example.com π.'
>>> docx.normalize(level='deep')
'this is the mail examplegmailcom our website is httpsexamplecom '
>>> docx.remove_emojis()
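The two normalize outputs shown above suggest that the default level only lower-cases the text, while the deep level also strips punctuation and symbols. Here is a rough standard-library sketch of that behaviour. The function name, the `"shallow"` label for the default level, and the exact character class are assumptions, not the library's code.

```python
import re


def normalize_sketch(text: str, level: str = "shallow") -> str:
    """Lower-case `text`; at level='deep' also drop everything except a-z, 0-9 and spaces."""
    lowered = text.lower()
    if level == "deep":
        # Assumption: "deep" strips every character that is not an ASCII
        # letter, a digit or a space (so '@', '.', ':' and '/' all disappear).
        return re.sub(r"[^a-z0-9 ]", "", lowered)
    return lowered


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
print(normalize_sketch(sample))
print(normalize_sketch(sample, level="deep"))
```

Note that deep normalization fuses the pieces of emails and URLs into single tokens, so extract those first if you need them.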
You can also do some basic natural language preprocessing tasks such as tokenization, n-grams, text generation, etc.
>>> docx.word_tokens()
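To make tokenization and n-grams concrete, here is a small standard-library sketch. It assumes plain whitespace tokenization and is not the code behind `word_tokens()`; both function names are hypothetical.

```python
from typing import List, Tuple


def word_tokens_sketch(text: str) -> List[str]:
    """Split on whitespace; punctuation stays attached to its word."""
    return text.split()


def ngrams_sketch(tokens: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of `n` consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = word_tokens_sketch("This is the mail")
print(tokens)
print(ngrams_sketch(tokens, 2))
```

When `n` is larger than the number of tokens, the sketch returns an empty list instead of raising an error.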
- Clean text by removing emails, numbers, stopwords, emojis, etc.
- clean_text is a simple function for cleaning text: you specify with True/False what to remove from a text.
>>> from neattext.functions import clean_text
>>>
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>>
>>> clean_text(mytext)
'mail example@gmail.com ,our website https://example.com .'
- You can remove punctuations, stopwords, urls, emojis, multiple_whitespaces, etc. by setting the corresponding argument to True.
- You can choose whether or not to remove punctuations by setting puncts to True or False respectively
>>> clean_text(mytext,puncts=True)
'mail example@gmailcom website https://examplecom '
>>>
>>> clean_text(mytext,puncts=False)
'mail example@gmail.com ,our website https://example.com .'
>>>
>>> clean_text(mytext,puncts=False,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
- You can remove other unwanted items in the same way
>>> clean_text(mytext,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
>>> clean_text(mytext,urls=False)
'mail example@gmail.com ,our website https://example.com .'
>>>
>>> clean_text(mytext,urls=True)
'mail example@gmail.com ,our website .'
>>>
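The flag-driven design above can be pictured as a function that lower-cases the text and then applies one cleaning step per flag that is set to True. The sketch below uses only the standard library. The stopword list, the patterns, the defaults and the function name are assumptions chosen for illustration, and it is not neattext's `clean_text`.

```python
import re

# Assumptions: a tiny stopword list and simple patterns, for illustration only.
STOPWORDS = {"this", "is", "the", "our", "a", "an", "and", "of"}
URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_PATTERN = re.compile(r"[.,!?]")


def clean_text_sketch(text: str, puncts: bool = False, stopwords: bool = True,
                      urls: bool = False) -> str:
    """Lower-case `text`, then apply each cleaning step whose flag is True."""
    result = text.lower()
    if urls:
        result = URL_PATTERN.sub("", result)
    if puncts:
        result = PUNCT_PATTERN.sub("", result)
    words = result.split()
    if stopwords:
        # Only exact matches are dropped, so ",our" survives unless
        # punctuation was stripped first.
        words = [w for w in words if w not in STOPWORDS]
    # Re-joining the words collapses runs of whitespace as a side effect.
    return " ".join(words)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
print(clean_text_sketch(sample))
print(clean_text_sketch(sample, urls=True))
print(clean_text_sketch(sample, puncts=True))
```

The order of the steps matters: URLs are removed before punctuation so that the URL pattern still sees the dots and slashes it relies on.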
- By default (most_common=True), remove_puncts removes only the most common punctuation marks, such as full stops, commas, exclamation marks and question marks.
- Alternatively, set most_common=False to remove all known punctuation marks from a text.
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_puncts()
TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom π Please dont forget the email when you enter ")
>>> docx.remove_puncts(most_common=False)
TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom π Please dont forget the email when you enter ")
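The difference between the two modes can be sketched with `str.translate` from the standard library. This is an illustration only: the four-character "most common" set is an assumption based on the description above (the library output above also drops the apostrophe, so its set is evidently larger), and `string.punctuation` stands in for "all known punctuations".

```python
import string

# Assumption: "most common" means exactly these four marks; the library's set may differ.
MOST_COMMON_PUNCTS = ".,!?"


def remove_puncts_sketch(text: str, most_common: bool = True) -> str:
    """Delete the chosen punctuation characters from `text`."""
    targets = MOST_COMMON_PUNCTS if most_common else string.punctuation
    # str.maketrans("", "", targets) builds a table that deletes each target character.
    return text.translate(str.maketrans("", "", targets))


sample = "Mail example@gmail.com, don't forget!!!"
print(remove_puncts_sketch(sample))
print(remove_puncts_sketch(sample, most_common=False))
```

With the narrower set, characters such as `@`, `:` and `/` survive, which keeps emails and URLs partly recognisable.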
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>> docx.remove_emails()
'This is the mail ,our WEBSITE is https://example.com π.'
>>>
>>> docx.remove_stopwords()
'This mail example@gmail.com ,our WEBSITE https://example.com π.'
>>>
>>> docx.remove_numbers()
>>> docx.remove_phone_numbers()
>>> docx.remove_special_characters()
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>> docx.remove_emojis()
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()
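Replacing differs from removing in that each match is swapped for a placeholder token, which keeps the sentence structure intact. A minimal standard-library sketch follows. The placeholder strings, the `replace_with` parameter, the patterns and the function names are all assumptions, not the library's API.

```python
import re

# Assumption: simple illustrative patterns, not the ones neattext uses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NUMBER_PATTERN = re.compile(r"\d+")


def replace_emails_sketch(text: str, replace_with: str = "<EMAIL>") -> str:
    """Swap every email-like substring for the placeholder `replace_with`."""
    return EMAIL_PATTERN.sub(replace_with, text)


def replace_numbers_sketch(text: str, replace_with: str = "<NUMBER>") -> str:
    """Swap every run of digits for the placeholder `replace_with`."""
    return NUMBER_PATTERN.sub(replace_with, text)


print(replace_emails_sketch("This is the mail example@gmail.com"))
print(replace_numbers_sketch("Call 555 or 42"))
```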
- To extract emails, phone numbers, numbers, URLs and emojis from text
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>> docx.extract_emails()
['example@gmail.com']
>>>
>>> docx.extract_emojis()
['π']
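Extraction is the mirror image of removal: instead of deleting matches, it returns them as a list. The following standard-library sketch shows the idea with `re.findall`. The patterns and function names are assumptions for illustration and may match differently from TextExtractor.

```python
import re
from typing import List

# Assumption: simple illustrative patterns, not the ones neattext uses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_PATTERN = re.compile(r"https?://\S+")


def extract_emails_sketch(text: str) -> List[str]:
    """Return all email-like substrings in order of appearance."""
    return EMAIL_PATTERN.findall(text)


def extract_urls_sketch(text: str) -> List[str]:
    """Return all http(s) URLs in order of appearance."""
    return URL_PATTERN.findall(text)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
print(extract_emails_sketch(sample))
print(extract_urls_sketch(sample))
```

Because neither pattern contains capture groups, `findall` returns the full matched strings.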
- To find word statistics such as counts of vowels, consonants and stopwords, plus overall word stats
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com π."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
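Counts like these reduce to a single pass over the characters of the text. Below is a small standard-library sketch. It assumes that only the ASCII letters a to z are counted and that "y" is a consonant; the library's own rules may differ, so the numbers need not match TextMetrics.

```python
VOWELS = set("aeiou")


def count_vowels_sketch(text: str) -> int:
    """Count the characters a, e, i, o, u, ignoring case."""
    return sum(1 for ch in text.lower() if ch in VOWELS)


def count_consonants_sketch(text: str) -> int:
    """Count ASCII letters that are not vowels, ignoring case."""
    return sum(1 for ch in text.lower() if "a" <= ch <= "z" and ch not in VOWELS)


sample = "This is the mail"
print(count_vowels_sketch(sample))
print(count_consonants_sketch(sample))
```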
- The MOP (method/function-oriented) way
>>> from neattext.functions import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,True)
'this is the mail <email> ,our website is <url> .'
>>> extract_emails(t1)
['example@gmail.com']
NeatText Library @JCharisTech