!!! This project is DEPRECATED. See 2nd edition at: SimpleChinese2.
Chinese text processing, representation, and visualization.
This package integrates many basic Chinese NLP functions, making Python-based Chinese word processing and information extraction simple and convenient.
- Free software: MIT license
- Documentation: https://simplechinese.readthedocs.io
To install SimpleChinese, run this command in your terminal:
$ pip install simplechinese
This is the preferred method to install SimpleChinese, as it will always install the most recent stable release.
If you don't have pip installed, the Python installation guide can walk you through the process.
The sources for SimpleChinese can be downloaded from the `GitHub repo`_.
You can either clone the public repository:
$ git clone https://github.com/chenmingxiang110/simplechinese
Or download the `tarball`_:
$ curl -OJL https://github.com/chenmingxiang110/simplechinese/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
- Read the data from a CSV file.
import pandas as pd
import simplechinese as sc
df = pd.read_csv("test.csv")
- Clean the data.
sc.clean(df)
The clean function applies the following steps:
fillna(): Fill the missing values (N/A) in a pandas.DataFrame with empty strings.
toLower(): Convert alphabetic characters to lowercase.
remove_punctuations(): Remove all punctuation from a string or a pandas.DataFrame.
remove_space(): Remove all spaces from a string or a pandas.DataFrame.
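To make the four cleaning steps concrete, here is a minimal standard-library sketch of the same pipeline applied to plain strings. It is not the SimpleChinese implementation: the helper names (`fill_na`, `to_lower`, `remove_punctuation`, `remove_space`, `clean_text`) are hypothetical, and the choice to treat every Unicode "P" category character as punctuation is an assumption.

```python
import unicodedata


def fill_na(value):
    """Replace a missing value (None or NaN) with an empty string."""
    # NaN is the only value that is not equal to itself.
    if value is None or value != value:
        return ""
    return str(value)


def to_lower(text):
    """Lowercase alphabetic characters; CJK characters are unaffected."""
    return text.lower()


def remove_punctuation(text):
    """Drop ASCII and CJK punctuation (Unicode categories starting with 'P')."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def remove_space(text):
    """Drop all whitespace, including the ideographic space (U+3000)."""
    return "".join(text.split())


def clean_text(value):
    """Apply the four steps in the order listed above (assumed order)."""
    text = fill_na(value)
    text = to_lower(text)
    text = remove_punctuation(text)
    return remove_space(text)


samples = ["Hello, 世界！", None, "自然 语言　处理。"]
cleaned = [clean_text(v) for v in samples]
print(cleaned)
```

The library's own `sc.clean` also accepts a pandas.DataFrame; this sketch only covers the single-string case.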
- Extract words from the data.
sc.extract_words(sc.clean(df))
- Vectorize the text (TF-IDF followed by PCA).
sc.pca(sc.tfidf(sc.clean(df).iloc[:,0]))
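For readers unfamiliar with TF-IDF, the sketch below shows the textbook weighting (term frequency times log of inverse document frequency) using only the standard library. It is an illustration under assumptions, not what `sc.tfidf` does internally: the function names are hypothetical, character bigrams stand in for real Chinese word segmentation, no smoothing is applied, and the PCA step is omitted.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split a string into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def tfidf(documents):
    """Return one {token: weight} dict per document."""
    tokenized = [char_bigrams(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each token.
    doc_freq = Counter(
        token for tokens in tokenized for token in set(tokens)
    )
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values())
        vectors.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return vectors


docs = ["我爱北京", "我爱上海", "北京很大"]
weights = tfidf(docs)
for doc, vec in zip(docs, weights):
    print(doc, vec)
```

Tokens that appear in fewer documents receive larger weights, which is why a bigram unique to one document outranks one shared by two.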
- Generate a word cloud.
sc.wordcloud(sc.clean(df).iloc[:,0], font_path="yahei.ttc")
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.