TF-IDF and Sigma analysis tools written in Python that output results to convenient *.xlsx spreadsheets for detailed analysis.
TF-IDF analysis detects the most "important" words in a given text from a text corpus (a set of articles, etc.). These "important" words are those that occur frequently in a particular document but rarely in the other documents of the same corpus.
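The idea can be illustrated with a minimal sketch. It assumes the textbook weighting (term frequency multiplied by the logarithm of inverse document frequency); the function name `tf_idf` and the toy corpus are hypothetical, and the weighting used by `tf-idf.py` in this repository may differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: score} dict per document.

    corpus: list of documents, each given as a list of tokens.
    Assumes the plain tf * log(N / df) weighting (an assumption,
    not necessarily what tf-idf.py does).
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
ranks = tf_idf(docs)
top_word = max(ranks[0], key=ranks[0].get)
print(top_word)
```

A word that appears in every document (here "the") gets a score of zero, while a word unique to one document ranks highest there.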
While TF-IDF analysis is useful for a set of articles, Sigma analysis is useful for finding the most "important" words in a single, usually large, text (a book, a long document, etc.).
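This README does not spell out the Sigma formula. One common formulation of spacing-based keyword detection scores a word by the standard deviation of the gaps between its successive occurrences, divided by the mean gap: words that cluster in certain parts of the text score high, while evenly spread function words score low. The sketch below assumes that formulation; the function name `sigma_scores`, the `min_count` cutoff, and the synthetic token list are all hypothetical and may not match `sigma.py`.

```python
import statistics


def sigma_scores(tokens, min_count=3):
    """Score words by how unevenly they are spread through the text.

    Assumed formulation: population standard deviation of the gaps
    between successive occurrences, divided by the mean gap.
    Words seen fewer than min_count times are skipped.
    """
    # Collect the positions at which each word occurs.
    positions = {}
    for index, word in enumerate(tokens):
        positions.setdefault(word, []).append(index)

    scores = {}
    for word, pos in positions.items():
        if len(pos) < min_count:
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        scores[word] = statistics.pstdev(gaps) / statistics.mean(gaps)
    return scores


# Synthetic text: "the" is evenly spaced, "ring" is clustered.
tokens = []
for i in range(40):
    tokens.append("the")
    tokens.append("ring" if i in (3, 4, 5, 30) else f"w{i}")

scores = sigma_scores(tokens)
print(sorted(scores, key=scores.get, reverse=True))
```

In this synthetic example the clustered word outranks the evenly spaced one, which is the behavior the method relies on for long texts.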
There are a couple of more advanced scripts:
- Matrix output for Gephi in `gephi.py`. Sample output file is `gephi.csv` in this repository.
- Horizontal visibility graph building with `hor-vis-graph.py`. A couple of sample files are included in the `hor-vis-graph/` directory.
- Other experiments (see below)
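For readers unfamiliar with horizontal visibility graphs, here is a minimal sketch of the standard construction: two points of a numeric series are connected when every value strictly between them is lower than both. How `hor-vis-graph.py` turns a text into a numeric series is not stated here, so the input list and the function name `horizontal_visibility_edges` are hypothetical.

```python
def horizontal_visibility_edges(series):
    """Return edges (i, j), i < j, of the horizontal visibility graph.

    Points i and j are linked when all values between them are
    strictly smaller than min(series[i], series[j]).
    """
    edges = []
    n = len(series)
    for i in range(n - 1):
        highest_between = float("-inf")
        for j in range(i + 1, n):
            if highest_between < min(series[i], series[j]):
                edges.append((i, j))
            highest_between = max(highest_between, series[j])
            # Once a value at least as high as series[i] is passed,
            # nothing further to the right can be seen from i.
            if highest_between >= series[i]:
                break
    return edges


# Hypothetical series, e.g. one numeric weight per word position.
series = [3, 1, 2, 4]
edges = horizontal_visibility_edges(series)
print(edges)
```

The resulting edge list can be written out as a CSV or matrix for visualization in Gephi.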
Experimental semantic network builder (main concepts from this article):
TF-IDF applied to a text corpus of news articles:
Sigma method applied to the book "The Hunger Games":
Analysis of an article about Putin using a horizontal visibility graph and a text corpus of other articles:
- Install Python 3, clone the repository, and enter the repository directory with `cd edu-tf-idf`.
- Install required dependencies: `pip3 install -r requirements.txt`.
- Place texts to analyze in the `/texts` directory (there are a couple already).
- Run the analyzer with the `py tf-idf.py` command (there are many!).
py tf-idf.py
Result:
Reading texts...
Done! Computing TF-IDF ranks...
Progressing text 2225/2225
Done! Writing results...
Writing worksheet 2225/2225
Done!
Output goes to the `tf-idf.xlsx` file, ready for analysis.
py sigma.py
Result goes to the `sigma.xlsx` file.
py hor-vis-graph.py
Check the result in the `hor-vis-graph/` directory and visualize it using Gephi.
Run the experimental semantic network builder with:
py analyze_text.py texts/news/tech/001.txt
Check the result in the `analyzed/<text-title>` directory and visualize it using Gephi.