Releases: minasmz/Persian-Summarization
v1.0.0
Topic Clustering Using Doc2Vec
By running this code on an arbitrary input text file (input.txt) you can cluster document paragraphs by their topics and then return a summary of each cluster.
To do this, train a doc2vec model on the paragraphs of a training set and place it in the project under the name my_model_parags_from_wikiAggregate.doc2vec. The code then obtains a vector for each input paragraph and calculates the cosine similarity between each pair of consecutive paragraphs; if the similarity is higher than a calculated threshold, the two paragraphs are assigned to the same cluster. Each cluster is then summarized separately, so the summary does not miss the important topics of the input text.
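The consecutive-paragraph clustering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the vectors are toy values standing in for doc2vec output, and the threshold is passed in by hand because the release notes do not say how it is calculated.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_consecutive(vectors, threshold):
    """Group paragraph indices: a paragraph joins the current cluster when it is
    similar enough to the paragraph right before it, otherwise it starts a new one."""
    if not vectors:
        return []
    clusters = [[0]]
    for i in range(1, len(vectors)):
        if cosine_similarity(vectors[i - 1], vectors[i]) > threshold:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


# Toy vectors (assumption): in the real script these would come from the doc2vec model.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(cluster_consecutive(vectors, threshold=0.8))  # [[0, 1], [2, 3]]
```

Each inner list holds the indices of paragraphs that would be summarized together.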
Keyword Extraction Added
In this release I have added keyword extraction, which extracts the most important and frequent n-grams in a text.
I use two approaches for this. Both looked accurate on the variety of corpora I tried, but since I do not have access to a gold-standard set, I have no results to publish.
You can give this code a large text file as input; it clusters the topics in sequence, then summarizes each cluster and returns its keywords.
For keyword extraction, the default approach (with_embeded=False) uses a tf-idf-style weighting taken from bm25.py, the module used in Gensim text summarization. I added this code to my script with some alterations. It returns the most important words in each input text. The second approach (with_embeded=True) uses a word2vec model to build a graph of words in which the weight of each edge is the similarity between the two words it connects, then applies the TextRank algorithm. The TextRank implementation comes from the Gensim library, and I added it to my script with some changes. This approach returns the most important words while taking into account the meaning of the other words in the input text.
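As a rough illustration of the second approach, the sketch below ranks words with a weighted PageRank-style iteration over a similarity graph. It is not the Gensim implementation: `rank_words` is a hypothetical name, the word vectors are made-up stand-ins for word2vec embeddings, and the damping factor and iteration count are assumed values.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_words(word_vectors, damping=0.85, iterations=50):
    """Rank words by a TextRank-style score on a graph whose edge weights
    are the (non-negative) cosine similarities between word vectors."""
    words = sorted(word_vectors)
    weights = {w: {} for w in words}
    for a, b in combinations(words, 2):
        sim = max(0.0, cosine(word_vectors[a], word_vectors[b]))
        if sim > 0:
            weights[a][b] = sim
            weights[b][a] = sim

    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        new_scores = {}
        for w in words:
            # Each neighbour passes on its score in proportion to the edge weight.
            incoming = sum(
                scores[n] * weights[n][w] / sum(weights[n].values())
                for n in weights[w]
            )
            new_scores[w] = (1 - damping) / len(words) + damping * incoming
        scores = new_scores
    return sorted(words, key=lambda w: scores[w], reverse=True)


# Toy embeddings (assumption): "cluster" is similar to both "summary" and "topic",
# while "banana" is unrelated to everything else.
toy_vectors = {
    "cluster": [1.0, 1.0, 0.0],
    "summary": [1.0, 0.0, 0.0],
    "topic": [0.0, 1.0, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
ranked = rank_words(toy_vectors)
print(ranked[0], ranked[-1])  # cluster banana
```

The word connected to the most similar neighbours ends up on top, and a word with no similar neighbours ends up last, which is the intuition behind ranking words by meaning rather than by raw frequency.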
After acquiring the most important and frequent words, the code checks the n-grams that contain each word, from unigrams up to n-grams (the default n is 10). If an n-gram's frequency is more than half the frequency of the word and the n-gram occurs more than 2 times (for large inputs this number should be increased), the important word is replaced by that longer n-gram. Finally, the code returns the most important n-grams that contain the important words.
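The n-gram expansion rule above could look something like the following. This is a sketch under assumptions: `expand_keyword` and its parameters are hypothetical names, tokenization is a plain whitespace split, and only the single most frequent n-gram of each length is considered.

```python
from collections import Counter


def expand_keyword(tokens, keyword, max_n=10, min_count=2):
    """Grow a keyword into the longest n-gram containing it whose frequency is
    more than half the keyword's frequency and more than min_count."""
    word_freq = tokens.count(keyword)
    best = (keyword,)
    for n in range(2, max_n + 1):
        # Count every n-gram window that contains the keyword.
        counts = Counter(
            tuple(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)
            if keyword in tokens[i:i + n]
        )
        if not counts:
            break
        ngram, freq = counts.most_common(1)[0]
        if freq > word_freq / 2 and freq > min_count:
            best = ngram
        else:
            # Longer n-grams cannot be more frequent than shorter ones, so stop here.
            break
    return " ".join(best)


tokens = (
    "natural language processing is fun and "
    "natural language processing is hard but "
    "natural language processing helps"
).split()
print(expand_keyword(tokens, "language"))  # natural language processing
```

Here "language" occurs three times and always inside "natural language processing", so the keyword grows to that trigram; the four-word candidates occur at most twice and fail the occurrence check.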