- view code : Seq2Seq with Bahdanau Attention
- Environment : macOS BigSur(python 3.8, tensorflow 2.4.1, CPU)
- Dataset : Amazon Fine Food Reviews from Kaggle(lang: English)
- Pre-processing : tokenizing(nltk), padding, tagging, word embedding(dim=128)
- Model
- Softmax approximation, Dropout, LSTM(3 encoder layers), Embedding Layer, Attention Layer, Concatenate Layer(to concatenate the attention context with the decoder hidden state)
- Attention mechanism : Bahdanau Attention (a minimal sketch follows at the end of this section)
- Performance
- val_loss : 1.8389(20 epochs)
- Sample Text : this is a tough review to write because this product is average in every way you follow the instructions and you end up with macaroni and cheese not bad but nothing special
- Predicted Summary of the Sample : not as good as i expected
- Actual Summary of the Sample : average in every way
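For reference, below is a minimal sketch of a Bahdanau (additive) attention layer in tf.keras, illustrating the attention and concatenation steps listed above. Layer names and sizes are illustrative assumptions, not the exact code in the linked notebook.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(h_dec, h_enc) = v^T tanh(W1 h_dec + W2 h_enc)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder hidden state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.V = tf.keras.layers.Dense(1)       # scores each source position

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, dec_units) -> (batch, 1, dec_units)
        query = tf.expand_dims(decoder_hidden, 1)
        # score: (batch, src_len, 1)
        score = self.V(tf.nn.tanh(self.W1(query) + self.W2(encoder_outputs)))
        attention_weights = tf.nn.softmax(score, axis=1)
        # context vector: weighted sum over source positions -> (batch, enc_units)
        context = tf.reduce_sum(attention_weights * encoder_outputs, axis=1)
        return context, attention_weights

# At each decoder step the context vector is concatenated with the decoder
# hidden state (the Concatenate layer above) before the final softmax projection.
```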
- view code : Transformer with MultiHead Attention
- Environment : Google Colab(python 3.7, tensorflow 2.0+, GPU)
- Dataset : Amazon Fine Food Reviews from Kaggle(lang : English) / GloVe pre-trained vectors(to initialize the word embedding)
- Pre-processing : tokenizing(nltk), padding, tagging, word embedding(dim=300)
- Model : Transformer encoder-decoder with multi-head attention (a minimal sketch follows at the end of this section)
- Performance
- Average Validation BLEU : 0.1078(10 epochs)
- Sample Text : great price and fast shipping ! and i enjoy that u do not have to drive so far away to get that stuff ! it is so good to but it from online ! ! it was still fresh and good
- Predicted Summary of the Sample : great coffee
- Actual Summary of the Sample : fresh n good
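For reference, here is a minimal sketch of multi-head attention built on scaled dot-product attention, as used in the Transformer above. Dimensions, layer names, and the masking convention (1 = masked) are assumptions, not the exact code in the linked notebook.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)          # (..., len_q, len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        logits += (mask * -1e9)                            # 1 = masked (padding / future positions)
    weights = tf.nn.softmax(logits, axis=-1)
    return tf.matmul(weights, v), weights

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.depth = num_heads, d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, len, d_model) -> (batch, num_heads, len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        out, weights = scaled_dot_product_attention(q, k, v, mask)
        out = tf.transpose(out, perm=[0, 2, 1, 3])          # (batch, len, heads, depth)
        out = tf.reshape(out, (batch_size, -1, self.num_heads * self.depth))
        return self.dense(out), weights                     # (batch, len, d_model)
```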
- view code : BigBird with Sparse Attention
- Environment : Google Colab(python 3.7, tensorflow 2.0+, GPU, 25GB RAM)
- Dataset : scientific_papers/pubmed(pretrained), cnn_dailymail(test) from TFDS
- Model(used the Saved Model in this repository; the components below are sketched at the end of this section)
- GELU approximation, Dropout, Layer Normalization
- Sine-Cosine Positional Encoding
- Attention Mechanism : Multihead Attention, Big Bird Attention(block_sparse setting : random block size set by block_size, window block size fixed at 3, global block size fixed at 2)
- Decoder : Beam Search
- Performance
- Rouge Score : high fmeasure = 0.0826, low precision=0.0622(10 epochs)
- In the Colab 25GB RAM environment, training the model crashed my session with a batch size of 2 and ran out of memory with a batch size of 16, so I used Google's pretrained BigBird model.
- In the paper, performance on CNN_dailymail as the test dataset for shorter summarization was quite good(R-L BIGBIRD-Pegasus: 40.74, BIGBIRD-RoBERTa: 36.61), but the saved model(BIGBIRD-Pegasus) does not seem to match it.
- The pretrained model was trained on the PubMed dataset, so I think BigBird performs better when it is tested on the same dataset it was trained on.
- This seems related to the Pegasus model. The paper 'PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization' reports that the LEAD baseline had decent performance on the two news datasets but was significantly worse on the two non-news datasets, which agrees with findings of lead bias in news datasets.
- Test dataset : PubMed
- Test dataset : CNN_dailymail
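For reference, the sketch below illustrates three of the components listed under Model: the sine-cosine positional encoding, the tanh approximation of GELU, and a block-level sparsity pattern with a 3-block sliding window, 2 global blocks, and a few random blocks per row. All function names are hypothetical, and the mask is only an illustration of the block_sparse pattern, not Google's BigBird implementation.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sine-cosine encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def gelu_approx(x):
    """Tanh approximation of GELU, used in place of the exact erf form."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def block_sparse_mask(num_blocks, num_rand_blocks, window=3, num_global=2, seed=0):
    """Boolean block-level mask: True where a query block may attend to a key block."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for q in range(num_blocks):
        # sliding window of 3 blocks centered on the query block
        lo, hi = max(0, q - window // 2), min(num_blocks, q + window // 2 + 1)
        mask[q, lo:hi] = True
        # a few random key blocks per query block
        mask[q, rng.choice(num_blocks, size=min(num_rand_blocks, num_blocks), replace=False)] = True
    # 2 global blocks attend to everything and are attended by everything
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask
```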
- view code : TextRank Algorithm
- Environment : macOS BigSur(python 3.7, tensorflow 2.4.1, CPU)
- Dataset : newspaper3k(lang: Korean)
- Pre-Processing : tokenizing(konlpy), removing stopwords
- Method : TF-IDF, sentence similarity graph, and the TextRank algorithm (a minimal sketch follows at the end of this section)
- Performance
- Original Text (Korean) : 이에 따르면 보통 사람은 뇌의 10%를 사용할까 말까 한데 아인슈타인은 30%나 사용했다고 한다. 어떤 버전에서는 아인슈타인이 직접 그렇게 밝히기도 한다(?). 한때 학교에서도 들을 수 있던 믿거나 말거나 식의 이야기가 자취를 감추게 된 건 다행한 일이다. 과학자들에 의하면 우리 뇌는 10%만 사용되기는커녕 거의 언제나 100% 가동 중이다. 더구나 뇌는 막대한 유지비가 드는 비싼 기관이므로 90%를 사용 안 하고 놀려 둔다는 것은 진화론적인 관점에서도 있을 수 없는 일이다.
- Predicted Summary : 이에 따르면 보통 사람은 뇌의 10%를 사용할까 말까 한데 아인슈타인은 30% 나 사용했다고 한다. 믿거나 말거나 식의 이야기가 자취를 감추게 된 건 다행한 일이다. 과학자들에 의하면 우리 뇌는 10% 만 사용되기는커녕 거의 언제나 100% 가동 중이다.
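For reference, a minimal sketch of the TF-IDF + graph + TextRank pipeline described above, assuming scikit-learn and networkx (not necessarily the libraries used in the linked code). For Korean input, the sentences would first be tokenized with konlpy and stopwords removed, as in the pre-processing step above.

```python
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_summarize(sentences, num_sentences=3):
    """Rank sentences by PageRank over a TF-IDF cosine-similarity graph."""
    # TF-IDF vector per sentence (for Korean, pass konlpy-tokenized, stopword-filtered text)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # pairwise cosine similarity = adjacency matrix of the sentence graph
    similarity = (tfidf @ tfidf.T).toarray()
    np.fill_diagonal(similarity, 0.0)
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)  # TextRank = PageRank on the similarity graph
    # keep the top-scoring sentences, restored to their original order
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```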
