CinthiaS/hybrid-text-summarization


Study of training approaches of a hybrid summarization model using the metrics ROUGE and NUBIA

Dataset Creation

The dataset is created with the code provided in:

hybrid-text-summarization/src/create_database_train_valid

This code was implemented to automatically download patents from the USPTO database.

The code was divided into five steps:

  1. Extract the codes of the subgroups of a given class (searchSubgroups.py)
  2. Extract the patent links from the USPTO page (LinksExtract.py)
  3. Download the content of the links (LinksDownload.py)

At the end of this step, the files are organized in two folders, "summary/" and "title/". Each folder has one subfolder per document class. These subfolders hold the patent summary (in "abstract/") and the document title (in "title/"), both in .txt format.

  4. Blend the database (blend_database.py)
  5. Remove files repeated across subgroups 43, 47, 52, and 56, remove files with duplicate content, and pre-process the documents (organize_base.py)

Command to list duplicate files on Linux:

find . -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 20 > ../duplicate_files.txt
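The same check can be sketched in Python for environments where `md5sum` and `uniq` are not available. This is a minimal sketch, not the code used in organize_base.py: the function name `find_duplicate_files` is hypothetical, and it compares full MD5 digests, whereas the command above compares only the first 20 characters of each line (`-w 20`).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under `root` that share the same MD5 content digest.

    Hypothetical helper for illustration; it reads each file fully into
    memory, which is fine for small .txt files such as abstracts and titles.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Print each group of duplicates, separated by a blank line.
    for group in find_duplicate_files("."):
        print("\n".join(str(p) for p in group))
        print()
```

Each returned group lists paths with identical content, so keeping the first path and deleting the rest is one way to remove the duplicates.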

The folder hybrid-text-summarization/src/create_database_train_valid/IDs/ contains the IDs of all groups from which documents were collected.

Hybrid text summarization:

hybrid-text-summarization/notebooks/hybrid_text_summarization.ipynb

SOTA Models

The performance of different state-of-the-art algorithms on the text summarization task was evaluated.

Validation

The results were validated with the ROUGE metrics and the NUBIA metric.
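To illustrate what ROUGE-N measures, the sketch below computes n-gram overlap between a candidate summary and a reference summary. It is a simplified illustration, not the evaluation code of this repository: the function name `rouge_n` is hypothetical, tokenization is plain lowercase whitespace splitting, and no stemming or stopword handling is applied, so the scores will not match an official ROUGE package exactly.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two texts.

    Simplified ROUGE-N: lowercase whitespace tokens, no stemming.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = rouge_n("the cat sat on the mat", "the cat lay on the mat")
    print(f"ROUGE-1 precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Recall is the share of reference n-grams that the candidate recovers, and precision is the share of candidate n-grams that appear in the reference.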
