Information Extraction & Knowledge Elicitation for UN General Assembly (UNGA) resolutions

Currently, UN organizations and UN-related organizations produce, process, and maintain a high volume of documents, and these reports are designed primarily for humans to read, process, and generate insights from for decision-making.

The massive number of documents and their current formats (.doc or .pdf) create challenges for both the UN information management system and decision-makers. Extracting keywords from materials in their original forms is both time-consuming and labor-intensive.

This project aims to transform the documents into a machine-readable format, identify critical information and knowledge, improve information-processing efficiency through automation, and conduct analysis for insight discovery. Specifically, the main objectives of the information system project are to automatically generate machine-readable, semantically enhanced documentation for 1) a document retrieval tool; 2) metadata and document description queries; 3) text mining for content analysis.

1. Metadata with Regular Expression

The metadata was crawled from the UN General Assembly (UNGA) resolutions: https://www.un.org/en/sections/documents/general-assembly-resolutions/

To locate the critical information in each document, we present the high-level information in a structured format. The system uses Regular Expressions (RegEx), chosen because of the existing structure of each document: RegEx checks a series of characters for "matches" with efficiency and adaptability.

The task consists of two parts: field extraction and basic segmentation.

1.1 Metadata for documents

The sample extracted metadata fields are shown below.
Title and Closing Formula, which are not necessarily metadata, are also included.

  • Doc Name: N1846596
    Note*: Doc Name does not need extraction; it is used as the index.
  • ID: A/RES/73/277
  • Session: Seventy-third Session
  • Agenda Items: 148
  • Proponent Authority: The Fifth Committee
  • Approval Date: 2018-12-22
  • Title
    • Financing of the International Residual Mechanism for Criminal Tribunals
  • Closing Formula
    • 65th plenary meeting
    • 22 December 2018
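The field extraction could be sketched with a few patterns. The snippet below is a minimal illustration, not the project's code; the header text and the exact patterns are assumptions based on the typical layout of a resolution front page:

```python
import re

# Hypothetical header text in the style of a UNGA resolution front page.
header = """United Nations A/RES/73/277
General Assembly
Seventy-third session
Agenda item 148
Resolution adopted by the General Assembly on 22 December 2018
"""

# Document symbols follow the pattern A/RES/<session>/<number>.
doc_id = re.search(r"A/RES/\d+/\d+", header).group()

# The session line ends in "session"; agenda items are numeric.
session = re.search(r"([A-Z][\w-]+) session", header).group(0)
agenda = re.search(r"Agenda item[s]?\s+(\d+)", header).group(1)

print(doc_id)   # A/RES/73/277
print(agenda)   # 148
```

Because every resolution front page shares this layout, the same handful of patterns generalizes across the corpus.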

The result is shown as follows:

[figure: extracted metadata fields]

1.2 Paragraph Segmentation

We extract operative, preamble, annex, and footnote information, which is crucial for further content analysis.
The figure below shows an example from N1643743.doc with 'op' = operative:

[figure: operative paragraphs]

The figure below shows an example from N1643743.doc with 'pre' = preamble:

[figure: preamble paragraphs]

The figure below shows an example from N1643743.doc with 'ax' = annex:

[figure: annex paragraphs]

The figure below shows an example from N1643743.doc with 'fn' = footnote:

[figure: footnotes]
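The README does not spell out the segmentation rules, but the idea can be sketched minimally, assuming (as is typical for resolutions) that operative paragraphs are numbered, preamble paragraphs open with a participle, and everything after an "Annex" line belongs to the annex:

```python
import re

# Toy resolution body; the real segmentation rules are more involved.
paragraphs = [
    "Recalling its resolution 70/1 of 25 September 2015,",
    "Noting with concern the report of the Secretary-General,",
    "1. Decides to adopt the programme of work;",
    "2. Requests the Secretary-General to report annually;",
    "Annex",
    "Programme of work for the biennium",
]

def segment(paragraphs):
    tags, in_annex = [], False
    for p in paragraphs:
        if p.strip() == "Annex" or in_annex:
            in_annex = True          # the annex runs to the end of the body
            tags.append("ax")
        elif re.match(r"\d+\.\s", p):
            tags.append("op")        # numbered paragraphs are operative
        else:
            tags.append("pre")       # unnumbered opening paragraphs: preamble
    return tags

tags = segment(paragraphs)
print(tags)  # ['pre', 'pre', 'op', 'op', 'ax', 'ax']
```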

2. Task-based information extraction

This part consists of document abbreviation, deadline extraction, reference extraction, and database filtering.
These tasks build on the first part and require higher precision.

2.1 Document abbreviation

Abbreviation is performed only on operative paragraphs.
Sample output is shown below; words in red are incorrect abbreviations.
The testing accuracy for this task is 0.88.

[figure: sample abbreviations]

2.2 References and deadlines

Here our goal is to find past resolutions and future dates.
Their specific formats allow us to make more precise matches.
Sample outputs are shown as follows:

References:

  >>> b.refence(file)
  ['resolutions 1980/67 1989/84',
   'resolution 69/313',
   'resolutions 53/199 61/185',
   'decision XIII/5',
   'decision 14/30',
   'decision XII/19',
   'resolution 70/1',
   'decision 14/5']

Referred resolutions:

  >>> b.refered_doc(file, df)
  ['N1523222', 'N0650553', 'N1529189']

Future Date and Year:

  >>> b.future_date(file)
  (['8 June 2020', '11 June 2020'], ['2030', '2020', '2019'])
  ### Note that two lists are returned
  ### The year list is used when only a year or year range is mentioned
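The specific formats mentioned above make the matching straightforward. A minimal sketch with hypothetical patterns (not the project's code, and simplified to one symbol per keyword):

```python
import re

text = ("Recalling its resolutions 1980/67 and 69/313, and decision XIII/5, "
        "requests a report by 8 June 2020 and goals for 2030.")

# Resolutions/decisions: arabic or roman numerals separated by "/".
refs = re.findall(r"(?:resolutions?|decisions?)\s+[\dIVXLC]+/\d+", text)

# Full dates: "<day> <Month> <year>".
dates = re.findall(
    r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}\b", text)

# Bare years as a fallback when only a year or year range is mentioned.
years = re.findall(r"\b(?:19|20)\d{2}\b", text)

print(refs)   # ['resolutions 1980/67', 'decision XIII/5']
print(dates)  # ['8 June 2020']
print(years)  # ['1980', '2020', '2030']
```

Separating full dates from bare years mirrors the two-list return value shown above.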

2.3 Word count and word-based filtered database

These two functions are exploratory only and are not evaluated.
For the word count, only nouns and adjectives are kept, since they carry more meaning.
Users can specify which columns to search for keywords; case-sensitive matching is also supported.
Sample outputs are shown as follows:

Word-based filtered database with the word 'African':

[figure: filtered database]

Word count with number of terms = 10:

[figure: word counts]
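A keyword filter like the one described could be sketched with pandas; the column names and rows here are made up for illustration, and `keyword_filter` is a hypothetical helper, not the project's API:

```python
import pandas as pd

# Toy metadata table; the real database has one row per resolution.
df = pd.DataFrame({
    "doc": ["N1523222", "N1643743", "N1846596"],
    "title": ["African Union cooperation",
              "Financing of peacekeeping",
              "Report on african development"],
})

def keyword_filter(df, columns, keyword, case_sensitive=True):
    # Keep rows where any of the chosen columns contains the keyword.
    mask = False
    for col in columns:
        mask = mask | df[col].str.contains(keyword, case=case_sensitive,
                                           regex=False)
    return df[mask]

print(keyword_filter(df, ["title"], "African"))                        # 1 row
print(keyword_filter(df, ["title"], "African", case_sensitive=False))  # 2 rows
```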

3. Document classification

In this part, the goal is to classify the documents according to UNBIS (the UN Bibliographic Information System).
We build an algorithm based on a Bidirectional LSTM, relying on the preamble, operative paragraphs, and title.

Model and methodology:

Instead of using a pre-trained embedding layer, we train this layer from scratch.
Three LSTMs are applied in parallel, handling the preamble, operative paragraphs, and title separately.
A dropout layer is added to fight overfitting.
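A minimal sketch of this architecture with the Keras functional API; the vocabulary size, sequence length, class count, and layer widths are placeholder assumptions, since the real hyperparameters are not given here:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, MAXLEN, N_CLASSES = 5000, 100, 20  # assumed sizes, not from the README

def branch(name):
    # One input per text part; each embedding is trained from scratch
    # rather than loaded from pre-trained vectors.
    inp = keras.Input(shape=(MAXLEN,), name=name)
    x = layers.Embedding(VOCAB, 64)(inp)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    return inp, x

# Three parallel Bi-LSTM branches: preamble, operative paragraphs, title.
inputs, feats = zip(*(branch(n) for n in ("preamble", "operative", "title")))
x = layers.concatenate(list(feats))
x = layers.Dropout(0.5)(x)  # guards against overfitting
out = layers.Dense(N_CLASSES, activation="softmax")(x)

model = keras.Model(list(inputs), out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Concatenating the three branch outputs before the classifier lets each text part contribute its own learned representation.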

[figure: model architecture]

Results and evaluation:

We use 1,271 human-labeled documents.
Overall accuracy is around 94%.
Given the labelling method, this model may rely too heavily on the title.

[figure: evaluation results]

Sample predictions

The figure below shows predictions for some of the test data.

[figure: sample predictions]

4. Content analysis

In this part, we apply NER (Named Entity Recognition) and LDA for topic modeling.

4.1 LDA topic modeling

Latent Dirichlet Allocation (LDA) allows sets of observations to be explained by unobserved groups that account for why some parts of the data are similar.
Users can input the whole database, or subsets of the database filtered by keywords or categories.
Sample output (originally HTML):

[figure: LDA visualization]

4.2 Named Entity Recognitions

NER speeds up the information extraction process by recognizing, locating, and classifying named entities in the documents into pre-defined categories such as names of persons or organizations.
The NER model for UN resolutions is trained with person, organization, date, law, and place labels.
We use the displaCy visualizer from spaCy to display the labeled texts from documents.
After 250 training iterations, a demo result is shown as follows.
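The displaCy usage could be sketched as follows; here the entities are set by hand on a blank pipeline rather than predicted by the trained model, so the span indices and labels are illustrative:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank English pipeline just for tokenization; entities set manually.
nlp = spacy.blank("en")
doc = nlp("The Fifth Committee met on 22 December 2018 in New York.")
doc.ents = [
    Span(doc, 0, 3, label="ORG"),   # "The Fifth Committee"
    Span(doc, 5, 8, label="DATE"),  # "22 December 2018"
]

# displaCy renders entity spans as HTML, as in the screenshot above.
html = displacy.render(doc, style="ent", jupyter=False)
```

With a trained model, `doc.ents` would instead be filled by the pipeline's `ner` component.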

[figure: NER demo]

5. Django website

To demonstrate the results with a user-friendly interface, a repository website has been established.
This website is still under construction.
Categories are the results of classification according to UNBIS.
Labels are the aggregation of the top five words for each document.

Current views:

[figure: website views]

Original Code

1. Click here for basic.py (bottom)
2. Click here for a quick demo
