WEXEA

WEXEA is an exhaustive Wikipedia entity annotation system that creates a text corpus from Wikipedia with exhaustive annotations of entity mentions, i.e. it links all mentions of entities to their corresponding articles.

WEXEA runs through several stages of article annotation; the final articles can be found in the 'final_articles' folder of the output directory. Each article is stored in a subfolder named after the first three letters of its title (lowercase), and sentences are split so that each line contains one sentence. Annotations follow the Wikipedia link conventions, with the annotation type appended at the end.
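
As an illustration of this layout, the following minimal Python sketch builds the path of a processed article; the file-name scheme used here ('<title>.txt') is an assumption and may differ from the actual output.

import os

# Minimal sketch for locating a processed article under the layout described above.
# The file-name scheme '<title>.txt' is an assumption, not taken from the repository.
def article_path(output_dir, title):
    subfolder = title[:3].lower()  # first three letters of the title, lowercase
    return os.path.join(output_dir, 'final_articles', subfolder, title + '.txt')

print(article_path('/data/wexea_output', 'Albert Einstein'))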

Downloads

WEXEA for...

  1. English: https://drive.google.com/file/d/1Cd7KEzzjl5g83OuUNKASnDzn_yEDD0mE/view?usp=sharing
  2. German: https://drive.google.com/file/d/1-xRbcwMcOnabljqhq9ilzMurZn7FnNT4/view?usp=sharing
  3. French: https://drive.google.com/file/d/1AHAfBnbg7UW7tr3hdRqIpA6RWld9Z6C0/view?usp=sharing
  4. Spanish: https://drive.google.com/file/d/1DIqNdZ5rECalYhGmr6AhSwO55671Ivic/view?usp=sharing

These datasets can be used as-is. Each archive contains a single file in which all articles are concatenated. The articles contain original as well as new annotations in the following formats:

  1. [[article name|text alias|annotation type]]
  2. [[text alias|CoreNLP NER type]]

The annotation type of format 1 can be ignored (type "annotation" corresponds to original annotations; all others are new). Annotations of format 2 are CoreNLP annotations without a corresponding Wikipedia article.
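
The following minimal Python sketch shows how these two formats could be parsed from a sentence; it illustrates the annotation syntax above and is not code from the repository.

import re

# Annotations look like [[article name|text alias|annotation type]] (format 1)
# or [[text alias|CoreNLP NER type]] (format 2).
ANNOTATION_RE = re.compile(r'\[\[([^\[\]]+)\]\]')

def parse_annotations(sentence):
    for match in ANNOTATION_RE.finditer(sentence):
        parts = match.group(1).split('|')
        if len(parts) == 3:
            article, alias, annotation_type = parts
            yield ('linked', article, alias, annotation_type)
        elif len(parts) == 2:
            alias, ner_type = parts
            yield ('unlinked', None, alias, ner_type)

sentence = "[[Albert Einstein|Einstein|annotation]] met [[Mileva|PERSON]] in Zurich."
for annotation in parse_annotations(sentence):
    print(annotation)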

Start CoreNLP toolkit

Download CoreNLP (including models for languages other than English) from https://stanfordnlp.github.io/CoreNLP/index.html

Start server:

java -mx16g -cp "<path to corenlp files>" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -threads 6
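
Once the server is running, it can be queried over its HTTP API. The following minimal Python sketch sends one sentence to the NER annotator on port 9000; it only serves to check that the server responds.

import json
import requests

# Ask the CoreNLP server (started above on port 9000) for NER tags on one sentence.
props = {'annotators': 'tokenize,ssplit,ner', 'outputFormat': 'json'}
resp = requests.post(
    'http://localhost:9000/',
    params={'properties': json.dumps(props)},
    data='Barack Obama was born in Hawaii.'.encode('utf-8'),
)
doc = resp.json()
for token in doc['sentences'][0]['tokens']:
    print(token['word'], token['ner'])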

Entity Linker

The entity linker uses models from https://github.com/nitishgupta/neural-el.

Download the resources from that repository and adjust the path to the resources folder in src/entity_linker/configs/config.ini.
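
The following minimal Python sketch simply lists the sections and keys of config.ini so the resources path can be located; it makes no assumptions about which section or key holds that path.

import configparser

# Print every section/key/value pair in the entity linker config so the
# resources path can be found and adjusted by hand.
config = configparser.ConfigParser()
config.read('src/entity_linker/configs/config.ini')
for section in config.sections():
    for key, value in config[section].items():
        print(f'[{section}] {key} = {value}')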

Run WEXEA

  1. Change the language-specific keyword variables in src/language_variables.py, depending on the language of the Wikipedia dump.
  2. Install the requirements from requirements.txt (TensorFlow is only needed for neural EL).
  3. In config/config.json, provide the path of the latest Wikipedia dump (XML file) and the output path (make sure the output folder does not exist yet; it will be created). A small sanity-check sketch follows this list.
  4. Make annotate.sh executable: "chmod 755 annotate.sh"
  5. In annotate.sh: Either use parser_4.py (with neural EL; English only) or parser_4_greedy.py (greedy EL).
  6. Run annotate.sh with ./annotate.sh
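
The following minimal Python sketch checks the configuration before starting a run; the key names 'wiki_dump' and 'output_path' are placeholders and should be replaced by the keys actually used in config/config.json.

import json
import os

# Pre-run sanity check: the dump file must exist and the output folder must not.
# 'wiki_dump' and 'output_path' are placeholder key names, not taken from the repository.
with open('config/config.json') as f:
    config = json.load(f)

assert os.path.isfile(config['wiki_dump']), 'Wikipedia dump (XML file) not found'
assert not os.path.exists(config['output_path']), 'Output folder must not exist yet'
print('Config looks good, run ./annotate.sh')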

Visualization

server.py starts a server and opens a website that can be used to visualize an article with Wikipedia links (blue) and unknown entities (green).

WiNER evaluation

  1. Create directory 'wexea_evaluation'.
  2. Adjust directory names for output as well as dictionaries in src/winer.py and src/evaluation.py.
  3. Run winer.py in order to create a sample from WiNER's as well as WEXEA's articles.
  4. Run src/evaluation.py in order to create a file per article, which contains WiNER's additional annotations (left) and WEXEA's additional annotations (right). Original annotations are removed in order to make the files more readable.

Files we used for evaluation (see Michael Strobl's PhD thesis) can be found in the data folder.

Hardware requirements

32GB of RAM are required (it may work with 16GB, but this is untested), and processing a full English Wikipedia dump should take around 2-3 days (less for other languages).

Parsers

Runtimes were measured on a Ryzen 7 2700X with 64GB of memory; data was read from and written to a hard drive. Runtimes are lower for languages other than English.

Parser 1 (~2h 45min / ~4.6GB memory in total / 20,993,369 articles currently)

Creates all necessary dictionaries.

Parser 2 (~1h 45 mins with 6 processes / ~6,000,000 articles to process)

Removes most wiki markup and irrelevant articles (e.g. lists or stubs), extracts aliases, and separates disambiguation pages.

The number of processes can be set to speed up the parsing of all articles. However, each process consumes around 7.5GB of memory.
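
The following minimal Python sketch only illustrates this trade-off with a multiprocessing pool; it is not the repository's actual parallelization code.

from multiprocessing import Pool

# Each worker holds its own copy of the dictionaries (~7.5GB), so the process
# count should be chosen to fit into the available RAM.
def process_article(article):
    # placeholder for the per-article parsing work
    return len(article)

if __name__ == '__main__':
    articles = ['article one ...', 'article two ...']
    num_processes = 6  # 6 processes -> roughly 6 * 7.5GB = 45GB of memory
    with Pool(num_processes) as pool:
        results = pool.map(process_article, articles)
    print(results)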

Parser 3 (~2 days with 6 processes / ~2,700,000 articles to process)

Runs CoreNLP NER and finds other entities based on the alias/redirect dictionaries.

Parser 4 (~2h / ~2,700,000 articles to process)

Runs co-reference resolution and EL.

Citation

Please cite the following papers:

Original WEXEA publication:

Strobl, Michael, Amine Trabelsi, and Osmar R. Zaïane. "WEXEA: Wikipedia exhaustive entity annotation." Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020.

Updated version (from which the linked datasets above are derived):

Strobl, Michael, Amine Trabelsi, and Osmar R. Zaïane. "Enhanced Entity Annotations for Multilingual Corpora." Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022.
