# Speed benchmarks for information extraction systems
We provide random samples of 5K and 10K articles from the English Wikipedia for use in benchmarking.
These articles were processed using the ie-benchmarks branch of wikiparse, which in turn uses the org.clulab.processors.clu.CluProcessor (v7.51) text annotator for sentence segmentation, tokenization, part-of-speech tagging, lemmatization, chunking, named entity recognition, and dependency parsing.
As of this writing, recently completed dumps of the English Wikipedia are listed at https://dumps.wikimedia.org/enwiki/. The sample datasets released here were generated from the June 2, 2014 dump of the English Wikipedia.
```shell
# Download a random sample of 5K EN Wikipedia articles
curl https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en/5K.tar.gz --output 5K.tar.gz
# Unpack the archive
tar xvzf 5K.tar.gz

# Download a random sample of 10K EN Wikipedia articles
curl https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en/10K.tar.gz --output 10K.tar.gz
# Unpack the archive
tar xvzf 10K.tar.gz
```
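Before indexing, it can help to sanity-check what was extracted. The helper below simply counts the extracted files; it assumes the tarball unpacks into a directory named after the sample (e.g. `5K/`), which you can confirm with `tar -tzf 5K.tar.gz | head`.

```shell
# Count the document files extracted from a sample archive.
# Assumption: the archive unpacks into a directory such as 5K/;
# verify the layout with `tar -tzf 5K.tar.gz | head`.
count_docs() {
  find "$1" -type f | wc -l | tr -d ' '
}
```

If each article is stored as one file, `count_docs 5K` should report a number close to 5000; a very different count suggests the articles are grouped or nested differently than assumed here.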
```shell
# Build an Odinson index for each sample
sbt "odinson/runMain ai.lum.benchmarks.odinson.IndexDocuments -i 5K -o 5k-index"
sbt "odinson/runMain ai.lum.benchmarks.odinson.IndexDocuments -i 10K -o 10k-index"
```
```shell
# Benchmark the Odinson queries against the 5K index
sbt "odinson/runMain ai.lum.benchmarks.odinson.BenchmarkQueries -i 5k-index -q queries/odinson/president.txt -n 1000 -o output/5k/odinson"
# Benchmark the corresponding Odin grammar over the 5K documents
sbt "odin/runMain ai.lum.benchmarks.odin.BenchmarkQueries -d 5K -g queries/odin/system.yml -n 1000 -o output/5k/odin"
```
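The benchmark runs write their results under `output/`. The exact file format depends on the benchmark release, but assuming a hypothetical one-record-per-line layout of `query<TAB>milliseconds` (inspect the actual output files to confirm), a quick summary can be sketched as:

```shell
# Summarize per-query timings from a benchmark output file.
# The record layout here is hypothetical ("query<TAB>milliseconds" per line);
# check the files written under output/ for the real format.
summarize_timings() {
  awk -F'\t' '
    { total += $2; n += 1; if ($2 > max) max = $2 }
    END { if (n > 0) printf "queries=%d total_ms=%d mean_ms=%.1f max_ms=%d\n", n, total, total / n, max }
  ' "$1"
}
```

For example, `summarize_timings output/5k/odinson/timings.tsv` (a hypothetical filename) would print the query count alongside total, mean, and maximum latency in milliseconds.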