# Speed benchmarks for information extraction systems
We provide random samples of 5K and 10K articles from the English Wikipedia for use in benchmarking.
These articles were processed using the ie-benchmarks branch of wikiparse, which in turn uses the org.clulab.processors.clu.CluProcessor (v7.51) text annotator for sentence segmentation, tokenization, part-of-speech tagging, lemmatization, chunking, named entity recognition, and dependency parsing.
As of this writing, recently completed dumps of the English Wikipedia are listed at https://dumps.wikimedia.org/enwiki/. The sample datasets released here were generated from the June 2, 2014 dump of the English Wikipedia.
```shell
# Download a random sample of 5K EN Wikipedia articles
curl https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en/5K.tar.gz --output 5K.tar.gz
# Unpack the archive
tar xvzf 5K.tar.gz

# Download a random sample of 10K EN Wikipedia articles
curl https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en/10K.tar.gz --output 10K.tar.gz
# Unpack the archive
tar xvzf 10K.tar.gz
```
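Before indexing, it can help to sanity-check what was extracted. The helper below simply counts the extracted files; it assumes the tarball unpacks into a directory named after the sample (e.g. `5K/`), which you can confirm with `tar -tzf 5K.tar.gz | head`.

```shell
# Count the document files extracted from a sample archive.
# Assumption: the archive unpacks into a directory such as 5K/;
# verify the layout with `tar -tzf 5K.tar.gz | head`.
count_docs() {
  find "$1" -type f | wc -l | tr -d ' '
}
```

If each article is stored as one file, `count_docs 5K` should report a number close to 5000; a very different count suggests the articles are grouped or nested differently than assumed here.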
```shell
# Build an Odinson index for each sample
sbt "odinson/runMain ai.lum.benchmarks.odinson.IndexDocuments -i 5K -o 5k-index"
sbt "odinson/runMain ai.lum.benchmarks.odinson.IndexDocuments -i 10K -o 10k-index"
```
```shell
# Benchmark the Odinson queries against the 5K index
sbt "odinson/runMain ai.lum.benchmarks.odinson.BenchmarkQueries -i 5k-index -q queries/odinson/president.txt -n 1000 -o output/5k/odinson"
# Benchmark the corresponding Odin grammar over the 5K documents
sbt "odin/runMain ai.lum.benchmarks.odin.BenchmarkQueries -d 5K -g queries/odin/system.yml -n 1000 -o output/5k/odin"
```
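The benchmark runs write their results under `output/`. The exact file format depends on the benchmark release, but assuming a hypothetical one-record-per-line layout of `query<TAB>milliseconds` (inspect the actual output files to confirm), a quick summary can be sketched as:

```shell
# Summarize per-query timings from a benchmark output file.
# The record layout here is hypothetical ("query<TAB>milliseconds" per line);
# check the files written under output/ for the real format.
summarize_timings() {
  awk -F'\t' '
    { total += $2; n += 1; if ($2 > max) max = $2 }
    END { if (n > 0) printf "queries=%d total_ms=%d mean_ms=%.1f max_ms=%d\n", n, total, total / n, max }
  ' "$1"
}
```

For example, `summarize_timings output/5k/odinson/timings.tsv` (a hypothetical filename) would print the query count alongside total, mean, and maximum latency in milliseconds.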