WREN

We are given a knowledge base K (e.g. DBpedia).
We want to extract instances of a particular concept C and their attributes (as defined in K).
There is a fixed domain D (e.g. Book), D = {d₁ … d_n}, where each d_i is a set of homogeneous entity-centric webpages, i.e.:
- all webpages in d_i belong to the same website (and share a common template)
- each webpage in d_i describes one entity e of type C

The method takes as input a set of homogeneous entity-centric webpages d_i describing entities of type C. For each attribute to extract, it also takes as input a gazetteer of possible values for the attribute, obtained from K. A gazetteer can be built with varying degrees of complexity, from a simple SPARQL query to more elaborate ones, and can be cleaned with outlier detection strategies, etc. In this project the gazetteers are assumed to be given; facilities to generate gazetteers will be released separately. The method generates:

- a set of xpath extractors (the set can contain zero, one or multiple extractors)
- the results of the extraction performed on d_i by applying the xpath extractors.

A very short presentation can be found here

Relevant papers:

AI Magazine 2015. Anna Lisa Gentile, Ziqi Zhang and Fabio Ciravegna (2015). Early Steps Towards Web Scale Information Extraction with LODIE. AI Magazine, 36(1), 55--64.
KCAP 2013. Anna Lisa Gentile, Ziqi Zhang, Isabelle Augenstein and Fabio Ciravegna (2013). Unsupervised wrapper induction using linked data. Proceedings of the seventh international conference on Knowledge capture, 41--48. Banff, Canada: ACM
TSD 2014. Anna Lisa Gentile, Ziqi Zhang and Fabio Ciravegna (2014). Self Training Wrapper Induction with Linked Data. Text, Speech and Dialogue - 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings, 285--292.
ISWC 2014. Anna Lisa Gentile and Suvodeep Mazumdar (2014). User driven Information Extraction with LODIE. Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference (ISWC 2014), 385--388.

You can also view a less than two minutes demo video.

REX

Resources

The folder resources contains:

- gazetteers that are used to seed the annotation phase. These gazetteers have been generated automatically, but are provided here as static resources for reproducibility. Relevant gazetteers are provided for all domain-attribute pairs tackled in the evaluation datasets
- evaluation datasets with their ground truth
- the temp folder, which is the default location where the method creates intermediate representations of pages.

The folder experimentResults is the default location where the method saves experimental results.

Workflow

Input

Input is provided as follows:

- a folder D which represents the domain and contains subfolders d_i, each containing a set of homogeneous entity-centric webpages; each webpage is a single html file
- for test purposes we provide example files for the book domain; the subfolders book-booksamillion-2000 and book-christianbook-2000 contain 2000 pages each, describing books from the http://www.booksamillion.com/ and http://www.christianbook.com/ websites respectively.

Page pre-processing

The original webpages are transformed into an internal xpath-value representation, where:

- we extract all text nodes of each page
- we save each page as the collection of its text nodes, stored as pairs (xpath expression to reach the node, text content of the node)
- the internal xpath-value representation of pages can be obtained using the methods provided in the class ReducePagesToXpath
- for test purposes the main method in ReducePagesToXpath produces the xpath-value representation of the book domain and saves it in the temp folder.
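
The pre-processing step described above can be sketched roughly as follows. This is a minimal illustration and not the code of ReducePagesToXpath: the class XpathValueSketch and its method names are hypothetical, and the sketch assumes the page is well-formed XHTML so that the JDK's built-in XML parser can read it (real-world HTML would usually need a tolerant HTML parser first). Each non-blank text node is stored under a positional xpath.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of the project. */
final class XpathValueSketch {

    /** Returns xpath -> text for every non-blank text node of a well-formed XHTML page. */
    static Map<String, String> toXpathValuePairs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            Element root = doc.getDocumentElement();
            Map<String, String> pairs = new LinkedHashMap<>();
            collect(root, "/" + root.getTagName() + "[1]", pairs);
            return pairs;
        } catch (Exception e) {
            // Assumption: malformed input is reported as an argument error.
            throw new IllegalArgumentException("Page is not well-formed XHTML", e);
        }
    }

    /** Walks the children of an element, building positional xpaths on the way down. */
    private static void collect(Element element, String path, Map<String, String> pairs) {
        Map<String, Integer> elementCounts = new HashMap<>();
        int textIndex = 0;
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Blank text nodes still count towards the text() position.
                textIndex++;
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    pairs.put(path + "/text()[" + textIndex + "]", text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                String tag = ((Element) child).getTagName();
                int index = elementCounts.merge(tag, 1, Integer::sum);
                collect((Element) child, path + "/" + tag + "[" + index + "]", pairs);
            }
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Dune</h1><p>by <b>Frank Herbert</b></p></body></html>";
        toXpathValuePairs(page).forEach((xpath, text) -> System.out.println(xpath + " -> " + text));
    }
}
```

Positional indices on every step keep each xpath unambiguous within a page, which matters because pages in d_i share a template and the same xpath is expected to point at the same kind of content across pages.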

Identifying extractors for each concept attribute

Candidate patterns for entity attributes

Given:

- a set of homogeneous entity-centric webpages d_i in xpath-value representation
- the attribute p_j to extract and its relevant gazetteer

The method reduces each webpage to a set of xpath-value pairs, which are the candidate xpath extractors found on the page. To create this set, the method matches all the values in the page against the provided gazetteer, and retains only the xpath-value pairs whose value is a strict match.
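
A minimal sketch of this matching step is given below. The class and method names are hypothetical, and "strict match" is interpreted here as exact, case-sensitive string equality between the text of a node and a gazetteer entry; the project may normalise values differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the project. */
final class CandidateMatcherSketch {

    /**
     * Keeps only the xpath-value pairs of a page whose value appears verbatim in the gazetteer.
     * Assumption: a strict match means exact string equality.
     */
    static Map<String, String> candidates(Map<String, String> page, Set<String> gazetteer) {
        Map<String, String> matched = new LinkedHashMap<>();
        for (Map.Entry<String, String> pair : page.entrySet()) {
            if (gazetteer.contains(pair.getValue())) {
                matched.put(pair.getKey(), pair.getValue());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("/html[1]/body[1]/h1[1]/text()[1]", "Dune");
        page.put("/html[1]/body[1]/p[1]/text()[1]", "Add to cart");
        System.out.println(candidates(page, Set.of("Dune", "Emma")));
    }
}
```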

Boilerplate removal

Given:

- the set of candidate xpath-value pairs obtained in the previous step

The method applies heuristics to remove spurious xpaths.

This step is optional.
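
The heuristics themselves are not described here. Purely as an illustration, the sketch below shows one plausible heuristic of this kind, which is an assumption and not necessarily what the project implements: an xpath is treated as boilerplate when it yields the same value on more than a given share of the pages, since template text (menus, category labels) repeats across pages while entity attributes vary. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the project. */
final class BoilerplateFilterSketch {

    /**
     * Returns the xpaths that show one identical value on more than maxShare of the pages.
     * Assumption: repeated identical values signal template text rather than an attribute.
     * The heuristic is only meaningful when several pages are available.
     */
    static Set<String> boilerplateXpaths(List<Map<String, String>> pages, double maxShare) {
        // xpath -> value -> number of pages showing that value at that xpath
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Map<String, String> page : pages) {
            for (Map.Entry<String, String> pair : page.entrySet()) {
                counts.computeIfAbsent(pair.getKey(), k -> new HashMap<>())
                      .merge(pair.getValue(), 1, Integer::sum);
            }
        }
        Set<String> boilerplate = new TreeSet<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            for (int pagesWithValue : entry.getValue().values()) {
                if (pagesWithValue > maxShare * pages.size()) {
                    boilerplate.add(entry.getKey());
                }
            }
        }
        return boilerplate;
    }

    public static void main(String[] args) {
        List<Map<String, String>> pages = List.of(
            Map.of("/nav", "Books", "/title", "Dune"),
            Map.of("/nav", "Books", "/title", "Emma"),
            Map.of("/nav", "Books", "/title", "Ulysses"));
        System.out.println(boilerplateXpaths(pages, 0.5));
    }
}
```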

Pattern ranking

Given:

- the candidate set of xpath-value pairs for all the pages from d_i

The method produces a ranked list of xpaths, which are the extractors for attribute p_j from pages in d_i.
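
The ranking criterion is not spelled out in this section. As a hedged illustration only, the sketch below ranks xpaths by their support, i.e. the number of pages in d_i on which the xpath produced a gazetteer match, with ties broken alphabetically to keep the output deterministic. The scoring function and all names are assumptions, not the project's actual implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the project. */
final class PatternRankingSketch {

    /**
     * Ranks xpaths by the number of pages on which they matched the gazetteer.
     * Each map holds the candidate xpath-value pairs of one page.
     */
    static List<String> rank(List<Map<String, String>> candidatePages) {
        Map<String, Integer> support = new HashMap<>();
        for (Map<String, String> page : candidatePages) {
            for (String xpath : page.keySet()) {
                support.merge(xpath, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(support.keySet());
        // Highest support first; alphabetical order breaks ties.
        Comparator<String> bySupportDesc =
            Comparator.comparingInt((String xpath) -> support.get(xpath)).reversed();
        ranked.sort(bySupportDesc.thenComparing(Comparator.naturalOrder()));
        return ranked;
    }

    public static void main(String[] args) {
        List<Map<String, String>> pages = List.of(
            Map.of("/a", "Dune", "/b", "Dune"),
            Map.of("/a", "Emma"),
            Map.of("/a", "Ulysses", "/c", "Ulysses"));
        System.out.println(rank(pages));
    }
}
```

Taking the top-ranked xpath (or the top few) as the extractor for p_j then amounts to reading the head of the returned list.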
