This project is a joint effort to combine technologies from LODIE_WI and REX.
LODIE_WI is the Wrapper Induction component from the LODIE project. LODIE_WI provides methods to learn website wrappers. The assumptions are:
- we have a given Knowledge base K (e.g. DBpedia)
- we want to extract instances of a particular concept C and its attributes (as defined in K)
- there is a fixed domain D (e.g. Book), D = {d1 … dn}, where each di is a set of homogeneous entity-centric webpages, i.e.:
- all webpages in di belong to the same website (and share a common template)
- each webpage in di describes one entity e of type C
The method takes as input a set of homogeneous entity-centric webpages di describing entities of type C. For each attribute to extract, the method also takes as input a gazetteer of possible values for that attribute, obtained from K. Gazetteers can be constructed with varying degrees of complexity, from a simple SPARQL query to more complex ones, and can optionally be cleaned with outlier detection strategies. In this project the gazetteers are assumed given; facilities to generate gazetteers will be released separately. The method generates:
- a set of xpath extractors (the set may contain zero, one or several extractors)
- the results of applying the xpath extractors to di.
A very short presentation can be found here
Relevant papers:
- AI Magazine 2015. Anna Lisa Gentile, Ziqi Zhang and Fabio Ciravegna (2015). Early Steps Towards Web Scale Information Extraction with LODIE. AI Magazine, 36(1), 55--64.
- KCAP 2013. Anna Lisa Gentile, Ziqi Zhang, Isabelle Augenstein and Fabio Ciravegna (2013). Unsupervised wrapper induction using linked data. Proceedings of the seventh international conference on Knowledge capture, 41--48. Banff, Canada: ACM.
- TSD 2014. Anna Lisa Gentile, Ziqi Zhang and Fabio Ciravegna (2014). Self Training Wrapper Induction with Linked Data. Text, Speech and Dialogue - 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings, 285--292. Paper PREPRINT
- ISWC 2014. Anna Lisa Gentile and Suvodeep Mazumdar (2014). User driven Information Extraction with LODIE. Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference (ISWC 2014), 385--388.
You can also view a demo video of less than two minutes.
The folder resources contains:
- gazetteers that are used to seed the annotation phase. These gazetteers have been generated automatically, but are provided here as static resources for reproducibility. Relevant gazetteers are provided for all domain-attribute pairs covered in the evaluation datasets
- evaluation datasets with the corresponding ground truth
- the temp folder is the default location where the method creates intermediate representations of pages.
The folder experimentResults is the default location where the method saves experimental results.
Input is provided as follows:
- a folder D which represents the domain and contains subfolders di containing a set of homogeneous entity-centric webpages; each webpage is a single html file
- for test purposes we provide example files for the book domain; the subfolders book-booksamillion-2000 and book-christianbook-2000 contain 2000 pages each, describing books from the http://www.booksamillion.com/ and http://www.christianbook.com/ websites respectively.
The original webpages are transformed in an internal xpath-value representation where:
- we extract all text nodes for each page
- we save each page as the collection of its text nodes, each stored as a pair: the xpath expression that reaches the node, and the text content of the node
- the internal xpath-value representation of pages can be obtained using methods provided in the class ReducePagesToXpath
- for test purposes the main method in ReducePagesToXpath will produce the xpath-value representation of book and save it in the temp folder.
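
The xpath-value representation described above can be illustrated with a minimal Python sketch. This is not the code of ReducePagesToXpath: the names `TextNodeCollector` and `page_to_xpath_values` are hypothetical, the positional xpath format is an assumption, and the standard-library parser used here does not repair malformed HTML the way a browser would.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextNodeCollector(HTMLParser):
    """Collects (xpath, text) pairs for every non-blank text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.path = []      # currently open elements as (tag, position) steps
        self.counts = [{}]  # for each open element: child name -> children seen so far
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.path.append((tag, siblings[tag]))
        self.counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the tag closes the innermost open element.
        if self.path and self.path[-1][0] == tag:
            self.path.pop()
            self.counts.pop()

    def handle_data(self, data):
        # Every text node is counted, but blank ones are not stored.
        siblings = self.counts[-1]
        siblings["text()"] = siblings.get("text()", 0) + 1
        text = " ".join(data.split())
        if text:
            steps = "".join(f"/{tag}[{pos}]" for tag, pos in self.path)
            self.pairs.append((f"{steps}/text()[{siblings['text()']}]", text))


def page_to_xpath_values(html):
    """Return the page as a list of (xpath, text) pairs."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


page = ("<html><body><h1>Dune</h1>"
        "<div><span>Author:</span> <span>Frank Herbert</span></div>"
        "</body></html>")
for xpath, value in page_to_xpath_values(page):
    print(xpath, "->", value)
```

Each pair keeps enough positional information to locate the same node on other pages that share the template, which is what makes an xpath usable as an extractor.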
Given:
- a set of homogeneous entity-centric webpages di in xpath-value representation
- the attribute pj to extract and its relevant gazetteer
The method reduces each webpage to a set of xpath-value pairs, which are the candidate xpath extractors found on the page. To create this set, the method matches all the values in the page against the provided gazetteer, and retains only the xpath-value pairs where the value is a strict match.
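
A minimal Python sketch of this matching step follows. It assumes that "strict match" means the whole text node equals a gazetteer entry after surrounding whitespace is trimmed; the function name `candidate_pairs` and the sample data are hypothetical and only illustrate the idea.

```python
def candidate_pairs(page_pairs, gazetteer):
    """Keep only the (xpath, value) pairs whose value is a gazetteer entry.

    Assumption: a strict match is whole-string equality after trimming
    whitespace; partial or fuzzy matches are rejected.
    """
    entries = {entry.strip() for entry in gazetteer}
    return [(xpath, value) for xpath, value in page_pairs
            if value.strip() in entries]


# Hypothetical xpath-value representation of one book page.
page_pairs = [
    ("/html[1]/body[1]/h1[1]/text()[1]", "Dune"),
    ("/html[1]/body[1]/div[1]/span[2]/text()[1]", "Frank Herbert"),
    ("/html[1]/body[1]/div[2]/text()[1]", "by Frank Herbert"),
]
# Hypothetical gazetteer for the attribute "author".
authors = {"Frank Herbert", "Isaac Asimov"}

print(candidate_pairs(page_pairs, authors))
```

In this example only the second pair survives: "by Frank Herbert" contains a gazetteer entry but is not equal to one, so under the strict-match assumption it is discarded.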
Given:
- the set of candidate xpath-value pairs obtained in the previous step
The method applies heuristics to remove spurious xpaths.
This step is optional.
Given:
- the candidate set of xpath-value pairs for all the pages from di
The method produces a ranked list of xpaths, which are the extractors for attribute pj from pages in di.
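
The ranking criterion is not spelled out here, so the following Python sketch makes an assumption: xpaths are ranked by the number of pages in di on which they produced a gazetteer match, with ties broken alphabetically. The function name `rank_xpaths` and the sample data are hypothetical.

```python
from collections import Counter


def rank_xpaths(candidates_per_page):
    """Rank xpaths by how many pages they matched on (assumed criterion).

    candidates_per_page: one list of (xpath, value) pairs per page.
    Returns a list of (xpath, page_count), best first.
    """
    support = Counter()
    for pairs in candidates_per_page:
        # Count each xpath at most once per page.
        support.update({xpath for xpath, _ in pairs})
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidates from three pages of the same website.
candidates = [
    [("/html[1]/body[1]/div[1]/span[2]/text()[1]", "Frank Herbert"),
     ("/html[1]/body[1]/div[5]/a[1]/text()[1]", "Isaac Asimov")],
    [("/html[1]/body[1]/div[1]/span[2]/text()[1]", "Isaac Asimov")],
    [("/html[1]/body[1]/div[1]/span[2]/text()[1]", "Ursula K. Le Guin")],
]

for xpath, pages in rank_xpaths(candidates):
    print(pages, xpath)
```

Under this assumption, an xpath that matches on many pages of the same template is more likely to point at the attribute itself than at an incidental mention, such as a "related authors" link.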