
WREN


Introduction

This project is a joint effort to combine technologies from LODIE_WI and REX.

LODIE_WI

LODIE_WI is the Wrapper Induction component from the LODIE project. LODIE_WI provides methods to learn website wrappers. The assumptions are:

  • we have a given Knowledge base K (e.g. DBpedia)
  • we want to extract instances of a particular concept C and its attributes (as defined in K)
  • there is a fixed domain D (e.g. Book), D = {d1 … dn}, where each di is a set of homogeneous entity-centric webpages, i.e.:
    • all webpages in di belong to the same website (and share a common template)
    • each webpage in di describes one entity e of type C

The method takes as input a set of homogeneous entity-centric webpages di describing entities of type C; for each attribute to extract, it also takes as input a gazetteer with possible values for that attribute, obtained from K. Each gazetteer can be constructed with varying degrees of complexity, from a simple SPARQL query to more elaborate ones, and can be cleaned with outlier detection strategies, etc. (a minimal query sketch is given after the list below). In this project the gazetteers are assumed given; facilities to generate gazetteers will be released separately. The method generates:

  • a set of xpath extractors (the cardinality of the set can be 0, 1 or more)
  • the results of the extraction performed on di by applying the xpath extractors.
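
As an illustration of the simplest kind of gazetteer construction mentioned above, the sketch below queries DBpedia for book author names via SPARQL. It is a minimal example and not part of this codebase: the use of Apache Jena, the endpoint URL and the query itself are assumptions made for illustration.

```java
import org.apache.jena.query.*;

// Minimal sketch (assumption: Apache Jena on the classpath): build a gazetteer of
// possible values for the attribute "author" of concept Book by querying DBpedia.
public class GazetteerSketch {
    public static void main(String[] args) {
        String endpoint = "https://dbpedia.org/sparql";
        String query =
            "PREFIX dbo: <http://dbpedia.org/ontology/> " +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "SELECT DISTINCT ?name WHERE { " +
            "  ?book a dbo:Book ; dbo:author ?author . " +
            "  ?author rdfs:label ?name . FILTER (lang(?name) = \"en\") } LIMIT 1000";
        QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                // each binding of ?name is one candidate gazetteer entry
                System.out.println(results.next().getLiteral("name").getString());
            }
        } finally {
            qexec.close();
        }
    }
}
```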

A very short presentation can be found here

Relevant papers:

You can also view a demo video of less than two minutes.

REX

Resources

The folder resources contains:

  • gazetteers that are used to seed the annotation phase. These gazetteers have been automatically generated, but are given as static resources here for reproducibility. Relevant gazetteers are provided for all domain-attributes tackled in the evaluation datasets
  • evaluation datasets with the corresponding ground truth
  • the temp folder is the default location where the method creates intermediate representations of pages.

The folder experimentResults is the default location where the method saves experimental results.

Workflow

Input

Input is provided as follows:

Page pre-processing

The original webpages are transformed into an internal xpath-value representation, where:

  • we extract all text nodes for each page
  • we save each page as the collection of its text nodes, i.e. as pairs of (xpath expression to reach the node, text content of the node)
  • the internal xpath-value representation of pages can be obtained using methods provided in the class ReducePagesToXpath
  • for test purposes, the main method in ReducePagesToXpath will produce the xpath-value representation of the book dataset and save it in the temp folder (a minimal sketch of this representation is given below).
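
The sketch below illustrates the kind of representation produced; it is not the code of ReducePagesToXpath, whose internals are not shown here. It walks a DOM Document and emits, for every non-empty text node, the positional xpath that reaches it together with its text content. Parsing the HTML into the Document is assumed to happen elsewhere (e.g. with an HTML-aware parser).

```java
import org.w3c.dom.*;
import java.util.*;

// Hypothetical illustration: reduce a parsed page (DOM Document) to xpath -> text pairs.
public class XpathValueSketch {

    /** Collects (xpath, text) pairs for every non-empty text node in the document. */
    public static Map<String, String> reduce(Document doc) {
        Element root = doc.getDocumentElement();
        Map<String, String> pairs = new LinkedHashMap<>();
        walk(root, "/" + root.getNodeName() + "[1]", pairs);
        return pairs;
    }

    private static void walk(Node node, String path, Map<String, String> pairs) {
        NodeList children = node.getChildNodes();
        Map<String, Integer> elementIndex = new HashMap<>(); // positional index per tag name
        int textIndex = 0;                                   // positional index of text() siblings
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                String name = child.getNodeName();
                int idx = elementIndex.merge(name, 1, Integer::sum);
                walk(child, path + "/" + name + "[" + idx + "]", pairs);
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                textIndex++;
                String text = child.getTextContent().trim();
                if (!text.isEmpty()) {
                    pairs.put(path + "/text()[" + textIndex + "]", text);
                }
            }
        }
    }
}
```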
Identifying extractors for each concept attribute

Candidate patterns for entity attributes

Given:

  • a set of homogeneous entity-centric webpages di in xpath-value representation
  • the attribute pj to extract and its relevant gazetteer

The method reduces each webpage to a set of xpath-value pairs, which are the candidate xpath extractors as found on the page. To create this set, the method matches all the values in the page against the provided gazetteer and retains only the xpath-value pairs whose value is a strict match (a sketch follows).
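
A minimal sketch of this matching step, under the assumption that both the page representation and the gazetteer hold plain strings and that a strict match is an exact comparison of the trimmed text (any further normalisation is an assumption not stated above):

```java
import java.util.*;

// Hypothetical sketch: retain only xpath-value pairs whose text is a strict gazetteer match.
public class CandidateExtractors {
    public static Map<String, String> candidates(Map<String, String> xpathValuePairs,
                                                  Set<String> gazetteer) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : xpathValuePairs.entrySet()) {
            // strict match: the whole text node equals a gazetteer value
            if (gazetteer.contains(e.getValue().trim())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```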

Boilerplate removal

Given:

  • the set of candidate xpath-value pairs obtained in the previous step

The method applies heuristics to remove spurious xpaths (one possible heuristic is sketched below). This step is optionally applied.
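
The specific heuristics are not detailed above, so the sketch below shows just one plausible example (an assumption, not necessarily what WREN implements): discard candidate xpaths whose matched value is identical on every page of the site, since text that never varies across entity-centric pages is likely template boilerplate rather than an attribute value.

```java
import java.util.*;

// Hypothetical heuristic: drop candidate xpaths whose matched value never varies across pages,
// since constant text on an entity-centric site is likely template boilerplate.
public class BoilerplateFilter {
    public static Set<String> keepVaryingXpaths(List<Map<String, String>> candidatesPerPage) {
        Map<String, Set<String>> valuesByXpath = new HashMap<>();
        for (Map<String, String> page : candidatesPerPage) {
            for (Map.Entry<String, String> e : page.entrySet()) {
                valuesByXpath.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(e.getValue());
            }
        }
        Set<String> kept = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> e : valuesByXpath.entrySet()) {
            if (e.getValue().size() > 1) { // the value changes from page to page
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```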

Pattern ranking

Given:

  • the candidate set of xpath-value pairs for all the pages from di

The method produces a ranked list of xpaths, which are the extractors for attribute pj from pages in di (a sketch of one possible ranking criterion follows).
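
The ranking criterion is not specified above; a natural choice, assumed here purely for illustration, is to order candidate xpaths by their support, i.e. the number of pages in di on which they produced a gazetteer match:

```java
import java.util.*;

// Hypothetical ranking: order candidate xpaths by the number of pages they matched on.
public class PatternRanking {
    public static List<String> rank(List<Map<String, String>> candidatesPerPage) {
        Map<String, Integer> support = new HashMap<>();
        for (Map<String, String> page : candidatesPerPage) {
            for (String xpath : page.keySet()) {
                support.merge(xpath, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(support.keySet());
        // highest support first
        ranked.sort((a, b) -> Integer.compare(support.get(b), support.get(a)));
        return ranked;
    }
}
```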
