Jean-Marc Vanel edited this page Jan 16, 2019 · 12 revisions


Introduction

My approach to semantization is to get to an RDF syntax as fast as possible, and then to process it with RDF tools. For XML, I use Gloze, which performs a generic translation of XML, followed by a rule base in N3 with Euler / Eye.

Different methods

There are several methods:

  1. XML transform methods: XSLT or XQuery
  2. RDF transform methods: canonical (direct mapping) conversion to RDF, then SPARQL or another RDF technique
  3. tabular methods: use RDF property URIs or Turtle terms as column names in CSV, then convert to RDF with semantic_forms on the command line, see Semantize raw stuff
  4. JSON methods:

An example of method 1: https://framagit.org/Scrutari/RDFexport/blob/master/scrutari-to-rdf.xslt

Method 2 is detailed below.

For method 3 (CSV), there is a useful site, but it is limited to SKOS: http://labs.sparna.fr/skos-play/convert

Convert CSV to RDF

The CSV file can be conveniently obtained via LibreOffice.

Step 1: direct mapping

A typical first part of the data mill (direct mapping):

java -cp $JARS deductions.runtime.connectors.CSVImporterApp  \
  "/home/jmv/data/Voisins sur site.csv" \
  urn:gv/contacts \
  ~/data/adding-details-each_row.ttl \
  ,

The arguments are:

  • input CSV file or URL (with a header row)
  • base URL for the rows
  • URL or file (in Turtle) for adding details to each row
  • separator character

It leverages the column-to-URI mappings obtained from the first row, which must consist of property URIs (or Turtle URIs abbreviated with well-known prefixes). Here is a typical example of an RDF CSV file (nature field trips):

dbo:startDate, rdfs:label, dct:subject, schema:performer, schema:departureStation, nature:returnStation,,,,,
D 31 mars 2019, Forêt de Nanteau-Poloigny, Lichénologie, G. Carlier, départ Paris-Gare de Lyon à 8h16 pour Bagneaux-sur-Loing 9h23, retour de Bagneaux à 17h30 + ANVL + CNCE,,,,,
D 14 avril 2019, Forêt de Fontainebleau, mycologie et botanique, JP CHABRIER et A LAURON, Paris Gare de Lyon 8h16 pour Montigny-sur-Loing 9h04, retour de Tomery 18h58 + CNCE + ANVL + SMF,,,,,

When a column header is not a URI, the tool creates one from the given prefix and the header.

In this example, the file adding-details-each_row.ttl contains:

PREFIX schema: <http://schema.org/>
<any:ROW> a schema:Event .
<any:ROW> <http://purl.org/NET/c4dm/event.owl#agent> <http://naturalistes.chez.com/> .

Or, in another example:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<any:ROW> a foaf:Organization .

When a cell in the tabular data looks like a URI, or like a Turtle URI abbreviated with a well-known prefix, it is expanded.

A sample produced by this step is:

<urn:gv/contacts/row/117>
        a       <http://vocab.sindice.net/csv/Row> , <http://xmlns.com/foaf/0.1/Organization> ;
        <http://purl.org/dc/elements/1.1/subject>
                "Equipe de psychologue à domicile" ;
        <http://vocab.sindice.net/csv/rowPosition>
                "117" ;
        <http://www.virtual-assembly.org/ontologies/1.0/pair#administrativeName>
                "Aurore équipe mobile" ;
        <http://www.virtual-assembly.org/ontologies/1.0/pair#arrivalDate>
                "01/10/2016" ;
        <http://www.virtual-assembly.org/ontologies/1.0/pair#building>
                "Oratoire" ;
        <http://www.virtual-assembly.org/ontologies/1.0/pair#room>
                "1er étage" ;
        <http://www.virtual-assembly.org/ontologies/1.0/pair#status>
                "Contribue" ;
        <http://xmlns.com/foaf/0.1/familyName>
                "Auffret" ;
        <http://xmlns.com/foaf/0.1/givenName>
                "Marianne" ;
        <http://xmlns.com/foaf/0.1/img>
                <http://www.larchipel.paris/wp-content/uploads/2014/10/logo-sans-rup-2012.jpg> ;
        <http://xmlns.com/foaf/0.1/mbox>
                "m.auffret@aurore.asso.fr" ;
        <http://xmlns.com/foaf/0.1/name>
                "Equipe mobile" .
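The direct mapping performed by CSVImporterApp can be sketched roughly as follows. This is only a minimal Python illustration, not the actual implementation; the prefix table and the forged default namespace are assumptions made for the example.

```python
# Minimal sketch of the direct CSV-to-RDF mapping (NOT the actual
# CSVImporterApp code): header cells are property URIs, possibly
# abbreviated with well-known prefixes; each data row becomes a
# resource under the base URI.
import csv
import io

# A few well-known prefixes (assumed; the real tool knows more).
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dct": "http://purl.org/dc/terms/",
}

def expand(term, default_ns="urn:gv/contacts/column/"):
    """Expand an abbreviated Turtle URI; otherwise forge one from the header."""
    if ":" in term:
        prefix, local = term.split(":", 1)
        if prefix in PREFIXES:
            return PREFIXES[prefix] + local
    return default_ns + term.replace(" ", "_")

def csv_to_turtle(text, base="urn:gv/contacts", sep=","):
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    props = [expand(h.strip()) for h in rows[0]]
    lines = []
    for i, row in enumerate(rows[1:], start=1):
        subject = "<%s/row/%d>" % (base, i)
        for prop, cell in zip(props, row):
            if cell.strip():  # skip empty trailing columns
                lines.append('%s <%s> "%s" .' % (subject, prop, cell.strip()))
    return "\n".join(lines)

print(csv_to_turtle("rdfs:label,foaf:mbox\nEquipe mobile,m.auffret@aurore.asso.fr"))
```

A real implementation would also type each row (csv:Row), record the row position, and merge the adding-details-each_row.ttl triples.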

Step 2: transform the graph with SPARQL

A second step can then be a SPARQL query that, for instance, splits the Person and Organization fields that are mixed in a single CSV row. Here is a typical SPARQL query for extracting persons (the foaf:Organization instances were already created in the first step). We copy the Person-specific properties into a new foaf:Person instance, and the remaining properties are copied to a newly forged URI. Note that we must create a triple connecting the person and the organization, and that we need SPARQL Update to manipulate graphs.

prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dct: <http://purl.org/dc/terms/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix cco: <http://purl.org/ontology/cco/core#>
prefix pair: <http://virtual-assembly.org/pair_v2#>

DELETE { GRAPH <urn:gv/contacts> {?ORGA ?P ?O .}}
INSERT { GRAPH <urn:/Voisins/new> {
  ?PERSON a foaf:Person ;
          foaf:familyName ?FN ;
          foaf:givenName ?GN ;
          foaf:mbox ?MB ;
          foaf:phone ?PH .

  ?newURI a foaf:Organization .
  ?newURI ?P ?O .
  ?newURI pair:hasResponsible ?ORGA .
}}
WHERE { GRAPH <urn:gv/contacts> {
  ?PERSON a <http://vocab.sindice.net/csv/Row> ;
          foaf:familyName ?FN ;
          foaf:givenName ?GN ;
          foaf:mbox ?MB ;
          foaf:phone ?PH .

  ?ORGA ?P ?O .
  FILTER( ?P != foaf:familyName &&
          ?P != foaf:givenName &&
          ?P != foaf:mbox &&
          ?P != foaf:phone )
   BIND (URI(CONCAT( STR(?ORGA), "-org") ) AS ?newURI)
}}

This can conveniently be run on the command line:

java -cp $JARS tdb.tdbupdate --loc=TDB \
     --update=~/src/assemblee-virtuelle.github.io/grands-voisins-v2/split-person-orga.rq

Step 3 (optional): generate a skeleton ontology from instance data

Using the Euler / Eye rule engine for Turtle and the N3 rule language:

eye --nope --query rules-documentorQ.n3 rules-documentor.n3 \
     ~/data/Non_public/Voisins-sur-site.csv.ttl ~/ontologies/foaf.n3 \
  > ~/src/assemblee-virtuelle.github.io/grands-voisins-v2/onto.ttl

Step 4: dispatching triples to user-specific named graphs

Step 4 dispatches triples to user-specific named graphs, so that each user has complete edit and delete rights on her data. For this there is the application UserNamedGraphsDispatcherApp, run on the command line:

java -cp $JARS deductions.runtime.sparql_cache.algos.UserNamedGraphsDispatcherApp urn:/Voisins/new
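The source of UserNamedGraphsDispatcherApp is not shown on this page; here is a minimal sketch of the idea in Python, under the assumption that each subject is linked to its owner through a creator predicate (the predicate choice and graph URI scheme below are hypothetical).

```python
# Hypothetical sketch of dispatching triples into user-specific named
# graphs (NOT the actual UserNamedGraphsDispatcherApp logic). We assume
# each subject is linked to its owner by a creator predicate; every
# triple is routed to the named graph of its subject's owner.
from collections import defaultdict

CREATOR = "http://purl.org/dc/terms/creator"  # assumed linking predicate

def dispatch(triples):
    """Return a {named_graph_uri: [triples]} partition of the input."""
    owner = {s: o for (s, p, o) in triples if p == CREATOR}
    graphs = defaultdict(list)
    for s, p, o in triples:
        graph = "urn:user/" + owner[s] if s in owner else "urn:orphan"
        graphs[graph].append((s, p, o))
    return dict(graphs)
```

With such a partition, each user's graph can be granted edit and delete rights independently of the others.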

Convert XML to RDF

Step 1: XML to raw RDF (direct mapping)

Prerequisite: the Gloze source installed in ~/src/Gloze .

BASE_URI="$HOME/mystuff"
XML_INPUT=/usr/share/accountsservice/interfaces/com.ubuntu.AccountsService.Input.xml
source ~/src/Gloze/setClasspath.sh
java -cp $LIBS \
    -Dgloze.lang=N3 \
    -Dgloze.base=$BASE_URI \
    -Dgloze.verbose=true \
  com.hp.gloze.GlozeURL \
    $XML_INPUT \
  > `basename $XML_INPUT.ttl`

Step 2: raw RDF to meaningful RDF

Meaningful RDF here means RDF obeying a well-known vocabulary and, when possible, referencing well-known URIs such as those of DBpedia (Linked Open Data). Let's take the example of a KML input, the sample from the Wikipedia KML page.

<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
  <name>New York City</name>
  <description>New York City</description>
  <Point>
    <coordinates>-74.006393,40.714172,0</coordinates>
  </Point>
</Placemark>
</Document>
</kml>

Here is what Gloze produces:

@base          <file:///home/jmv/mystuff> .
</home/jmv/mystuff>  <http://www.opengis.net/kml/2.2#kml>
                [ <http://www.opengis.net/kml/2.2#Document>
                          [ <http://www.opengis.net/kml/2.2#Placemark>
                                    [ <http://www.opengis.net/kml/2.2#Point>
                                              [ <http://www.opengis.net/kml/2.2#coordinates>
                                                        "-74.006393,40.714172,0" ] ;
                                      <http://www.opengis.net/kml/2.2#description>
                                              "New York City" ;
                                      <http://www.opengis.net/kml/2.2#name>
                                              "New York City"
                                    ] ] ] .

As you can see, every occurrence of an XML tag becomes an occurrence of an RDF predicate. Between predicates, blank nodes are inserted, introduced by the [] notation. We could have given Gloze the XML Schema of KML, but I did not bother with that.

Our target vocabulary will be WGS84 Geo RDF (the W3C Geo ontology). We could have chosen the GeoNames ontology: http://www.geonames.org/ontology/documentation.html

Here are some N3 rules. N3 is a language similar to SPARQL; in fact, N3 was a main inspiration for SPARQL.

A KML Placemark becomes a Geo SpatialThing:

PREFIX kml: <http://www.opengis.net/kml/2.2#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
{?CONTAINER kml:Placemark ?PLACE} => {?PLACE a geo:SpatialThing}

A KML Placemark has a KML Point; the W3C Geo ontology allows a SpatialThing to have several points through the location property.

{?PLACE kml:Point ?POINT} => {?POINT a geo:SpatialThing . ?PLACE geo:location ?POINT.}

A kml:name becomes a rdfs:label.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
{?S  kml:name ?NAME } => { ?S rdfs:label ?NAME }
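The effect of such rules can be illustrated with a toy one-pass forward-chaining rewrite over an in-memory triple set. This is only an illustration of what the three rules above infer; Eye is a full N3 reasoner and works quite differently.

```python
# Toy forward-chaining illustration of the three N3 rules above
# (NOT how Eye works internally): each matched KML triple yields
# one or more Geo / RDFS triples.
KML = "http://www.opengis.net/kml/2.2#"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
A = "rdf:type"

def apply_rules(triples):
    inferred = set()
    for s, p, o in triples:
        if p == KML + "Placemark":   # {?CONTAINER kml:Placemark ?PLACE} => ...
            inferred.add((o, A, GEO + "SpatialThing"))
        if p == KML + "Point":       # {?PLACE kml:Point ?POINT} => ...
            inferred.add((o, A, GEO + "SpatialThing"))
            inferred.add((s, GEO + "location", o))
        if p == KML + "name":        # {?S kml:name ?NAME} => ...
            inferred.add((s, RDFS + "label", o))
    return inferred
```

Running this over the blank-node structure Gloze produced for the New York City sample yields the expected geo:SpatialThing, geo:location and rdfs:label triples.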

description

TO BE CONTINUED

Convert RDF to XML

BASE_URI="$HOME/mystuff"
TURTLE_INPUT=http://jmvanel.free.fr/jmv.rdf
source ~/src/Gloze/setClasspath.sh
java -cp $LIBS \
    -Dgloze.lang=N3 \
    -Dgloze.base=$BASE_URI \
    -Dgloze.verbose=true \
    -Dgloze.order=seq \
  com.hp.gloze.GlozeURL \
    $TURTLE_INPUT \
  > `basename $TURTLE_INPUT`.xml
