xtriples-micro – An XTriples Processor for Micro Services and Local Usage

xtriples-micro is an implementation of an XTriples processor that works without an eXist database.

XTriples? In XTriples, instead of writing specialized programs in XSLT, XQuery, Python, etc. for extracting RDF triples from XML documents, we write configuration files containing selectors. These config files are evaluated by an XTriples processor, which returns RDF triples. Here's an example of such a configuration file:

<?xml-model uri="https://xtriples.lod.academy/xtriples.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<xtriples>
    <configuration>
        <vocabularies>
            <vocabulary prefix="gods" uri="https://xtriples.lod.academy/examples/gods/"/>
            <vocabulary prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
            <vocabulary prefix="rdfs" uri="http://www.w3.org/2000/01/rdf-schema#"/>
            <vocabulary prefix="foaf" uri="http://xmlns.com/foaf/0.1/"/>
        </vocabularies>
        <triples>
            <statement>
                <subject prefix="gods">/@id</subject>
                <predicate prefix="rdf">type</predicate>
                <object prefix="foaf" type="uri">Person</object>
            </statement>
            <statement>
                <subject prefix="gods">/@id</subject>
                <predicate prefix="rdfs">label</predicate>
                <object type="literal" lang="en">/name/english</object>
            </statement>
            <statement>
                <subject prefix="gods">/@id</subject>
                <predicate prefix="rdfs">label</predicate>
                <object type="literal" lang="gr">/name/greek</object>
            </statement>
            <statement>
                <subject prefix="gods">/@id</subject>
                <predicate prefix="rdfs">seeAlso</predicate>
                <object type="uri">/concat("http://en.wikipedia.org/wiki/", $currentResource/name/english)</object>
            </statement>
        </triples>
    </configuration>
    <collection uri="?select=[0-9]+.xml">
	   <resource uri="{//god}"/>
    </collection>
</xtriples>

While the original XTriples processor requires an eXist database and applies a configuration only to the fixed set of XML files contained in it, the implementation at hand runs outside of a database, e.g., on a local set of documents. It can also be deployed on the famous SEED XML Transformer. This deployment gives you a lightweight microservice to which you can send a single XML document and a config file and get RDF triples in return.

Getting started

Microservice

TODO

Oxygen

This project offers an Oxygen framework that assists in writing XTriples configuration files and also provides transformation scenarios for applying a configuration to a single document or a collection of documents. To install it, use the following installation link in the installation dialog found in Help -> Install New Addons:

https://scdh.github.io/xtriples-micro/descriptor.xml

See the detailed description in the Wiki!

XSLT Package

For using the XTriples engine in CI/CD pipelines or in downstream projects, installation of a released package is the way to go. The Wiki gives detailed instructions!

Playing around and Testing

For playing around with XTriples and checking whether it is a suitable technology for your project, you can also clone this repository. It comes with a fully reproducible tooling environment that installs all tools needed for running and testing in a sandbox. You only need a Java development kit (JDK) installed. On Debian-based systems, you can install one with sudo apt install default-jdk.

To set up the tooling environment, clone this repository, cd into your working copy and run:

./mvnw package  # Linux

or

mvnw.cmd package   # Windows

This will download Saxon-HE etc. and generate wrapper scripts that set up the classpath for using them.

After running the command above, the wrapper scripts are in target/bin/. E.g., there are wrappers around Saxon-HE and Jena RIOT:

target/bin/xslt.sh -?
target/bin/riot.sh -h

Extracting RDF Triples

There are XSLT stylesheets that do the work of evaluating an XTriples configuration file and applying it to XML documents.

extract.xsl

xsl/extract.xsl extracts triples from an XML document given as source by applying a configuration passed in via the stylesheet parameter config-uri.

target/bin/xslt.sh -xsl:xsl/extract.xsl -s:test/gods/1.xml config-uri=$(realpath test/gods/configuration.xml)
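
For orientation, an input document that fits the selectors used in the example configuration (/@id, /name/english, /name/greek) might look like the following. This is a hypothetical minimal sketch inferred from those selectors; the actual test/gods/1.xml may contain further elements or a different root.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal input, inferred from the selectors in the example configuration -->
<god id="1">
    <name>
        <!-- selected by /name/english -->
        <english>Aphrodite</english>
        <!-- selected by /name/greek -->
        <greek>Ἀφροδίτη</greek>
    </name>
</god>
```

With such a document, /@id would supply the last segment of the subject URI, and the two name elements would supply the labels.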

The output should look like this:

<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>  .
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/2000/01/rdf-schema#label> "Aphrodite"@en  .
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/2000/01/rdf-schema#label> "Ἀφροδίτη"@gr  .
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://en.wikipedia.org/wiki/Aphrodite>  .

Debug messages are printed to stderr. If your result is polluted with them, you can append 2> /dev/null to silence them or use Saxon's -o: option to send the output to a file.

If you want another format, pipe the result to Jena RIOT like so:

target/bin/xslt.sh -xsl:xsl/extract.xsl -s:test/gods/1.xml config-uri=$(realpath test/gods/configuration.xml) | target/bin/riot.sh --out rdf/xml

Here's the result:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:j.0="http://xmlns.com/foaf/0.1/" >
  <rdf:Description rdf:about="https://xtriples.lod.academy/examples/gods/1">
    <rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Aphrodite"/>
    <rdfs:label xml:lang="gr">Ἀφροδίτη</rdfs:label>
    <rdfs:label xml:lang="en">Aphrodite</rdfs:label>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdf:Description>
</rdf:RDF>

This is the only transformation that it makes sense to deploy on a microservice. See seed.

extract-collection.xsl

xsl/extract-collection.xsl takes a configuration as source document and applies it to the collection of XML documents given in /xtriples/collection/@uri, which is interpreted as a Saxon collection URI. See section Implementation of the Specs for details. This is compatible with the reference implementation.

Example:

target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:test/gods/configuration.xml

This will extract triples from all the god files in test/gods due to the collection URI <collection uri="?select=[0-9]+.xml">. It is a relative URI, resolved against the location of the configuration file, and the select query string is interpreted by the Saxon processor.

extract-doc-param.xsl

xsl/extract-doc-param.xsl takes a configuration as source document and applies it to an XML document referenced by the source-uri stylesheet parameter.

target/bin/xslt.sh -xsl:xsl/extract-param-doc.xsl -s:test/gods/configuration.xml source-uri=$(realpath test/gods/1.xml)

Writing configurations

  1. The content of <subject>, <predicate>, <object> and <condition> is evaluated as an XPath expression, if and only if the content starts with a slash /. Before the expression is evaluated, it is prepended with $currentResource (or $externalResource respectively). E.g., /@id is evaluated as $currentResource/@id. In <condition> the XPath is constructed like this: xs:boolean($currentResource CONDITION ).
  2. Keep the difference between document and resource in mind: each document may contain multiple resources if /xtriples/collection/resource/@uri is used to unnest resources from a document. The variables $currentResource and $resourceIndex provide access to the resource and its index.
  3. This resource context is transparent to the underlying document. Thus, accessing parts of the document outside of the context subtree is possible: $currentResource/ancestor::TEI/teiHeader.
  4. The XPath evaluation uses namespaces made up from the prefix-to-URI mapping from the <vocabularies> section of the configuration file. Thus:
    • If you want to extract RDF from non-namespace XML sources, do not use the empty string prefix in the vocabularies, since that would bind the default namespace for XPath evaluation to this vocabulary URI.
    • Be careful about using the default namespace, since it is not compatible with the reference implementation. See below!
  5. Using BNodes may be a bit tricky. See these hints.
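
The evaluation rules from the list above can be illustrated with a single statement. This is a sketch under assumptions: the placement of <condition> as the first child of <statement> and the use of rdfs:comment are chosen for illustration, and the gods and rdfs prefixes are expected to be declared in <vocabularies> as in the example configuration.

```xml
<statement>
    <!-- Starts with a slash, so it is wrapped as
         xs:boolean($currentResource/name/english = "Aphrodite");
         a comparison is used so that the expression yields a boolean -->
    <condition>/name/english = "Aphrodite"</condition>
    <!-- Starts with a slash: evaluated as $currentResource/@id -->
    <subject prefix="gods">/@id</subject>
    <!-- No leading slash: not evaluated as XPath, used as given -->
    <predicate prefix="rdfs">comment</predicate>
    <!-- Starts with a slash: evaluated as $currentResource/concat(...),
         with $resourceIndex referring to the index of the current resource -->
    <object type="literal" lang="en">/concat("Resource no. ", $resourceIndex)</object>
</statement>
```

Only resources for which the condition holds would produce a triple from this statement.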

Implementation of the Specs

This is a full implementation of the XTriples spec.

Additional Features

In addition to the specs, this implementation adds the following features:

  1. In addition to static ISO 639 language identifiers, object/@lang can also be an XPath expression that returns such a language identifier. This feature is handy for projects that record the language in their XML documents.
  2. By omitting @prefix for a <vocabulary> or setting it to the empty string, the default namespace for evaluating XPath expressions is bound to this vocabulary URI. Thus, when setting <vocabulary uri="http://www.tei-c.org/ns/1.0"/>, you can write XPaths like this: <object type="literal">//(teiHeader/fileDesc/titleStmt/title)[1]</object> without prefixing the element names. See test/config-02.xml for a self-contained test case. Evaluating it on the reference implementation fails, while the implementation at hand processes it correctly.
  3. It is possible to use your own functions in the XPath expressions in the <configuration> section: you can load an additional XSLT stylesheet by using the libraries (sequence of xs:anyURI) or libraries-csv (a string of comma-separated URIs) stylesheet parameters. Please note that you have to declare your function's visibility as non-private and non-hidden, e.g., @visibility=public, cf. XSLT 3.0 TR.
    target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:...  libraries-csv=$(realpath my-utils.xsl)
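
A library stylesheet such as the my-utils.xsl passed above could be as small as the following sketch. The namespace URI https://example.org/my-utils and the function my:upper-label are hypothetical names chosen for illustration; only the need for a non-private, non-hidden visibility is taken from the description above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- my-utils.xsl: hypothetical function library for use in XTriples selectors -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="https://example.org/my-utils"
    exclude-result-prefixes="#all">

    <!-- visibility="public" makes the function neither private nor hidden -->
    <xsl:function name="my:upper-label" as="xs:string" visibility="public">
        <xsl:param name="name" as="item()?"/>
        <!-- string() maps an empty sequence to "", so a missing name yields "" -->
        <xsl:sequence select="upper-case(normalize-space(string($name)))"/>
    </xsl:function>

</xsl:stylesheet>
```

Since namespaces for XPath evaluation are taken from <vocabularies>, calling the function from a configuration would presumably require a matching entry such as <vocabulary prefix="my" uri="https://example.org/my-utils"/>, after which a selector like <object type="literal">/my:upper-label(name/english)</object> could be written. This wiring is an assumption derived from the rules described in this document, not verified against the implementation.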

Collections

Because this implementation does not run inside an eXist database, the evaluation of the <collection> section of the configuration differs from the reference implementation. However, a full compatibility mode is available (see the end of this section).

In contrast to the specs, /xtriples/collection/@uri is ignored when a single XML source document is passed to the processor, i.e., when using xsl/extract.xsl or xsl/extract-param-doc.xsl.

When using xsl/extract-collection.xsl, it is evaluated as a Saxon collection URI. It can thus be

  • a directory URI with a select pattern for finding files (relative URIs are resolved against the evaluated configuration file),
  • a zip collection (zip, jar, docx), which will automatically be unpacked and crawled,
  • a collection catalog listing the files to crawl, or
  • your own collection type, provided you have written your own collection finder.
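
As an illustration of the collection catalog option, a Saxon collection catalog is a small XML file that lists the documents to crawl. The sketch below assumes a hypothetical file test/gods/catalog-gods.xml placed next to the god files; the name 2.xml is likewise assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical Saxon collection catalog; href values are resolved
     relative to the location of this catalog file -->
<collection stable="true">
    <doc href="1.xml"/>
    <doc href="2.xml"/>
</collection>
```

A configuration in the same directory could then refer to it with <collection uri="catalog-gods.xml">, keeping the <resource> child unchanged. Compared to a select pattern, a catalog gives explicit control over which files are processed.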

Link based resource crawling and literal resource crawling are supported exactly as in the reference implementation. In both modes, there is no @uri attribute present for the collection.

You can get full compatibility by setting the is-collection-uri stylesheet parameter to false. The @uri attribute of each <collection> is then read not as a Saxon collection URI, but as a single document URI. With this setting, XPath based resource crawling with resources spread over multiple files is also supported.

You can evaluate the examples in test/gods with is-collection-uri=false and by using the XML catalog in test/catalog.xml, which maps lod academy URIs to local files:

target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:test/gods/conf-NN.xml -catalog:test/catalog.xml is-collection-uri=false
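
A configuration intended for this compatibility mode might be structured like the following sketch. The example.org URIs and file names are hypothetical placeholders, not taken from the test suite, and the <configuration> section is elided.

```xml
<xtriples>
    <!-- <configuration> with vocabularies and triples goes here -->

    <!-- With is-collection-uri=false, each @uri names exactly one document -->
    <collection uri="https://example.org/data/olympians.xml">
        <!-- XPath based resource crawling: every god element is one resource -->
        <resource uri="{//god}"/>
    </collection>
    <!-- A second collection covers resources stored in another file -->
    <collection uri="https://example.org/data/titans.xml">
        <resource uri="{//god}"/>
    </collection>
</xtriples>
```

If the documents are not reachable under their public URIs, an XML catalog passed via -catalog: can map them to local files, as done with test/catalog.xml in the command above.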

Output: NTriples

There's only one output format: NTriples. In a microservice architecture, converting to other formats is done in a converter service. NTriples is the RDF serialization of choice because the response bodies of multiple requests can simply be concatenated into one graph.

Development

Run tests with

target/bin/test.sh

or

source target/bin/classpath.sh # needed only once per shell session
ant -Dcatalog=test/catalog.xml test

License

This is distributed under the MIT license.

The test cases directly in test/gods/ were taken from the original eXist-db implementation, which is licensed under the terms of the MIT license.
