xtriples-micro
is an implementation of an
XTriples processor that works without
an eXist database.
XTriples? In XTriples, instead of writing specialized programs in XSLT, XQuery, Python, etc. for extracting RDF triples from XML documents, we write configuration files containing selectors. These config files are evaluated by an XTriples processor, which returns RDF triples. Here's an example of such a configuration file:
<?xml-model uri="https://xtriples.lod.academy/xtriples.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<xtriples>
<configuration>
<vocabularies>
<vocabulary prefix="gods" uri="https://xtriples.lod.academy/examples/gods/"/>
<vocabulary prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
<vocabulary prefix="rdfs" uri="http://www.w3.org/2000/01/rdf-schema#"/>
<vocabulary prefix="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</vocabularies>
<triples>
<statement>
<subject prefix="gods">/@id</subject>
<predicate prefix="rdf">type</predicate>
<object prefix="foaf" type="uri">Person</object>
</statement>
<statement>
<subject prefix="gods">/@id</subject>
<predicate prefix="rdfs">label</predicate>
<object type="literal" lang="en">/name/english</object>
</statement>
<statement>
<subject prefix="gods">/@id</subject>
<predicate prefix="rdfs">label</predicate>
<object type="literal" lang="gr">/name/greek</object>
</statement>
<statement>
<subject prefix="gods">/@id</subject>
<predicate prefix="rdfs">seeAlso</predicate>
<object type="uri">/concat("http://en.wikipedia.org/wiki/", $currentResource/name/english)</object>
</statement>
</triples>
</configuration>
<collection uri="?select=[0-9]+.xml">
<resource uri="{//god}"/>
</collection>
</xtriples>
While the original XTriples processor requires an eXist database and applies a configuration only to the fixed set of XML files contained in it, the implementation at hand runs outside of a database, e.g., on a local set of documents. It can also be deployed on the famous SEED XML Transformer. This deployment gives you a lightweight microservice to which you can send a single XML document and a config file and get RDF triples in return.
TODO
This project offers an Oxygen framework that assists in writing XTriples configuration files and also provides transformation scenarios for applying a configuration to a single document or to a collection of documents. To install it, paste the following installation link into the installation dialog found in Help -> Install New Addons:
https://scdh.github.io/xtriples-micro/descriptor.xml
See the detailed description in the Wiki!
For using the XTriples engine in CI/CD pipelines or in downstream projects, installation of a released package is the way to go. The Wiki gives detailed instructions!
For playing around with XTriples and validating that it is a suitable
technology, you can also clone this repository. It comes with a fully
reproducible tooling environment
that installs all tools needed for running and testing in a
sandbox. You only need a Java development kit (JDK) installed. On
Debian-based systems, you can install it with sudo apt install default-jdk.
To set up the tooling environment, clone this repository, cd into your working copy, and run:
./mvnw package # Linux
or
mvnw.cmd package # Windows
This will download Saxon-HE etc. and generate wrapper scripts that set up the classpath for using them.
After running the command above, the wrapper scripts are in target/bin/. E.g., there are wrappers around Saxon-HE and Jena RIOT:
target/bin/xslt.sh -?
target/bin/riot.sh -h
There are XSLT stylesheets that do the work of evaluating an XTriples configuration file and applying it to XML documents.
xsl/extract.xsl extracts triples from an XML document given as source by applying a configuration passed in via the stylesheet parameter config-uri.
target/bin/xslt.sh -xsl:xsl/extract.xsl -s:test/gods/1.xml config-uri=$(realpath test/gods/configuration.xml)
The output should look like this:
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/2000/01/rdf-schema#label> "Aphrodite"@en .
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/2000/01/rdf-schema#label> "Ἀφροδίτη"@gr .
<https://xtriples.lod.academy/examples/gods/1> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://en.wikipedia.org/wiki/Aphrodite> .
If your result is polluted with debug messages, which are printed to stderr, you can append 2> /dev/null to silence them or use Saxon's -o: option to send the output to a file.
If you want another format, pipe the result to Jena RIOT like so:
target/bin/xslt.sh -xsl:xsl/extract.xsl -s:test/gods/1.xml config-uri=$(realpath test/gods/configuration.xml) | target/bin/riot.sh --out rdf/xml
Here's the result:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:j.0="http://xmlns.com/foaf/0.1/" >
<rdf:Description rdf:about="https://xtriples.lod.academy/examples/gods/1">
<rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Aphrodite"/>
<rdfs:label xml:lang="gr">Ἀφροδίτη</rdfs:label>
<rdfs:label xml:lang="en">Aphrodite</rdfs:label>
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
</rdf:Description>
</rdf:RDF>
This is the only transformation that makes sense to deploy as a microservice. See seed.
xsl/extract-collection.xsl takes a configuration as source document and applies it to the collection of XML documents given in /xtriples/collection/@uri, which is interpreted as a Saxon collection URI. See section Implementation of the Specs for details. This is compatible with the reference implementation.
Example:
target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:test/gods/configuration.xml
This will extract triples from all the god files in test/gods due to the collection URI <collection uri="?select=[0-9]+.xml">. It is a relative URI (current directory .) that is resolved against the configuration file, and the select query string is interpreted by the Saxon processor.
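As a sketch of how the query string can be varied: Saxon's directory collections accept further parameters next to select, for instance recurse. The following collection element is an assumption for illustration only (neither the pattern *.xml nor recurse=yes is taken from this repository's tests). It would ask Saxon to crawl all .xml files below the directory of the configuration file, including subdirectories.

```xml
<!-- hypothetical variant of the collection section, not part of test/gods -->
<collection uri="?select=*.xml;recurse=yes">
    <!-- each god element in each matched file becomes one resource -->
    <resource uri="{//god}"/>
</collection>
```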
xsl/extract-param-doc.xsl takes a configuration as source document and applies it to an XML document referenced by the source-uri stylesheet parameter.
target/bin/xslt.sh -xsl:xsl/extract-param-doc.xsl -s:test/gods/configuration.xml source-uri=$(realpath test/gods/1.xml)
- The content of <subject>, <predicate>, <object> and <condition> is evaluated as an XPath expression if and only if the content starts with a slash /. Before the expression is evaluated, it is prepended with $currentResource (or $externalResource respectively). E.g., /@id is evaluated as $currentResource/@id. In <condition> the XPath is constructed like this: xs:boolean($currentResource CONDITION).
- Keep the difference of document vs. resource in mind: Each document may contain multiple resources if /xtriples/collection/resource/@uri is used to unnest resources from a document. The variables $currentResource and $resourceIndex provide access to the resource and its index.
- This resource context is transparent to the underlying document. Thus, accessing parts of the document outside of the context subtree is possible: $currentResource/ancestor::TEI/teiHeader.
- The XPath evaluation uses namespaces made up from the prefix-to-URI mapping from the <vocabularies> section of the configuration file. Thus:
  - If you want to extract RDF from non-namespace XML sources, do not use the empty string prefix in the vocabularies, since that would bind the default namespace for XPath evaluation to this vocabulary URI.
  - Be careful about using the default namespace, since it is not compatible with the reference implementation. See below!
- Using BNodes may be a bit tricky. See these hints.
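The expansion rules above can be sketched with a single statement. The condition and the value it tests are made up for illustration, based on the element names of the gods example, and the position of <condition> inside <statement> is an assumption. The comments show the XPath that would be evaluated according to the rules above.

```xml
<statement>
    <!-- would be evaluated as: xs:boolean($currentResource/name/english = 'Aphrodite') -->
    <condition>/name/english = 'Aphrodite'</condition>
    <!-- would be evaluated as: $currentResource/@id -->
    <subject prefix="gods">/@id</subject>
    <!-- no leading slash, so the content is not evaluated as XPath -->
    <predicate prefix="rdfs">label</predicate>
    <!-- would be evaluated as: $currentResource/name/greek -->
    <object type="literal" lang="gr">/name/greek</object>
</statement>
```

Note that the condition is written as a comparison: since the expression is wrapped in xs:boolean(), a bare path to a text-bearing element would most likely fail to cast.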
This is a full implementation of the XTriples spec.
In addition to the specs, this implementation adds the following features:
- In addition to static ISO 639 language identifiers, object/@lang can also be an XPath expression that returns such a language identifier. This feature is handy for projects that record the language in their XML documents.
- By leaving out @prefix for a <vocabulary> or setting it to the empty string, the default namespace when evaluating XPath expressions binds to this vocabulary URI. Thus, when setting <vocabulary uri="http://www.tei-c.org/ns/1.0"/>, you can write XPaths like this: <object type="literal">//(teiHeader/fileDesc/titleStmt/title)[1]</object> without prefixing the element names. See test/config-02.xml for a self-contained test case. Evaluating it on the reference implementation fails, while the implementation at hand processes it correctly.
- It is possible to use your own functions in the XPath expressions in the <configuration> section: You can load an additional XSLT stylesheet by using the libraries (sequence of xs:anyURI) or libraries-csv (a string of comma-separated URIs) stylesheet parameters. Please note that you have to declare your function's visibility non-private and non-hidden, e.g., @visibility=public, cf. XSLT 3.0 TR.
  target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:... libraries-csv=$(realpath my-utils.xsl)
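A function library such as the my-utils.xsl passed above might look like the following minimal sketch. The namespace URI https://example.org/my-utils# and the function name my:wiki-url are hypothetical. Whether the processor expects a plain stylesheet module, as shown here, or a package is not stated in this document.

```xml
<!-- my-utils.xsl: hypothetical function library; namespace and function name are made up -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="https://example.org/my-utils#">

    <!-- visibility is public, i.e., neither private nor hidden -->
    <xsl:function name="my:wiki-url" as="xs:string" visibility="public">
        <xsl:param name="name" as="xs:string"/>
        <!-- percent-encode the name and append it to the Wikipedia base URL -->
        <xsl:sequence select="concat('http://en.wikipedia.org/wiki/', encode-for-uri($name))"/>
    </xsl:function>

</xsl:stylesheet>
```

Since XPath namespaces are taken from the <vocabularies> section, the function prefix would presumably have to be declared there before it can be used in a statement:

```xml
<!-- in <vocabularies>: assumed to bind the my prefix for XPath evaluation -->
<vocabulary prefix="my" uri="https://example.org/my-utils#"/>

<!-- in <triples>: would be evaluated as $currentResource/my:wiki-url(name/english) -->
<object type="uri">/my:wiki-url(name/english)</object>
```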
Because this implementation does not run inside an eXist database, the evaluation of the <collection> section of the configuration differs from the reference implementation. However, a full compatibility mode is available (see end of this section).
In contrast to the specs, /xtriples/collection/@uri is ignored when a single XML source document is passed to the processor, i.e., when using xsl/extract.xsl or xsl/extract-param-doc.xsl.
When using xsl/extract-collection.xsl, it is evaluated as a Saxon collection URI. It can thus be
- a directory URI with a select pattern for finding files (relative URIs are resolved against the evaluated configuration file), or
- a zip collection (zip, jar, docx), which will automatically be unpacked and crawled, or
- a collection catalog listing files to crawl, or
- your own collection type, provided you have written your own collection finder.
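For the collection catalog variant, a minimal sketch of such a catalog in Saxon's catalog format is shown below. The file names are assumptions chosen to resemble the gods test files. The relative href values would be resolved against the location of the catalog file itself.

```xml
<!-- hypothetical Saxon collection catalog; file names are made up for illustration -->
<collection stable="true">
    <doc href="1.xml"/>
    <doc href="2.xml"/>
</collection>
```

The @uri of the XTriples <collection> element would then point to this catalog file instead of a directory. Note that the root element of Saxon's catalog format happens to share its name with the XTriples <collection> element; the two are unrelated.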
Link based resource crawling and literal resource crawling are
supported exactly as in the reference implementation. In both modes,
there is no @uri
attribute present for the collection.
You can get full compatibility by setting the is-collection-uri stylesheet parameter to false. This way, the @uri attribute of each <collection> is read not as a Saxon collection URI but as a single document URI. Using this attribute, XPath based resource crawling with resources spread over multiple files is also supported.
You can evaluate the examples in test/gods
with
is-collection-uri=false
and by using the XML catalog in
test/catalog.xml
, which maps lod academy URIs to local files:
target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:test/gods/conf-NN.xml -catalog:test/catalog.xml is-collection-uri=false
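To illustrate how an XML catalog can map remote URIs to local files, here is a minimal sketch in the OASIS XML Catalogs format. It is an assumption about what such a mapping could look like; the actual content of test/catalog.xml may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical catalog; the actual test/catalog.xml may differ -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <!-- rewrite URIs of the remote gods examples to files below the catalog's directory -->
    <rewriteURI uriStartString="https://xtriples.lod.academy/examples/gods/"
                rewritePrefix="gods/"/>
</catalog>
```

With a mapping of this kind, a configuration that references documents by their lod academy URIs can be evaluated against the local working copy without network access.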
There's only one output format: NTriples. In a microservice architecture, converting to other formats is done in a converter service. NTriples is the RDF serialization of choice because the response bodies of multiple requests can simply be concatenated into one graph.
Run tests with
target/bin/test.sh
or
source target/bin/classpath.sh # needed only once per shell session
ant -Dcatalog=test/catalog.xml test
This is distributed under the MIT license.
The test cases directly in test/gods/ were taken from the original eXist-db implementation, which is licensed under the terms of the MIT license.