This application provides two parts:
- A CLI tool which allows you to convert a dataset to a fragmented version following the LDES and TREE specifications.
- A webservice which allows you to query a fragmented dataset and add resources to it.
Run npm install
in order to install the necessary libraries.
In order to use the fragmentation CLI tool, this application provides the fragment-dataset
npm script.
The CLI tool expects the following options:
--config <config_file.json>
: this required option allows you to provided a json configuration file which describes how your dataset should be interpreted.--fragmenter <fragmenter>
: the type of fragmenter which should be used to fragment the provided dataset, the default fragmenter is:time-fragmenter
.--output <output_folder>
: the output folder in which the fragmented version of the provided dataset should be stored.
As a parameter, the CLI tool expects a file containing the dataset which should be fragmented.
Currently, the CLI tool supports three different types of input datasets:
- A
.csv
dataset. When using a csv file as input dataset, the provided json configuration file should contain how the different csv fields should be mapped to RDF predicates. - An index dump of the IPFS Search engine (https://ipfs-search.com/) which can retrieved from https://github.com/ipfs-search/ipfs-search-extractor. When using this type of dataset you can sepcify the predicates which should be used in the json config file.
- A standard text-file. When using a standard text file as input dataset, the CLI tool will parse each line as a seperate resource and use the contents of the line as object values for a specified predicate in the json configuration file.
The json configuration file should always contain the following key/value pairs:
stream: <stream_uri>
: the URI of the stream in which the resources of the dataset should be stored.resourceType: <resource_type_uri>
: the URI describing the type which should be attributed to the resources.resourceIdPrefix: <resource_id_prefix_uri
: the URI which should be prepended to the resource ID's.
When using a standard text file, the configuration file should also contain the following key/value pair:
propertyType: <predicate_uri>
: this describes the predicate which will be used in combination with the resource value.
When using a csv dataset as input, the configuration file should additionally contain the following options:
resourceIdField: <resource_id_field>
: this option allows you to specify the csv field which should be used as the resource id.propertyMappings: <property_mappings>
: this is a json object describing how the different csv fields should be mapped to RDF predicates of the resource.
When using an IPFS export as input, the configuration file should also contain the following key/value pair:
propertyType: <predicate_uri>
: this describes the uri which will be prepended to the hashes contained in the IPFS dump.
This repository contains a default-config.json
which is a configuration suitable for loading a dataset containing a list of streetnames. In order to run the CLI tool with this configuration, execute npm run fragment-dataset -- -c default-config.json -o data/streetnames -f prefix-tree-fragmenter <streetname_dataset>
.
The resulting dataset will be stored in the data/streetnames
folder. This example applies a prefix-tree-fragmenter to the dataset, but you can also replace this by a time-fragmenter.
This repository also contains a csv-config.json
as an example on how to write a configuration when fragmenting a csv dataset. This configuration can be used in combination with a GroupLens MovieLens dataset, such as https://files.grouplens.org/datasets/movielens/ml-latest-small.zip. The fragmenter can be applied on the movies.csv
file supplied in dataset archive. You can run the CLI tool with this configuration by executing npm run fragment-dataset -- -c csv-config.json -o data/movies -f prefix-tree-fragmenter <movie_dataset.csv>
. The resulting dataset will be stored in the data/movies
folder.
After having fragmented a dataset, you can query it using the web service. The web service in this repository can be run locally or using the supplied docker configuration:
-
In order to run the service locally, run the
npm start
script. The service will be exposed on port 3000. -
Using the supplied docker-compose and docker-compose.dev configurations, you can run the service in a docker container. By default, the service will be exposed on port 8888, you can configure how the data is mounted in the compose files.
Using the DATA_FOLDER
environment variable, you can configure in which base data folder the service will look for the fragmented datasets.
The web service provides the following two endpoints:
-
GET /:folder/:subfolder?/:nodeId
: this endpoint allows you to query a specific node represented in an RDF format to your liking. Using the HTTP Accept header, you can provide which representation of the data you would like to receive. Typically, the view node of the dataset is located in/:folder/1
while the other nodes are additionally stored in subfolders. -
POST /:folder
allows you to add a new resource to a dataset located infolder
. The post body can containing the resource in any RDF format to your liking. The post body format should be supplied using theContent-Type
HTTP header. This endpoints expects the following query parameters:resource: <resource_id>
: the id of the resource which is to be addedfragmenter: <fragmenter_type>
: the type of fragmenter to use, defaults totime-fragmenter
. The other option isprefix-tree-fragmenter
.stream: <stream_id>
: the id of the stream the resource should be stored in.relation-path: <relation_path>
: the path on which the relations are defined.
In order to add a movie to the movie-dataset, you can execute the following POST request: POST /movies?resource=https://example.org/movies/800005&fragmenter=prefix-tree-fragmenter&stream=http://mu.semte.ch/services/ldes-prefix-fragmenter/movies&relation-path=https://example.org/name
.
The body can for example contain a turtle document:
<https://example.org/movies/800005> a <https://example.org/Movie>;
<https://example.org/name> "My new movie";
<https://example.org/genres> "Comedy|Romance|Action".
In this case, the supplied Content-Type header should be text/turtle
.
fragmenters/Fragmenter.ts
provides a base abstract class which can be used to implement new types of fragmenters. It provides the following methods:
constructNewNode
constructs a new node with an incremented idfileForNode
returns the file location of a node given a node iddetermineSubFolder
determines the folder the node should be located inshouldCreateNewPage
determines whether a node has reached its limit in terms of resource countaddResource
is an abstract method which is responsible to add a new resource to a dataset and is the main method which should be implemented by new fragmenters.
fragmenters/TimeFragmenter.ts
implements a versioning time-based LDES fragmenter. When adding a resource to a dataset using this fragmenter, it adds a timestamp and a version to the resource. The relations it uses are GreaterThanRelations.
fragmenters/PrefixTreeFragmenter.ts
stores resources in a prefix-tree based dataset. The relations it uses are PrefixRelations.
The cache stores a list of nodes in combination with their modified
status. It also provides a flush
function which is responsible for writing back all modified nodes.
Dataset transformers provide a way in order to convert Readable text streams to a stream of Resource
instances each containing an id and a triplestore. dataset-transformers/dataset-transformer.ts
provides an interface containing a transform
method which emits a stream of resources based on a readable stream and a configuration.
default-transformer.ts
is an implementation which is to be used for ordinary text files and converts each line to a resource.
csv-transformer.ts
reads each line of a provided csv filestream and converts it based on a csv json configuration.
These transformers are mainly used by the CLI tool which can be found in fragment-dataset.ts
Prefix tree implementation based on https://github.com/Dexagod/linked_data_tree.