This tool allows researchers to create citable subsets from CSV data. Arbitrary CSV files can be uploaded, and a Web interface supports users in selecting specific subsets. The system automatically monitors the subset creation process and assigns a persistent identifier (PID) to the subset. CSV data sets can be updated (records can be added, deleted or modified) and all data is versioned. Thus earlier versions of a specific subset can be retrieved even as the data evolves.
Sharing research data is becoming increasingly important, as it enables peers to validate and reproduce data-driven experiments. As researchers work with different data sets and highly specific individual subsets, the knowledge of how subsets have been created is worth preserving. The most common way of creating a subset of a potentially large data set is to make a selection by filtering out those records which are not needed and only including data which fulfils given criteria.
The generic approach to performing such filter operations is to use a query language which allows describing which data should be included and which records should be omitted. SQL is such a general purpose query language. For end users, interfaces exist which hide the complexity of query languages by providing forms where users can make their selections visually or by entering appropriate values. The QueryStore stores these queries (hence the name) and makes them available for later reuse. Whenever a researcher uses a query (via an interface) to create a subset, the parameters, their order and the sortings applied can be stored. Based on temporal metadata which is collected automatically, the query can be re-executed.
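For illustration, a subset of this kind boils down to a filtered, sorted SQL query, as in the following JDBC sketch. The connection URL, table and column names are hypothetical placeholders; the filter values and the sort order are exactly the parameters the QueryStore records.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SubsetExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection and table, for illustration only.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/experiments", "user", "secret");
        PreparedStatement ps = con.prepareStatement(
                "SELECT * FROM measurements "
              + "WHERE instrumentName = ? AND temperature > ? "
              + "ORDER BY recordedAt ASC");
        ps.setString(1, "temperatureSensor"); // filter: key/value pair
        ps.setDouble(2, 20.0);                // filter: threshold value
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("instrumentName"));
            }
        }
        con.close();
    }
}
```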
The QueryStore is implemented in Java and provides an API for the most common tasks. It uses Hibernate to store the entities. The following features are currently implemented in the API:
- Create new queries
- Add metadata about users
- Add descriptive text
- Attach the persistent identifier of the data source (details below)
- Generate a persistent identifier for the query itself (details below)
- Append a hash of the result set
- Calculate a hash of the query to detect duplicate queries
- Add arbitrary filters in a key-value fashion (e.g. 'instrumentName';'temperatureSensor'). This is used to map the interface input fields to their respective entered values.
- Add arbitrarily many sortings, either in ascending or descending order
- Detect query duplicates
- Create timestamps automatically for each insert and update.
- Full audit log.
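The following sketch shows how these features might be combined. All class and method names are illustrative assumptions, not the actual API of the module:

```java
// All names below are assumptions for illustration; the real API may differ.
QueryStore store = new QueryStore();

Query q = store.createQuery();
q.setDescription("Sensor readings above 20 degrees");   // descriptive text
q.setUser("alice");                                     // user metadata
q.setDataSourcePid("21.345/dataset-1");                 // PID of the data source
q.addFilter("instrumentName", "temperatureSensor");     // key/value filter
q.addSorting("recordedAt", SortOrder.ASCENDING);        // sorting

store.persist(q);                   // insert timestamp is created automatically
String queryPid = q.getPid();       // PID generated for the query itself
String hash = q.getResultSetHash(); // hash of the result set, once executed
```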
Identifying datasets is essential for future reuse. Storing exported datasets on a Web server and providing the URL for their retrieval is not enough. A simple change in the directory structure of the server renders a URL obsolete and the dataset cannot be retrieved again. Hence a more sophisticated way of referencing datasets is needed. The concept of persistent identifiers (PIDs) deals with this problem. Several different PID systems exist and they all follow similar principles. A resolver service is introduced which can resolve an identifier to its URL. The actual URL may change over the course of time, but the identifier always remains the same (it is persistent). Thus whenever a digital object has to be moved to a different location (e.g. server filesystem updates or similar events), its location may be updated and the identifier can then refer to its new location.
The PID service is implemented in Java and provides an API for the most common tasks. It uses Hibernate to store the entities. The following features are currently implemented in the API:
- Create PIDs of the form prefix/identifier.
- Create alphabetical identifiers of arbitrary length (e.g. zjqpU).
- Create alphanumeric identifiers of arbitrary length (YouTube style, e.g. qsn4zPVRA7a0).
- Create numeric identifiers of arbitrary length (e.g. 08652).
- Store one URI with each identifier
- Update URIs (the identifier can neither be deleted nor updated via the API)
- Create organizations and prefixes (prefixes are unique).
- One organization can mint arbitrarily many identifiers per prefix
- Identifiers are unique within one prefix and therefore the complete PID is unique.
- Resolve PIDs to URLs
- Retrieve latest PID per organization
- Retrieve latest PID in the database.
- Print details
- Full audit log
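A minimal sketch of how such a PID workflow could look. All class and method names are illustrative assumptions, not the actual API; the prefix and URLs are placeholders:

```java
// All names below are assumptions for illustration; the real API may differ.
PidService pids = new PidService();

Organization org = pids.createOrganization("Example University");
pids.createPrefix(org, "21.345");                    // prefixes are unique

// alphanumeric identifier of length 12 (YouTube style)
Pid pid = pids.createAlphanumericPid("21.345", 12);  // e.g. 21.345/qsn4zPVRA7a0
pids.setUri(pid, "http://example.org/datasets/42");

// the dataset moves: update the URI, the identifier stays the same
pids.updateUri(pid, "http://example.org/archive/datasets/42");

String url = pids.resolve(pid.toString());           // resolve the PID to a URL
```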
Users should be able to upload CSV files which they want to make citable. The CSV files are previously unknown to the server; therefore, the files are analyzed and a new table schema is created automatically.
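A minimal sketch of how an uploaded file could be inspected with Super CSV (one of the listed dependencies). The file name is a placeholder, and whether the module analyzes files exactly this way is an assumption:

```java
import java.io.FileReader;
import java.util.List;
import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

public class CsvInspector {
    public static void main(String[] args) throws Exception {
        // Read an unknown CSV file; the header yields the column names
        // from which a new table schema could be derived.
        try (ICsvListReader reader = new CsvListReader(
                new FileReader("upload.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            String[] header = reader.getHeader(true); // first line = column names
            for (String column : header) {
                System.out.println("column: " + column);
            }
            List<String> row;
            while ((row = reader.read()) != null) {
                // each row could be inspected here to infer column types
                System.out.println("row with " + row.size() + " fields");
            }
        }
    }
}
```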
The following scenarios are considered:
The user uploads a new file to the server. The user needs to specify a primary key; if no primary key is specified, updates cannot be detected automatically. The server then creates a new table schema and adds metadata.
The user appends data to an existing file. The file needs to have the same structure as the original file and may only contain new records. If a record already exists, an error is thrown.
A user provides a file which contains updated or new records. Existing records are identified via their primary key. When a primary key already exists in the database, the record gets updated and a new timestamp is inserted.
The user uploads a file where rows have been deleted. Records which exist in the database but are missing from the uploaded file are treated as deleted; since all data is versioned, earlier versions remain retrievable.
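The following simplified sketch illustrates how these scenarios could be handled via the primary key. The Record type and the merge logic are illustrative assumptions, not the actual implementation:

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MergeSketch {

    // Hypothetical record type: a primary key plus a modification timestamp.
    static class Record {
        String primaryKey;
        Timestamp lastModified;
    }

    /**
     * Updates and inserts are decided via the primary key; keys missing
     * from the upload are treated as deletions (older versions stay
     * retrievable because all data is versioned).
     */
    static void merge(Map<String, Record> stored, List<Record> uploaded) {
        Set<String> uploadedKeys = new HashSet<>();
        for (Record r : uploaded) {
            uploadedKeys.add(r.primaryKey);
            r.lastModified = Timestamp.from(Instant.now()); // new timestamp
            stored.put(r.primaryKey, r); // update if the key exists, else insert
        }
        // deletion scenario: rows absent from the uploaded file
        for (String key : stored.keySet()) {
            if (!uploadedKeys.contains(key)) {
                System.out.println("record " + key + " marked as deleted");
            }
        }
    }
}
```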
This project is a research prototype; the software is provided "as is", without any warranty of any kind. Feedback is welcome, but no support can be given at this point in time.
This project is written in Java and utilises, amongst other things, the following technologies:
- Java 1.8
- Maven 3
- MySQL 5.6
- Hibernate 4.3
- HikariCP
- Super CSV
- Apache Tomcat
- Apache Commons
- Jersey
- JSF
- PrimeFaces
- jbcrypt
The code is structured into several modules, each having its own dependencies.
Provides a simple interface as an alternative to the Web interface.
As the name implies, this module serves as a metamodule and allows deploying the complete project within an Apache Tomcat Servlet container. The module also has subfolders containing the SQL data to initialize the database and create the necessary users and permissions. The script "kill_tomcat.sh" is used to launch the Tomcat service (killing it first if it was running). The scripts mvn_compile_simple and mvn_compile_server are used for compiling the project and uploading it to the Apache server.
This module contains helper tools for handling CSV data.
The database module contains all classes related to database operations. The module consists of three packages:
The authentication module handles user authentication for the Web application. It provides an API for creating new users. Passwords are not stored in plain text but as salted hashes (see the sketch after the package descriptions).
The operations module comes with different classes containing functions for storing and retrieving data in the MySQL database. It is the data backend of this tool.
Simple helper classes, e.g. for String operations.
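The following sketch shows how password hashing with jBCrypt (one of the listed dependencies) typically looks; whether the authentication module uses exactly these calls is an assumption:

```java
import org.mindrot.jbcrypt.BCrypt;

public class PasswordExample {
    public static void main(String[] args) {
        // Hash a password with a freshly generated salt; only the hash
        // would be stored in the database, never the plain text.
        String hash = BCrypt.hashpw("secret", BCrypt.gensalt());

        // Verify a login attempt against the stored hash.
        boolean ok = BCrypt.checkpw("secret", hash);
        System.out.println("password valid: " + ok);
    }
}
```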
This module handles all interactions with the datatable plugin from the Web interface.
This module contains the alternative approach based on Git. It is used for evaluating the database solution in terms of storage demand and query execution time.
The PID module is a standalone PID service which is used for creating test identifiers that can be resolved locally.
The query store contains all the queries and their parameters including the metadata of the query execution.
The result set verification module computes hashes of result sets.
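A minimal sketch of how a result set hash could be computed with the standard MessageDigest API. The choice of SHA-256 and the row serialisation are assumptions; the module's actual algorithm may differ:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class ResultSetHash {
    /**
     * Hashes the canonical string form of every row in order, so the same
     * result set always yields the same hash (assumed approach).
     */
    static String hash(List<String> rows) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (String row : rows) {
            digest.update(row.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b)); // hex-encode the digest
        }
        return hex.toString();
    }
}
```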
The web interface module provides an interface for users.