A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching methods.

delftdata/valentine

Valentine: Evaluating Matching Techniques for Dataset Discovery

Project page: https://delftdata.github.io/valentine/

This is the main repository for the framework used in the paper Valentine: Evaluating Matching Techniques for Dataset Discovery. The data generator and the framework's output and visualizations are in the following repositories:

Data generator: valentine-generator

Paper results and visualizations: valentine-paper-results

The datasets used for experiments in Valentine can be found in the datasets-archive.

Installation instructions

The following instructions have been tested on a newly created Ubuntu 18.04 LTS VM. If you prefer to run the entire suite in Docker, skip this section and the Run experiments section and go directly to the Run with docker section.

  1. Clone the repo to your machine using git: git clone https://github.com/delftdata/valentine-suite
  2. Install all the dependencies required by the suite by running the install-dependencies.sh script.

NOTE: This script installs programs, so some parts of it require sudo rights.

After these two steps, the framework should need no further dependencies.

Run experiments

  1. Download the data from the datasets-archive and place it in a folder named data at the project root level.

  2. Set the grid-search configuration that you want to run for all the algorithms in the file algorithm_configurations.json.

  3. Activate the conda environment created during installation with conda activate valentine-suite, then run the generate_configuration_files.py script with python generate_configuration_files.py. This creates the configuration files, each of which specifies one schema matching job (a specific method with specific parameters on a specific dataset).

NOTE: If your system cannot find conda, you may need to run source ~/.bashrc first.

  4. To run the schema matching jobs in parallel, run the script run_experiments.sh with the command ./run_experiments.sh {method_name} {number_of_parallel_jobs}. For example, to run 40 Cupid jobs concurrently, run ./run_experiments.sh Cupid 40 (this requires a 40-CPU VM to run smoothly). The output is written to the output folder at the project root level.
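As an illustration of how a grid-search configuration can expand into individual schema matching jobs (step 3), here is a minimal Python sketch. It is not the code in generate_configuration_files.py: the grid layout, the parameter names (threshold, weight), the method ExampleMatcher, and the dataset names are all hypothetical, and the real algorithm_configurations.json may be structured differently.

```python
import itertools
import json

# Hypothetical grid: method name -> parameter name -> candidate values.
# The real algorithm_configurations.json may use a different layout.
GRID = {
    "Cupid": {"threshold": [0.6, 0.7], "weight": [0.2, 0.4, 0.6]},
    "ExampleMatcher": {"threshold": [0.5]},
}


def expand_grid(grid, datasets):
    """Yield one job per (method, parameter combination, dataset)."""
    for method, params in grid.items():
        names = sorted(params)
        # Cartesian product over the candidate values of every parameter.
        for values in itertools.product(*(params[n] for n in names)):
            for dataset in datasets:
                yield {
                    "method": method,
                    "parameters": dict(zip(names, values)),
                    "dataset": dataset,
                }


if __name__ == "__main__":
    jobs = list(expand_grid(GRID, ["dataset_a", "dataset_b"]))
    print(len(jobs))
    print(json.dumps(jobs[0], indent=2))
```

The number of jobs grows multiplicatively with every parameter added to the grid, which is worth keeping in mind when choosing {number_of_parallel_jobs}.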

Run with docker

The entire suite is also available as a Docker image named kpsarakis/valentine-suite:1.0. The steps to run with Docker are the following:

  1. Run the command sudo docker run --privileged=true -it -v /var/run/docker.sock:/var/run/docker.sock kpsarakis/valentine-suite:1.0. This downloads the image and starts a shell in a container that holds the Valentine suite.

  2. Activate the conda environment by running conda activate valentine-suite.

  3. Go into the folder of the suite using cd /home/valentine-benchmark

  4. You can now run the suite with the data used in the paper Valentine: Evaluating Matching Techniques for Dataset Discovery by running ./run_experiments.sh {method_name} {number_of_parallel_jobs}. For example, to run 40 Cupid jobs concurrently, run ./run_experiments.sh Cupid 40 (this requires a 40-CPU VM to run smoothly). The output is written to the output folder at the project root level, i.e. /home/valentine-benchmark/output.
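The {number_of_parallel_jobs} argument caps how many jobs run at the same time. The general pattern of running a list of commands with a bounded number of concurrent workers can be sketched in Python as follows. This is an assumption-laden illustration, not the implementation of run_experiments.sh: the function names run_job and run_all are hypothetical, and the commands are stand-ins rather than real matcher invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one job as a subprocess and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_all(commands, max_parallel):
    """Run commands with at most max_parallel active at once.

    Exit codes are returned in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_job, commands))


if __name__ == "__main__":
    # Stand-in commands; a real job would start a matcher with one
    # configuration file instead of printing a message.
    commands = [[sys.executable, "-c", f"print('job {i}')"] for i in range(4)]
    print(run_all(commands, max_parallel=2))
```

Threads are enough here because each worker only waits on its subprocess. The jobs themselves run as separate processes, which is why the number of available CPUs limits how many can run smoothly at once.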

Integrate new methods

Since Valentine is an experiment suite, it is designed to be extended with more schema matching methods. To add such a method, follow the wiki guide.

Project structure

Cite Valentine

@inproceedings{koutras2021valentine,
  title     = {Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author    = {Christos Koutras and George Siachamis and Andra Ionescu and Kyriakos Psarakis and Jerry Brons and Marios Fragkoulis and Christoph Lofi and Angela Bonifati and Asterios Katsifodimos},
  booktitle = {37th IEEE International Conference on Data Engineering, ICDE 2021},
  pages     = {1--12},
  publisher = {IEEE},
  year      = {2021}
}
