partial-SPIDER

A partialized version of the SPIDER Algorithm for inclusion dependency detection. The original SPIDER Algorithm was published by Bauckmann et al. in 2009. The authors discuss the application of partial inclusion dependency (pIND) discovery in section 6 of the paper. They outline a motivation and distinguish between two versions of pIND detection.

The number of all distinct, not included values expressed as a percentage. Meaning that the distribution of duplicates is disregarded.
The absolute number of not included values. Meaning the number of entry changes one has to conduct, to clean the pIND to a proper IND.

The authors propose a counting approach for the first version which only affects a small part of their pIND validation code. For the second version the proposed solution is 'To obtain the absolute number of not included values we need the number of occurrences for each value in the dependent attributes, which can be extracted from the database'. No more attention is given to the problem. In a setting, where the input is a file, there is no easy way to get the number of occurrences for each value, especially if attributes do not fit into main memory.

My implementation solves this issue by treating the number of occurrences as an essential part of any entry. This enables us to answer both versions of the question through the same concept.

Hardware

To set the numbers I state below, it is important to clarify the hardware used.

CPU: AMD Ryzan 5 3600X
RAM: 4x8GB 1600MHz DDR4
OS: Windows 10 Pro
Storage: Samsung 850 EVO SSD

Advantages of this implementation

The original implementation of Bauckmann appears to be lost. This causes me to use the implementation offered by Dürsch et al. through their paper Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. The code for their SPIDER implementation can be found on github.

Something about algo runtime

Averaging the execution time of the implementation of Dürsch et al. from 10 executions for the TCP-H dataset (1GB) yields a time of 131.95 seconds. This will be used as a baseline for further claims. TODO: compare runtime TODO: Make plot

Something about clean code

Ability to detect partial/dirty INDs

There is no other known implementation which offers pIND discovery via the SPIDER Algorithm. The user can set a threshold which is then used to execute the algorithm and find all pINDs that have at most that percentage of violations (using version 1 or 2).

Null value treatment

This implementation also offers a variety of null value treatments, such that user can pick the version which fits their needs best. TODO: Link main pIND repo

Disadvantages of this implementation

Currently there is no integration into a framework like Metanome. A user would have to execute the algorithm without a visual interface. A Database input is not possible. Meaning a user would need to create a file dump before the execution is possible. These inconveniences exist, since this project only exists as a comparison algorithm for my master thesis. Code adjustments, which would enable such integrations, should be fairly easy.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
baseSPIDER		baseSPIDER
experiments/initial_multithreading		experiments/initial_multithreading
pSPIDER		pSPIDER
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

partial-SPIDER

Hardware

Advantages of this implementation

Something about algo runtime

Something about clean code

Ability to detect partial/dirty INDs

Null value treatment

Disadvantages of this implementation

About

Uh oh!

Uh oh!

Languages

Jakob-L-M/partial-SPIDER

Folders and files

Latest commit

History

Repository files navigation

partial-SPIDER

Hardware

Advantages of this implementation

Something about algo runtime

Something about clean code

Ability to detect partial/dirty INDs

Null value treatment

Disadvantages of this implementation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages