A partialized version of the SPIDER Algorithm for inclusion dependency detection. The original SPIDER Algorithm was published by Bauckmann et al. in 2009. The authors discuss the application of partial inclusion dependency (pIND) discovery in section 6 of the paper. They outline a motivation and distinguish between two versions of pIND detection.
- The number of all distinct, not included values expressed as a percentage. Meaning that the distribution of duplicates is disregarded.
- The absolute number of not included values. Meaning the number of entry changes one has to conduct, to clean the pIND to a proper IND.
The authors propose a counting approach for the first version which only affects a small part of their pIND validation code. For the second version the proposed solution is 'To obtain the absolute number of not included values we need the number of occurrences for each value in the dependent attributes, which can be extracted from the database'. No more attention is given to the problem. In a setting, where the input is a file, there is no easy way to get the number of occurrences for each value, especially if attributes do not fit into main memory.
My implementation solves this issue by treating the number of occurrences as an essential part of any entry. This enables us to answer both versions of the question through the same concept.
To set the numbers I state below, it is important to clarify the hardware used.
- CPU: AMD Ryzan 5 3600X
- RAM: 4x8GB 1600MHz DDR4
- OS: Windows 10 Pro
- Storage: Samsung 850 EVO SSD
The original implementation of Bauckmann appears to be lost. This causes me to use the implementation offered by Dürsch et al. through their paper Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. The code for their SPIDER implementation can be found on github.
Averaging the execution time of the implementation of Dürsch et al. from 10 executions for the TCP-H dataset (1GB) yields a time of 131.95 seconds. This will be used as a baseline for further claims. TODO: compare runtime TODO: Make plot
There is no other known implementation which offers pIND discovery via the SPIDER Algorithm. The user can set a threshold which is then used to execute the algorithm and find all pINDs that have at most that percentage of violations (using version 1 or 2).
This implementation also offers a variety of null value treatments, such that user can pick the version which fits their needs best. TODO: Link main pIND repo
Currently there is no integration into a framework like Metanome. A user would have to execute the algorithm without a visual interface. A Database input is not possible. Meaning a user would need to create a file dump before the execution is possible. These inconveniences exist, since this project only exists as a comparison algorithm for my master thesis. Code adjustments, which would enable such integrations, should be fairly easy.