Similarity-based data discovery in data lakes
This is the home of the D3L data discovery framework, an approximate implementation of the ICDE 2020 research paper of the same name.
The implementation is approximate because not all notions proposed in the paper have been transferred to code. The most notable differences are listed below:
- The indexing evidence for numerical data is different from the one presented in the paper. In this package, numerical columns are transformed to their density-based histograms and indexed under a random projection LSH index.
- The distance aggregation function (Equation 3 from the paper) is not yet implemented. Instead, the aggregation function is customizable. During testing, a simple average of distances produced results comparable to those reported in the paper.
- The package uses similarity scores (between 0 and 1) instead of the distances described in the paper.
- The join path discovery functionality from the paper is not yet implemented. This part of the implementation will follow shortly.
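To make the first two differences more concrete, here is a minimal sketch, not the package's actual code, of how a numerical column might be turned into a density-based histogram, hashed with random projections, and how several per-evidence similarity scores might be averaged. The function names (`density_histogram`, `random_projection_signature`, `signature_similarity`, `average_similarity`), the bin count, the value range, and the number of hyperplanes are all illustrative assumptions and are not taken from D3L.

```python
import numpy as np


def density_histogram(values, bins=32, value_range=(0.0, 1.0)):
    """Turn a numerical column into a fixed-length density histogram.

    Assumption: values are already scaled into `value_range`.
    """
    hist, _ = np.histogram(
        np.asarray(values, dtype=float),
        bins=bins,
        range=value_range,
        density=True,
    )
    return hist


def random_projection_signature(vector, planes):
    """One bit per random hyperplane: 1 if the projection is non-negative."""
    return (planes @ vector >= 0).astype(np.uint8)


def signature_similarity(sig_a, sig_b):
    """Fraction of matching bits, a score between 0 and 1."""
    return float(np.mean(sig_a == sig_b))


def average_similarity(scores):
    """Simple average used as a stand-in aggregation function."""
    return float(np.mean(scores))


# Hypothetical usage: two synthetic columns with similar distributions.
rng = np.random.default_rng(0)
planes = rng.standard_normal((64, 32))  # 64 hyperplanes over 32 bins
col_a = rng.normal(0.5, 0.1, 1000).clip(0, 1)
col_b = rng.normal(0.5, 0.1, 1000).clip(0, 1)

sig_a = random_projection_signature(density_histogram(col_a), planes)
sig_b = random_projection_signature(density_histogram(col_b), planes)
distribution_score = signature_similarity(sig_a, sig_b)

# Combine with scores from other (hypothetical) evidence types.
overall = average_similarity([distribution_score, 0.8, 0.6])
```

A real LSH index would bucket signatures so that candidate columns can be retrieved without comparing every pair; the sketch only shows the hashing and scoring steps.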
You'll need Python 3.6.x to use this package.
pip install git+https://github.com/alex-bogatu/d3l
You may wish to install a specific release. To do this, you can run:
pip install git+https://github.com/alex-bogatu/d3l@{tag|branch}
Substitute a specific branch name or tag in place of {tag|branch}.
See here for an example notebook.
Keep in mind that this is a BETA version and that future releases will follow. Until then, if you encounter any issues, feel free to raise them here.
All contributions must conform to PEP-8 and the Black code style.
This package adopts numpy-style docstrings for in-code documentation. See the numpy GitHub repo for examples.
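As a rough illustration of the expected docstring layout, the sketch below documents a small helper in numpy style. The function `jaccard_similarity` is a hypothetical example written for this README and is not part of the package's API.

```python
def jaccard_similarity(left, right):
    """Compute the Jaccard similarity between two collections.

    Parameters
    ----------
    left : iterable
        The first collection of hashable items.
    right : iterable
        The second collection of hashable items.

    Returns
    -------
    float
        The size of the intersection divided by the size of the union,
        a value between 0 and 1. Returns 0.0 when both inputs are empty.

    Examples
    --------
    >>> jaccard_similarity({"a", "b"}, {"b", "c", "d"})
    0.25
    """
    left_set, right_set = set(left), set(right)
    union = left_set | right_set
    if not union:
        # Convention chosen for this example only.
        return 0.0
    return len(left_set & right_set) / len(union)
```

Section headers such as Parameters, Returns, and Examples are underlined with dashes, and each parameter is listed as `name : type` followed by an indented description.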