D3L Data Discovery Framework

Similarity-based data discovery in data lakes

This is the home of D3l data discovery framework: an approximate implementation of the ICDE 2020 paper with the same name.

Getting started

This is an approximate implementation of the D3L research paper published at ICDE 2020. The implementation is approximate because not all notions proposed in the paper are transferred to code. The most notable differences are mentioned below:

The indexing evidence for numerical data is different from the one presented in the paper. In this package, numerical columns are transformed to their density-based histograms and indexed under a random projection LSH index.
The distance aggregation function (Equation 3 from the paper) is not yet implemented. In fact, the aggregation function is customizable. During testing, a simple average of distances has proven comparable to the level reported in the paper.
The package uses similarity scores (between 0 and 1) instead of distances, as described in the paper.
The join path discovery functionality from the paper is not yet implemented. This part of the implementation will follow shortly.

Installation

You'll need Python 3.6.x to use this package.

pip install git+https://github.com/alex-bogatu/d3l

Installing from a specific release

You may wish to install a specific release. To do this, you can run:

pip install git+https://github.com/alex-bogatu/d3l@{tag|branch}

Substitute a specific branch name or tag in place of {tag|branch}.

Usage

See here for an example notebook.

However, keep in mind that this is a BETA version and future releases will follow. Until then, if you encounter any issues feel free to raise them here.

Contributing

All contributions must conform to PEP-8 and code style Black. This package adopts numpy style docstrings for in-code documentation. See the numpy GitHub repo for examples.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
d3l		d3l
examples/notebooks		examples/notebooks
resources/data		resources/data
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

D3L Data Discovery Framework

Getting started

Installation

Installing from a specific release

Usage

Contributing

About

Releases

Packages

Languages

License

alex-bogatu/d3l

Folders and files

Latest commit

History

Repository files navigation

D3L Data Discovery Framework

Getting started

Installation

Installing from a specific release

Usage

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages