We are thrilled to present a new approach to analyzing Big Data by touching the smallest possible amount of data: Sommelier Sampling.
With the rise of Big Data and analytics, the demand for optimizing how data is handled grows every day. Executing complex analytical queries on ever-growing datasets costs time and money; data access keeps getting more expensive and runs into memory walls, which makes sampling techniques an attractive solution.
Sampling is a powerful but often feared technique for approximating query answers. The main difficulty on large data platforms is obtaining sizable savings while keeping the effect on answer quality small, and estimating the resulting error remains challenging.
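As a minimal, hypothetical sketch of what error estimation looks like (the dataset, sample size, and variable names below are illustrative assumptions, not part of this project's API), here is how a `SUM` over a uniform random sample can be scaled up and accompanied by a confidence interval:

```python
import numpy as np

# Illustrative only: estimate SUM(x) from a 1% uniform sample instead of
# scanning the full column, and attach a rough 95% confidence interval.
rng = np.random.default_rng(42)
population = rng.normal(loc=100.0, scale=15.0, size=1_000_000)  # assumed data

n = 10_000                                    # sample size (1% of the rows)
sample = rng.choice(population, size=n, replace=False)

# Scale the sample sum up by the inverse sampling fraction.
scale = population.size / n
sum_estimate = sample.sum() * scale

# Standard error of the scaled-up sum under simple random sampling
# (finite-population correction omitted for brevity).
std_error = population.size * sample.std(ddof=1) / np.sqrt(n)

print(f"true sum      : {population.sum():,.0f}")
print(f"estimated sum : {sum_estimate:,.0f} ± {1.96 * std_error:,.0f} (95% CI)")
```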
Several studies have demonstrated its benefits and proposed rules and formulas for estimating the error. Building on that work, we propose to study those techniques and implement an open-source version of them.
In the spec.ipynb notebook we present a few examples to illustrate the different types of samples that exist and when each should be used. We create synthetic datasets, both normally distributed and skewed, and show how their shape affects query results when sampling techniques are used, and why the cardinality of the data is key. A small sketch of this comparison is shown below.
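The following is a hedged, self-contained sketch of that comparison, not the notebook's actual code; the distributions, sizes, and names are assumptions chosen only to show how the same uniform sample behaves very differently on normal versus skewed data:

```python
import numpy as np

# Illustrative comparison: the same 1% uniform sample yields a much less
# reliable SUM estimate on heavily skewed data than on normal data.
rng = np.random.default_rng(7)
N, n = 1_000_000, 10_000

datasets = {
    "normal": rng.normal(loc=100.0, scale=15.0, size=N),
    "skewed": rng.lognormal(mean=3.0, sigma=2.0, size=N),  # heavy right tail
}

for name, column in datasets.items():
    sample = rng.choice(column, size=n, replace=False)
    estimate = sample.sum() * (N / n)           # scale up by sampling fraction
    rel_error = abs(estimate - column.sum()) / column.sum()
    print(f"{name:>6}: relative error of sampled SUM = {rel_error:.2%}")
```

With skewed data, a handful of extreme values dominate the true sum, so whether they land in the sample or not swings the estimate widely; this is the kind of effect the notebook explores.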
We look forward to your feedback! If you have any questions or suggestions about the project structure, let us know: the issues are open.
And if you want to contribute, don't hesitate to contact us!