Blend: A Unified Data Discovery System

This repository contains the code for the paper "Blend: A Unified Data Discovery System".

Abstract

Data discovery is an iterative and incremental process that necessitates the execution of multiple data discovery queries to identify the desired tables from large and diverse data lakes. Current methodologies concentrate on single discovery tasks such as join, correlation, or union discovery. However, in practice, a series of these approaches and their corresponding index structures is necessary to enable the user to discover the desired tables.

This paper presents Blend, a comprehensive data discovery system that empowers users to develop ad-hoc discovery tasks without the need to develop new algorithms or build a new index structure. To achieve this goal, we introduce a general index structure capable of addressing multiple discovery queries. We develop a set of lower-level operators that serve as the fundamental building blocks for more complex and sophisticated user tasks. These operators are highly efficient and enable end-to-end efficiency.

To enhance the execution of the discovery pipeline, we rewrite the search queries into optimized SQL statements to push the data operators down to the database. We demonstrate that our holistic system is able to achieve comparable effectiveness and runtime efficiency to the individual state-of-the-art approaches specifically designed for a single task.

Installation

The code was tested on Python 3.10.9. To install the required packages, run the following command:

python -m pip install -r requirements.txt

Index generation

To build the index, use the create_index.py script. Before executing it, set the required parameters in the file:

import vertica_python

PATH = 'data/*.csv'          # Path where the CSV files are located
INDEX_NAME = 'blend_index'   # Name of the index created for Blend

dbcon = vertica_python.connect(
        port=5433,
        host='db.example.com',
        user='username',
        password='password',
        database='vdb',
        session_label='some_label',
        read_timeout=60000,
        unicode_error='strict',
        ssl=False,
        use_prepared_statements=False
)
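
Once these parameters are set, the index can presumably be built by simply executing the script; the command below is a sketch assuming it is run from the repository root:

python create_index.py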

Database configuration

Blend needs a database connection to run. To configure it, create a config file named config.ini in the config folder. Its contents depend on the DBMS you are using; below is an example config file for Vertica.

[Database]
dbms=vertica
host=db.example.com
port=5433
user=username
password=password
dbname=vdb

index_table=blend_index
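
For reference, a file in this format can be read with Python's standard configparser module. The snippet below is only an illustrative sketch of how such a config could be consumed to open a Vertica connection; it mirrors the keys shown above and is not part of Blend's code base:

import configparser
import vertica_python

# Illustrative sketch only: read config/config.ini and open a connection
# using the [Database] keys shown above (not part of Blend's code base).
config = configparser.ConfigParser()
config.read('config/config.ini')
db = config['Database']

if db['dbms'] == 'vertica':
    dbcon = vertica_python.connect(
        host=db['host'],
        port=int(db['port']),
        user=db['user'],
        password=db['password'],
        database=db['dbname'],
    )
    index_table = db['index_table']   # name of the Blend index table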

Example

Below are examples of the discovery plans used in the paper.

Union plan

def UnionPlan(dataset, k=10):
    plan = Plan()
    input_element = Input(dataset)
    plan.add('input', input_element)
    for clm_name in dataset.columns:
        element = Seekers.SC(dataset[clm_name], k)
        plan.add(clm_name, element, ['input'])
    plan.add('counter', Combiners.Counter(k), list(dataset.columns))
    plan.add('terminal', Terminal(), ['counter'])
    return plan
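
Analogous to the augmentation example below, such a plan can then be instantiated and executed on a query table. The call below is a sketch assuming a hypothetical CSV file query_table.csv:

import pandas as pd

df = pd.read_csv('query_table.csv')   # hypothetical query table
union_plan = UnionPlan(df, k=10)
union_plan.run()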

Augmentation by example plan

import pandas as pd

def AugmentationByExamplePlan(examples, queries, K=10):
    plan = Plan()
    inputs = Input([examples, queries])
    plan.add('input', inputs)
    examples_seeker = Seekers.MC(examples, K)
    plan.add('example', examples_seeker, ['input'])
    query_seeker = Seekers.SC(queries, K)
    plan.add('query', query_seeker, ['input'])
    plan.add('combiner', Combiners.Intersection(K), ['example', 'query'])
    plan.add('terminal', Terminal(), ['combiner'])
    return plan

df = pd.read_csv('dataset.csv')
aug = AugmentationByExamplePlan(df[['E1', 'E2']], df['Q'])
aug.run()

Experiments

Runtime Breakdown


In this section, we assess the runtime distribution for each of the evaluated operations. We divide the execution of each search plan into three computational components: DB, the time spent on database query execution; Load, the time it takes to load the database query results into main memory; and Mem, the time spent processing the data in main memory. Our goal in Blend is to reduce the loading time by moving most of the computation into the database, thereby preventing expensive data movement. By doing so, we can achieve runtime efficiency comparable to the stand-alone state-of-the-art baselines.

Table 6 shows the results of this experiment. As the union discovery baseline, i.e., Starmie, leverages a large language model, it does not use a database to store the table values; in this case, the loading time represents the average time the approach takes to generate the embeddings from the large language model for the query tables. In all discovery tasks except union discovery, where DB is not defined for the baseline, Blend spends a higher percentage of its time on in-database query execution than the baselines. On average, Blend utilizes the database up to 45% more than the baselines, which drastically reduces the loading time. Blend’s loading time is negligible in all cases except MC join discovery. Blend requires more time to load the data for MC join because, to efficiently prune false positives, it has to read the posting lists, which include the super keys; depending on the number of candidate tables, these posting lists require additional fetching time. However, the SQL query used in Blend reduces the number of candidate tables and therefore lowers the loading overhead from 68.2% in MATE to 14.3%.

Blend optimizer vs. Postgres and Vertica

As augmentation-by-example leverages multiple seekers, we evaluated the performance of our query rewriter in the execution engine against two baselines that rely only on the native DBMS optimizer: (i) executing the queries independently and then merging the results, and (ii) modeling the operator sequence with subquery formulations. In our experiments on both a commercial column store and PostgreSQL, our query rewriter achieves up to 28% and 27% runtime reduction compared to these two baselines, respectively.
