Download and install the PostgreSQL distribution suitable for your platform from https://www.postgresql.org/download/.
- Open the TPC webpage: https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp
- In the `Active Benchmarks` table (the first table), follow the link `Download TPC-H_Tools_v3.0.1.zip`. It redirects to the `TPC-H Tools Download` page.
- Give your details and click `download`. A download link will be emailed to you; use it to download the zip file.
- Unzip the zip file. The extracted contents must include the `dbgen` folder.
- Download the `tpch-pgsql` code from: https://github.com/Data-Science-Platform/tpch-pgsql/tree/master
- Follow the `tpch-pgsql` project Readme to prepare and load the data.
- (If the above command fails with an error that `malloc.h` is not found, the error message shows the affected filenames. Go inside the `dbgen` folder, open each of those files, and replace `malloc.h` with `stdlib.h`.)
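The `malloc.h` fix above can also be applied in one pass instead of editing each file by hand. The following is a minimal sketch, not part of either project: the function name `replace_malloc_include` is hypothetical, and it assumes the affected files are `.c` or `.h` sources somewhere under the `dbgen` folder.

```python
from pathlib import Path


def replace_malloc_include(dbgen_dir):
    """Replace 'malloc.h' with 'stdlib.h' in C sources under dbgen_dir.

    Hypothetical helper for illustration. Works on raw bytes so that
    line endings and any non-UTF-8 characters are left untouched.
    Returns the list of files that were modified.
    """
    changed = []
    for path in sorted(Path(dbgen_dir).rglob("*")):
        # Only look at regular C source and header files.
        if not path.is_file() or path.suffix not in {".c", ".h"}:
            continue
        data = path.read_bytes()
        if b"malloc.h" not in data:
            continue
        path.write_bytes(data.replace(b"malloc.h", b"stdlib.h"))
        changed.append(path)
    return changed
```

Calling `replace_malloc_include("dbgen")` from the directory that contains the extracted `dbgen` folder returns the files it rewrote, so the result can be compared against the filenames listed in the compiler error.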
TPCH 100MB (sf=0.1) data is provided at: https://github.com/ahanapradhan/UnionExtraction/blob/master/mysite/unmasque/test/experiments/data/tpch_tiny.zip
The load.sql file in the folder needs to be updated with the corresponding location of the data .csv files.
https://duckdb.org/docs/extensions/tpch.html
A development environment for a Python project is required next. Here is the link to PyCharm Community Edition: https://www.jetbrains.com/pycharm/download/ (any other IDE is also fine).
- Python 3.8.0 or above
- django==4.2.4
- sympy==1.4
- psycopg2==2.9.3
- numpy==1.22.4
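To confirm that the environment matches the versions listed above, a small check along these lines may help. It is only a sketch: the names `PINNED` and `check_pins` are made up for this example, and it relies on the standard-library `importlib.metadata` module (available from Python 3.8).

```python
import sys
from importlib import metadata

# Pinned versions copied from the requirements listed above.
PINNED = {
    "django": "4.2.4",
    "sympy": "1.4",
    "psycopg2": "2.9.3",
    "numpy": "1.22.4",
}


def check_pins(pinned, get_version=metadata.version):
    """Return {name: (wanted, found, matches)} for each pinned package.

    `found` is None when the distribution is not installed under that
    name (for example, if psycopg2 was installed as psycopg2-binary).
    """
    report = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter is Python 3.8 or newer."""
    return tuple(version_info[:2]) >= (3, 8)
```

Running `check_pins(PINNED)` in the project's interpreter gives one entry per package, which makes a mismatched or missing dependency easy to spot before starting the extractor.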
The code is organized into the following directories:
The `mysite` directory contains the main project code.
Inside `unmasque`, you'll find the following subdirectories:
- The `src` directory contains code that has been refactored from the original codebase developed in various theses, as well as newly written logic, often designed to simplify existing code. This may include enhancements or entirely new functionality.
- The `test` directory houses unit test cases for each extractor module. These tests are crucial for ensuring the reliability and correctness of the code.
Please explore the individual directories for more details on the code and its purpose.
Inside the `mysite` directory, there are two files:
pkfkrelations.csv --> contains the primary-key and foreign-key details for the TPCH schema. If any other schema is to be used, change this file accordingly.
config.ini --> contains database login credentials and flags for optional features. Change the fields accordingly.
- `database` section: set your database credentials.
- `support` section: give the support file name. The support file should be in the same directory as this config file.
- `logging` section: set the logging level. The developer mode is `DEBUG`. Other valid levels are `INFO` and `ERROR`.
- `feature` section: set flags for advanced features, as the flag names indicate. Included features are `UNION`, `OUTER JOIN`, the `<>` or `!=` operator in arithmetic filter predicates, and the `IN` operator.
- `options` section: extractor options. E.g., the maximum value for the `LIMIT` clause is 1000 by default. To set a higher value, use `limit=value`.
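To make the section layout concrete, here is a rough sketch of how such a file could be read with Python's standard `configparser`. Only the section names (`database`, `support`, `logging`, `feature`, `options`) and the `limit` option come from the description above. Every other key name in `SAMPLE` (`host`, `dbname`, `pkfk`, `level`, `union`, and so on) is an assumption made for illustration, so check the project's actual `config.ini` for the real field names.

```python
import configparser

# Illustrative config text. Section names follow the description above;
# all key names except `limit` are hypothetical placeholders.
SAMPLE = """
[database]
host = localhost
port = 5432
user = postgres
password = postgres
dbname = tpch

[support]
pkfk = pkfkrelations.csv

[logging]
level = DEBUG

[feature]
union = no
outer_join = no

[options]
limit = 5000
"""

VALID_LEVELS = {"DEBUG", "INFO", "ERROR"}


def load_settings(text):
    """Parse config text and return a plain dict of the values used here."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    level = parser.get("logging", "level", fallback="INFO").upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported logging level: {level}")

    return {
        "database": dict(parser["database"]),
        "support_file": parser.get("support", "pkfk"),
        "level": level,
        "union": parser.getboolean("feature", "union", fallback=False),
        # Falls back to 1000, the maximum LIMIT value mentioned above.
        "limit": parser.getint("options", "limit", fallback=1000),
    }
```

`load_settings(SAMPLE)` returns a dictionary with `limit` set to 5000. Leaving out the `options` section falls back to 1000, mirroring the behaviour described for the `LIMIT` clause.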
Open the `mysite/unmasque/src/main_cmd.py` file.
This script has one default input query specified.
Change this query to try Unmasque on various inputs.
The `test.util` package has a `queries.py` file containing a few sample queries. Any of them can be used for testing.
Change the current directory to `mysite`.
Use the following command:
python -m unmasque.src.main_cmd
Alternatively, the `main` function in `main_cmd.py` can be run from the IDE.
(The current code uses relative imports in the `main_cmd.py` script. If that causes import-related errors when running from the IDE, change the imports to absolute.)
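In the terminal, go inside the `unmasque` folder and start the Django app using the command: python3 manage.py runserver
Once the server is up on port 8080 of localhost, the GUI can be accessed at: http://localhost:8080/unmasque/
(Django's `runserver` listens on port 8000 by default. If the server does not come up on 8080, pass the port explicitly: python3 manage.py runserver 8080)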