-
Notifications
You must be signed in to change notification settings - Fork 27
Add RFC for Presto -Native TPC-DS Connector #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
pdabre12
wants to merge
1
commit into
prestodb:main
Choose a base branch
from
pdabre12:native-tpcds-connector
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,135 @@ | ||
# **RFC-0009 for Presto** | ||
|
||
## Presto - Native TPC-DS connector | ||
|
||
Proposers | ||
|
||
* Pratik Joseph Dabre | ||
* Pramod Satya | ||
|
||
## [Related Issues] | ||
|
||
Related issues: https://github.com/prestodb/presto/issues/22361 | ||
|
||
Related PRs: https://github.com/prestodb/presto/pull/23067 | ||
|
||
## Summary | ||
|
||
A native TPC-DS connector capable of generating in-memory data on the fly is proposed. | ||
|
||
## Background | ||
|
||
- TPC-DS Benchmarks: | ||
The TPC-DS benchmark is a well-known benchmark suite developed by the Transaction Processing Performance Council (TPC) for evaluating the performance of database systems handling complex queries and data processing workloads typical of data warehouses. | ||
1. Purpose: The TPC-DS benchmark tests a system's ability to handle a variety of SQL queries, including ad-hoc, reporting, OLAP (Online Analytical Processing) and data mining workloads. It focuses on measuring query performance, throughput and response times. It includes 99 pre-defined queries of varying complexity, including joins, aggregations, sub-queries and sorts which are representative of real-world analytics and reporting queries in DSS(Decision Support Systems) environments. | ||
2. TPC-DS Schema: The TPC-DS schema is designed to mimic a real-wold retail sales business and consists of 24 tables. The key elements of the schema are: | ||
- Fact Tables: These tables store large amounts of transactional data. | ||
- Dimension Tables: These tables store information about dimensions such as time, products, customers and locations. | ||
3. DSDGen: To generate data for the TPC-DS benchmark, the dsdgen program is used. It is a tool specifically built to generate large volumes of synthetic data that matches the structure and distributions defined by the TPC-DS specification. It generates realistic datasets at different scale factors, which allows the simulation of different database sizes. | ||
|
||
- Currently , Presto does not have a native implementation of the TPC-DS connector. This RFC proposes the addition of a new TPC-DS connector which will allow the generation of synthetic data on the fly confirming to the specifications provided by the TPC org. The new connector can be used as a Presto - Native catalog. | ||
|
||
### [Optional] Goals | ||
|
||
1. Add a TPC-DS connector to generate TPC-DS data in Presto native. | ||
2. Write end-to-end tests in Presto native with TPC-DS tables. | ||
|
||
### [Optional] Non-goals | ||
|
||
## Proposed Implementation | ||
|
||
The Presto - Native TPC-DS connector will be a wrapper for the generator distributed (dsdgen) by the TPC organization from C. This means we need our implementation to have the exact same behavior as the C implementation. DuckDB already has a TPC-DS connector of their own and they have wrapped the C files into C++ files, we are going to use these C++ files in our implementation. | ||
|
||
In the C++ implementation, there are two types of tables: source tables and target tables used for generation. Source table files are prefixed with "s_", while target table files are prefixed with "w_". For instance, there may be files like "s_call_center.c" and "w_call_center.c". Currently, our focus is solely on implementing functionalities for the target tables (w_ tables). | ||
|
||
In the target table files prefixed with “w_”, there are some helper functions(need to be implemented by us) precisely called as “append_row_start“ and “append_row_end“ which help in the row generation. Depending on the schema of the table, there will be “append_ “ functions depending on the data type to be appended. | ||
|
||
A new TPC-DS config `tpcds.toggle-char-to-varchar` will be added to toggle the char columns to varchar, addressing the lack of support for the char data type in Presto - Native. This config allows the toggling of the char to varchar when required, ensuring consistency between Presto - Java and Presto - Native. At the schema level, all the columns which were of char type would be cast to a varchar type. | ||
- `SHOW COLUMNS` on `call_center` table when the `tpcds.toggle-char-to-varchar` is set to false. | ||
``` | ||
presto:sf1> show columns in call_center; | ||
|
||
Column | Type | Extra | Comment | ||
-------------------+--------------+-------+--------- | ||
cc_call_center_sk | bigint | | | ||
cc_call_center_id | char(16) | | | ||
cc_rec_start_date | date | | | ||
cc_rec_end_date | date | | | ||
cc_closed_date_sk | integer | | | ||
cc_open_date_sk | integer | | | ||
cc_name | varchar(50) | | | ||
cc_class | varchar(50) | | | ||
cc_employees | integer | | | ||
cc_sq_ft | integer | | | ||
cc_hours | char(20) | | | ||
cc_manager | varchar(40) | | | ||
cc_mkt_id | integer | | | ||
cc_mkt_class | char(50) | | | ||
cc_mkt_desc | varchar(100) | | | ||
cc_market_manager | varchar(40) | | | ||
cc_division | integer | | | ||
cc_division_name | varchar(50) | | | ||
cc_company | integer | | | ||
cc_company_name | char(50) | | | ||
cc_street_number | char(10) | | | ||
cc_street_name | varchar(60) | | | ||
cc_street_type | char(15) | | | ||
cc_suite_number | char(10) | | | ||
cc_city | varchar(60) | | | ||
cc_county | varchar(30) | | | ||
cc_state | char(2) | | | ||
cc_zip | char(10) | | | ||
cc_country | varchar(20) | | | ||
cc_gmt_offset | decimal(5,2) | | | ||
cc_tax_percentage | decimal(5,2) | | | ||
(31 rows) | ||
``` | ||
|
||
- `SHOW COLUMNS` on `call_center `table when the `tpcds.toggle-char-to-varchar` is set to true. | ||
``` | ||
presto:sf1> show columns in call_center; | ||
Column | Type | Extra | Comment | ||
-------------------+--------------+-------+--------- | ||
cc_call_center_sk | bigint | | | ||
cc_call_center_id | varchar(16) | | | ||
cc_rec_start_date | date | | | ||
cc_rec_end_date | date | | | ||
cc_closed_date_sk | integer | | | ||
cc_open_date_sk | integer | | | ||
cc_name | varchar(50) | | | ||
cc_class | varchar(50) | | | ||
cc_employees | integer | | | ||
cc_sq_ft | integer | | | ||
cc_hours | varchar(20) | | | ||
cc_manager | varchar(40) | | | ||
cc_mkt_id | integer | | | ||
cc_mkt_class | varchar(50) | | | ||
cc_mkt_desc | varchar(100) | | | ||
cc_market_manager | varchar(40) | | | ||
cc_division | integer | | | ||
cc_division_name | varchar(50) | | | ||
cc_company | integer | | | ||
cc_company_name | varchar(50) | | | ||
cc_street_number | varchar(10) | | | ||
cc_street_name | varchar(60) | | | ||
cc_street_type | varchar(15) | | | ||
cc_suite_number | varchar(10) | | | ||
cc_city | varchar(60) | | | ||
cc_county | varchar(30) | | | ||
cc_state | varchar(2) | | | ||
cc_zip | varchar(10) | | | ||
cc_country | varchar(20) | | | ||
cc_gmt_offset | decimal(5,2) | | | ||
cc_tax_percentage | decimal(5,2) | | | ||
(31 rows) | ||
|
||
``` | ||
- Generation of tables: | ||
The TpcdsSplitManager in Presto partitions the data for a table into chunks known as splits, which will then be distributed among workers for processing. The number of splits processed by each worker is defined by the Tpcds connector config tpcds.splits-per-node. Each Presto split will be processed by the native worker using velox connector interfaces. First, the Presto TPC-DS split, column handle, and table layout will be converted to the corresponding velox ConnectorSplit, ColumnHandle, and ConnectorTableHandle respectively. The velox connector interfaces to add and process splits will be implemented for the TPC-DS connector. To generate TPC-DS data with this connector for instance, the connector interface DataSource, implemented as TpcdsDataSource, will be used. The addSplit API in TpcdsDataSource will consume a TpcdsSplit, and generate rows corresponding to the offsets in this split by calling the API next until all rows in this split are processed. | ||
|
||
## Future Work | ||
- Update data : The existing APIs will work for generating the update data using the Presto-Native TPC-DS connector, however it would require the addition of more Dsdgen codegen files (prefixed with s_*)). The generation of update data is still unsupported in Presto: https://github.com/prestodb/presto/issues/23135 | ||
## Test Plan | ||
|
||
Native end-to-end tests are added in https://github.com/prestodb/presto/pull/23067. | ||
Future enhancements will include adding SpeedTest and ConnectorTest to the Velox repository. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be great to give more information about TpcdsSplits and how data generation happens for a table from one split to the next.