zhangxffff

  1. add --query argument support for tpcbench.py to run a custom query against the tpch tables.
  2. fix scripts in docs/contributing.md

@zhangxffff
Author

Tested in a local environment:

>> RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpcbench.py --data=tpch --concurrency=2 --batch-size=8182 --worker-pool-min=10 --validate --query 'select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name order by c_name limit 1' 
Executing custom query:  select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name order by c_name limit 1
2025-03-12 11:30:36,038 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
Registering table customer using path tpch/customer.parquet
Registering table lineitem using path tpch/lineitem.parquet
Registering table nation using path tpch/nation.parquet
Registering table orders using path tpch/orders.parquet
Registering table part using path tpch/part.parquet
Registering table partsupp using path tpch/partsupp.parquet
Registering table region using path tpch/region.parquet
Registering table supplier using path tpch/supplier.parquet
Writing results to datafusion-ray-tpch-1741750238210.json
statements = ['select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name order by c_name limit 1']
executing  select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name order by c_name limit 1
+--------------------+-----------+
| c_name             | total     |
+--------------------+-----------+
| Customer#000000001 | 587762.91 |
+--------------------+-----------+
done with query custom query
{
    "engine": "datafusion-ray",
    "benchmark": "tpch",
    "settings": {
        "concurrency": 2,
        "batch_size": 8182,
        "prefetch_buffer_size": 0,
        "partitions_per_worker": null
    },
    "data_path": "tpch",
    "queries": {
        "custom query": 0.19914793968200684
    },
    "validated": {
        "custom query": true
    }
}
benchmark complete. sleeping for 3 seconds for ray to clean up

@zhangxffff
Author

Hi @robtandy, would you mind reviewing this PR when you get a chance? This is the follow-up work we discussed in #82. Thanks for your time!

@jazracherif jazracherif left a comment

Hello,

Thank you for this PR!

I'm new to this project and ran into the exact issues this PR handles. I was going to submit a fix myself until I found this one, so I'm just sharing some thoughts on it.

Comment on lines +189 to +208
    if (args.qnum != -1 and args.query is not None):
        print("Please specify either --qnum or --query, but not both")
        exit(1)

    queries = []
    if (args.qnum != -1):
        if args.qnum < 1 or args.qnum > 22:
            print("Invalid query number. Please specify a number between 1 and 22.")
            exit(1)
        else:
            queries.append((str(args.qnum), tpch_query(args.qnum)))
            print("Executing tpch query ", args.qnum)

    elif (args.query is not None):
        queries.append(("custom query", args.query))
        print("Executing custom query: ", args.query)
    else:
        print("Executing all tpch queries")
        queries = [(str(i), tpch_query(i)) for i in range(1, 23)]

Minor suggestion: extract this into its own function, for example:

from typing import List, Optional, Tuple

def get_sql_queries(tpch_qnum: Optional[int] = None, sql_statement: Optional[str] = None) -> List[Tuple[str, str]]:
    """
    Get the list of SQL statements from either the TPCH queries or a user-provided SQL statement.
    At most one of these parameters can be provided.

    :param tpch_qnum: the TPCH query number. If None, return all supported TPCH queries
    :param sql_statement: SQL statement string on the available data tables (e.g. ingested through make_data.py)
    :return: a list of tuples holding the query name and the SQL statement string
    """

    parser.add_argument("--qnum", type=int, default=-1,
                        help="TPCH query number, 1-22")
    parser.add_argument("--query", required=False, type=str,
                        help="Custom query to run with tpch tables")

Suggested change
help="Custom query to run with tpch tables")
help="Custom SQL query statement to run with tpch tables")
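
As a possible alternative to checking the two flags by hand, argparse itself can enforce both the "not both" rule and the 1-22 range. The sketch below is an assumption about how that could look, not what this PR does: `build_parser` is a hypothetical name, and `--qnum` defaults to `None` here rather than `-1`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the two query-selection flags."""
    parser = argparse.ArgumentParser(description="query-selection flags (sketch)")
    # argparse rejects the command line if more than one flag in the group is given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--qnum", type=int, choices=range(1, 23), metavar="N",
                       help="TPCH query number, 1-22")
    group.add_argument("--query", type=str,
                       help="Custom SQL query statement to run with tpch tables")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.qnum, args.query)
```

When both flags are passed, or `--qnum` is out of range, argparse prints a usage error and exits with status 2, so the manual validation branches would no longer be needed.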

            print("Invalid query number. Please specify a number between 1 and 22.")
            exit(1)
        else:
            queries.append((str(args.qnum), tpch_query(args.qnum)))

Explicitly mention TPCH in the ID:

Suggested change
queries.append((str(args.qnum), tpch_query(args.qnum)))
queries.append((f"TPCH-{args.qnum}", tpch_query(args.qnum)))

        print("Executing custom query: ", args.query)
    else:
        print("Executing all tpch queries")
        queries = [(str(i), tpch_query(i)) for i in range(1, 23)]

Suggested change
queries = [(str(i), tpch_query(i)) for i in range(1, 23)]
queries = [(f"TPCH-{i}", tpch_query(i)) for i in range(1, 23)]


```bash
RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpc.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 --qnum 2
RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpcbench.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 --qnum 2
```

I would recommend standardizing the data file directory to testdata/tpch and adding the correct make_data.py command just above, for example:

Suggested change
RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpcbench.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 --qnum 2
RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpcbench.py --data=../testdata/tpch --concurrency=2 --batch-size=8182 --worker-pool-min=10 --qnum 2

Before this, add more documentation on make_data.py:

  • In the tpch directory, use make_data.py to create a TPCH dataset at a provided scale factor and an output directory, such as the testdata directory:
python make_data.py 1 "../testdata/tpch"

You could also specify an environment variable for this in the setup:
TPCH_DATA=../testdata/tpch

and replace the paths in the examples with $TPCH_DATA.
