Commit

Add microbenchmarks
andygrove committed Jul 16, 2024
1 parent de8c55e commit 3b0d3f0
Showing 18 changed files with 501 additions and 0 deletions.
69 changes: 69 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,69 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFusion Comet Micro Benchmarks

The goal of these micro benchmarks is to enable benchmarking and performance profiling of simple queries
containing a small number of operators and expressions. These queries run against TPC-DS data and many of
the queries represent subsets of the original TPC-DS queries.

For full TPC-DS benchmarking, refer to the [DataFusion Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html).

Follow the [Comet Installation](https://datafusion.apache.org/comet/user-guide/installation.html) guide to download or
create a Comet JAR file, then set the `COMET_JAR` environment variable to point to that JAR file.

```shell
export COMET_JAR=spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar
```

Set `SPARK_HOME` to point to the relevant Spark installation and `SPARK_MASTER` to the master URL, then
use `spark-submit` to run the benchmark script.

```shell
export COMET_JAR=`pwd`/../spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.memory=32G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.eventLog.enabled=true \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.comet.shuffle.enforceMode.enabled=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    cometbench.py \
    --data /mnt/bigdata/tpcds/sf100 \
    --query join_exploding_output.sql \
    --iterations 3
```
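The same command can also be assembled programmatically, which may be convenient when running several queries or toggling a setting between runs. The following is a minimal sketch, not part of this commit: `build_submit_command` and `BASE_CONF` are hypothetical names, the configuration values are copied from the example above, and the environment is passed in as a plain dictionary so that the function can be exercised without a Spark installation. Whether other Comet settings should also change for a baseline run is left open here.

```python
import shlex

# Settings copied from the spark-submit example above.
BASE_CONF = {
    "spark.driver.memory": "8G",
    "spark.executor.memory": "32G",
    "spark.executor.cores": "8",
    "spark.cores.max": "8",
    "spark.eventLog.enabled": "true",
    "spark.sql.extensions": "org.apache.comet.CometSparkSessionExtensions",
    "spark.comet.enabled": "true",
    "spark.comet.exec.all.enabled": "true",
    "spark.comet.cast.allowIncompatible": "true",
    "spark.comet.explainFallback.enabled": "true",
    "spark.comet.shuffle.enforceMode.enabled": "true",
    "spark.comet.exec.shuffle.enabled": "true",
    "spark.comet.exec.shuffle.mode": "auto",
    "spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager",
}


def build_submit_command(env, data_path, query, iterations=3, comet_exec_enabled=True):
    """Return the spark-submit argument list for one benchmark run (hypothetical helper)."""
    jar = env["COMET_JAR"]
    conf = dict(BASE_CONF)
    conf["spark.driver.extraClassPath"] = jar
    conf["spark.executor.extraClassPath"] = jar
    # The only setting toggled by this sketch.
    conf["spark.comet.exec.enabled"] = "true" if comet_exec_enabled else "false"

    cmd = [env["SPARK_HOME"] + "/bin/spark-submit", "--master", env["SPARK_MASTER"], "--jars", jar]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd += ["cometbench.py", "--data", data_path, "--query", query, "--iterations", str(iterations)]
    return cmd


# Placeholder environment for illustration; in practice pass os.environ instead.
example_env = {
    "SPARK_HOME": "/opt/spark",
    "SPARK_MASTER": "local[*]",
    "COMET_JAR": "/tmp/comet.jar",
}
print(shlex.join(build_submit_command(example_env, "/mnt/bigdata/tpcds/sf100", "join_exploding_output.sql")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with values such as `local[*]`.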

When benchmarking Comet, we generally want to compare the performance of Spark with Comet disabled to
the performance of Spark with Comet enabled. Comet can be enabled or disabled by setting the
`spark.comet.exec.enabled` config to `true` or `false`.
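
`cometbench.py` prints the list of per-iteration durations at the end of each run, so the two runs can be compared by hand or with a few lines of Python. The sketch below is one possible way to do that, not part of the benchmark script: `summarize` and `speedup` are hypothetical helper names, dropping the first iteration as a warm-up is an assumption, and the durations in the example are placeholder values rather than measured results.

```python
from statistics import median


def summarize(durations, warmup=1):
    """Median duration, skipping warm-up iterations when enough samples exist."""
    timed = durations[warmup:] if len(durations) > warmup else durations
    return median(timed)


def speedup(spark_durations, comet_durations, warmup=1):
    """Ratio of the Spark-only median to the Comet median (greater than 1 means Comet was faster)."""
    return summarize(spark_durations, warmup) / summarize(comet_durations, warmup)


# Placeholder durations in seconds, for illustration only.
spark_only = [12.0, 10.0, 10.0]
with_comet = [6.0, 5.0, 5.0]
print(f"Speedup: {speedup(spark_only, with_comet):.2f}x")
```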
34 changes: 34 additions & 0 deletions benchmarks/add_many_decimals.sql
@@ -0,0 +1,34 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- This is testing the cost of a complex expression that will create many intermediate arrays in Comet

select sum(
    ss_wholesale_cost+
    ss_list_price+
    ss_sales_price+
    ss_ext_discount_amt+
    ss_ext_sales_price+
    ss_ext_wholesale_cost+
    ss_ext_list_price+
    ss_ext_tax+
    ss_coupon_amt+
    ss_net_paid+
    ss_net_paid_inc_tax+
    ss_net_profit
)
from store_sales;
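
To illustrate what "many intermediate arrays" means for the query above, here is a conceptual sketch in plain Python. It assumes that each binary `+` is evaluated as a separate vectorized step that materializes its own output array; that is an assumption made for illustration, not a description of how Comet is implemented. The names `add_columns` and `evaluate_sum_chain` are hypothetical, and NULL handling is ignored.

```python
from functools import reduce


def add_columns(left, right, counter):
    """Element-wise addition that allocates a new array for its result."""
    counter["arrays"] += 1
    return [a + b for a, b in zip(left, right)]


def evaluate_sum_chain(columns):
    """Evaluate sum(c1 + c2 + ... + cN) and count the arrays produced along the way."""
    counter = {"arrays": 0}
    combined = reduce(lambda acc, col: add_columns(acc, col, counter), columns)
    return sum(combined), counter["arrays"]


# Twelve columns, as in the query above, each with three placeholder rows.
columns = [[1, 2, 3] for _ in range(12)]
total, arrays_created = evaluate_sum_chain(columns)
print(f"total={total}, arrays created={arrays_created}")
```

Under this model, a chain of N columns produces N - 1 arrays before the final aggregation, which is the cost this benchmark is meant to expose.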
33 changes: 33 additions & 0 deletions benchmarks/add_many_integers.sql
@@ -0,0 +1,33 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- This is testing the cost of a complex expression that will create many intermediate arrays in Comet

select sum(
    ss_sold_date_sk+
    ss_sold_time_sk+
    ss_item_sk+
    ss_customer_sk+
    ss_cdemo_sk+
    ss_hdemo_sk+
    ss_addr_sk+
    ss_store_sk+
    ss_promo_sk+
    ss_ticket_number+
    ss_quantity
)
from store_sales;
18 changes: 18 additions & 0 deletions benchmarks/agg_high_cardinality.sql
@@ -0,0 +1,18 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select ws_item_sk, sum(ws_wholesale_cost) from web_sales group by ws_item_sk;
18 changes: 18 additions & 0 deletions benchmarks/agg_low_cardinality.sql
@@ -0,0 +1,18 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select ws_warehouse_sk, sum(ws_wholesale_cost) from web_sales group by ws_warehouse_sk;
31 changes: 31 additions & 0 deletions benchmarks/agg_no_grouping.sql
@@ -0,0 +1,31 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select
    sum(ss_wholesale_cost),
    sum(ss_list_price),
    sum(ss_sales_price),
    sum(ss_ext_discount_amt),
    sum(ss_ext_sales_price),
    sum(ss_ext_wholesale_cost),
    sum(ss_ext_list_price),
    sum(ss_ext_tax),
    sum(ss_coupon_amt),
    sum(ss_net_paid),
    sum(ss_net_paid_inc_tax),
    sum(ss_net_profit)
from store_sales;
26 changes: 26 additions & 0 deletions benchmarks/case_when_column_or_null.sql
@@ -0,0 +1,26 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select
    sum(case when (ss_quantity=1) then ss_sales_price else null end) sun_sales,
    sum(case when (ss_quantity=2) then ss_sales_price else null end) mon_sales,
    sum(case when (ss_quantity=3) then ss_sales_price else null end) tue_sales,
    sum(case when (ss_quantity=4) then ss_sales_price else null end) wed_sales,
    sum(case when (ss_quantity=5) then ss_sales_price else null end) thu_sales,
    sum(case when (ss_quantity=6) then ss_sales_price else null end) fri_sales,
    sum(case when (ss_quantity=7) then ss_sales_price else null end) sat_sales
from store_sales;
22 changes: 22 additions & 0 deletions benchmarks/case_when_scalar.sql
@@ -0,0 +1,22 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select
    sum(case when ws_wholesale_cost > 10 then 1 else 0 end) as a,
    sum(case when ws_wholesale_cost > 20 then 1 else 0 end) as b,
    sum(case when ws_wholesale_cost > 30 then 1 else 0 end) as c
from web_sales;
75 changes: 75 additions & 0 deletions benchmarks/cometbench.py
@@ -0,0 +1,75 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

import argparse
import os
from pyspark.sql import SparkSession
import time

def main(data_path: str, query_path: str, iterations: int):

    # Initialize a SparkSession
    spark = SparkSession.builder \
        .appName("DataFusion Microbenchmarks: " + os.path.basename(query_path)) \
        .getOrCreate()

    # Register the tables
    table_names = ["call_center", "catalog_page", "catalog_returns", "catalog_sales", "customer",
        "customer_address", "customer_demographics", "date_dim", "time_dim", "household_demographics",
        "income_band", "inventory", "item", "promotion", "reason", "ship_mode", "store", "store_returns",
        "store_sales", "warehouse", "web_page", "web_returns", "web_sales", "web_site"]

    for table in table_names:
        path = f"{data_path}/{table}.parquet"
        print(f"Registering table {table} using path {path}")
        df = spark.read.parquet(path)
        df.createOrReplaceTempView(table)

    # Read the SQL query from the file
    print(f"Reading query from path {query_path}")
    with open(query_path, "r") as f:
        sql = f.read().strip()

    # Run the query the requested number of times and record each duration
    durations = []
    for iteration in range(0, iterations):
        print(f"Starting iteration {iteration + 1} of {iterations}")

        start_time = time.time()
        df = spark.sql(sql)
        rows = df.collect()
        # Stop the clock before printing so that only query execution is timed
        end_time = time.time()

        print(f"Query returned {len(rows)} rows")
        duration = end_time - start_time
        print(f"Query took {duration} seconds")

        durations.append(duration)

    # Stop the SparkSession
    spark.stop()

    print(durations)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="DataFusion benchmark derived from TPC-H / TPC-DS")
    parser.add_argument("--data", required=True, help="Path to data files")
    parser.add_argument("--query", required=True, help="Path to query file")
    parser.add_argument("--iterations", required=False, type=int, default=1, help="How many iterations to run")
    args = parser.parse_args()

    main(args.data, args.query, args.iterations)
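
The script above expects one `<table>.parquet` path per table directly under the `--data` directory. Before launching a long run, it can be useful to check that layout. The sketch below is a hypothetical pre-flight check, not part of this commit; it assumes the data lives on a local filesystem (it would not work as written for HDFS or object-store paths), and `missing_tables` is an illustrative name.

```python
import os
import tempfile


def missing_tables(data_path, table_names):
    """Return the tables that have no '<table>.parquet' entry under data_path."""
    return [
        table
        for table in table_names
        if not os.path.exists(os.path.join(data_path, f"{table}.parquet"))
    ]


# Demonstration with a temporary directory that contains only one table.
with tempfile.TemporaryDirectory() as data_dir:
    os.mkdir(os.path.join(data_dir, "store_sales.parquet"))
    print(missing_tables(data_dir, ["store_sales", "web_sales"]))  # prints ['web_sales']
```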
18 changes: 18 additions & 0 deletions benchmarks/filter_highly_selective.sql
@@ -0,0 +1,18 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select sum(ws_wholesale_cost) from web_sales where ws_wholesale_cost = 100;
18 changes: 18 additions & 0 deletions benchmarks/filter_less_selective.sql
@@ -0,0 +1,18 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

select sum(ws_wholesale_cost) from web_sales where ws_wholesale_cost > 10;