Merged
33 changes: 28 additions & 5 deletions README.md
@@ -32,7 +32,7 @@ LakeBench exists to bring clarity, trust, accessibility, and relevance to engine


## ✅ Why LakeBench?
- **Multi-Engine**: Benchmark Spark, DuckDB, Polars, and many more planned, side-by-side
- **Multi-Engine**: Benchmark Spark, DuckDB, Polars, Daft, Sail, and others, side-by-side
- **Lifecycle Coverage**: Ingest, transform, maintain, and query—just like real workloads
- **Diverse Workloads**: Test performance across varied data shapes and operations
- **Consistent Execution**: One framework, many engines
@@ -46,7 +46,7 @@ LakeBench empowers data teams to make informed engine decisions based on real wo

LakeBench currently supports four benchmarks with more to come:

- **ELTBench**: A benchmark with various modes (`light`, `full`) that simulates typical ELT workloads:
- **ELTBench**: A benchmark that simulates typical ELT workloads:
- Raw data load (Parquet → Delta)
- Fact table generation
- Incremental merge processing
@@ -65,7 +65,10 @@ LakeBench supports multiple lakehouse compute engines. Each benchmark scenario d

| Engine | ELTBench | TPC-DS | TPC-H | ClickBench |
|-----------------|:--------:|:------:|:-------:|:----------:|
| Spark (Fabric) | ✅ | ✅ | ✅ | ✅ |
| Spark (Generic) | ✅ | ✅ | ✅ | ✅ |
| Fabric Spark | ✅ | ✅ | ✅ | ✅ |
| Synapse Spark | ✅ | ✅ | ✅ | ✅ |
| HDInsight Spark | ✅ | ✅ | ✅ | ✅ |
| DuckDB | ✅ | ✅ | ✅ | ✅ |
| Polars | ✅ | ⚠️ | ⚠️ | 🔜 |
| Daft | ✅ | ⚠️ | ⚠️ | 🔜 |
@@ -77,6 +80,28 @@ LakeBench supports multiple lakehouse compute engines. Each benchmark scenario d
> 🔜 = Coming Soon
> (Blank) = Not currently supported

## Where Can I Run LakeBench?
LakeBench's flexibility doesn't end at benchmarks and engines: it also supports multiple runtimes and storage backends:

**Runtimes**:
- Local (Windows)
- Fabric
- Synapse
- HDInsight
- Google Colab ⚠️

**Storage Systems**:
- Local filesystem (Windows)
- OneLake
- ADLS Gen2 (currently only in Fabric, Synapse, and HDInsight)
- S3 ⚠️
- GS ⚠️

_⚠️ denotes experimental support_

## What Table Formats Are Supported?
LakeBench currently only supports Delta Lake.

## 🔌 Extensibility by Design

LakeBench is designed to be _extensible_, both for additional engines and benchmarks.
@@ -123,8 +148,6 @@ Install from PyPI:
pip install lakebench[duckdb,polars,daft,tpcds_datagen,tpch_datagen,sparkmeasure]
```

_Note: in this initial beta version, all engines have only been tested inside Microsoft Fabric Python and Spark Notebooks._

## Example Usage
To run any LakeBench benchmark, first perform a one-time generation of the data required for the benchmark and scale of interest. LakeBench provides datagen classes to quickly generate the Parquet datasets required by the benchmarks.
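As a sketch of the end-to-end flow: the engine and benchmark calls below mirror the example notebooks added in this PR, while the datagen class name is hypothetical — check the package for the exact classes exposed by the `tpch_datagen` extra. Storage URIs are placeholders.

```python
from lakebench.engines import SynapseSpark
from lakebench.benchmarks import TPCH

# One-time datagen step (class name is hypothetical — see the package
# for what the `tpch_datagen` extra actually exposes):
# from lakebench.datagen import TPCHDataGen
# TPCHDataGen(scale_factor=10, target_folder_uri='abfss://<container>@<storage_account_name>.dfs.core.windows.net/Files/tpch_sf10').run()

# Point the benchmark at the generated Parquet data and run all queries.
engine = SynapseSpark(
    schema_name='spark_tpch_sf10',
    spark_measure_telemetry=False
)

benchmark = TPCH(
    engine=engine,
    scenario_name="SF10 - All Queries",
    input_parquet_folder_uri='abfss://<container>@<storage_account_name>.dfs.core.windows.net/Files/tpch_sf10',
    save_results=True,
    result_table_uri='abfss://<container>@<storage_account_name>.dfs.core.windows.net/Tables/dbo/results'
)
benchmark.run(mode="query")
```

The same pattern applies to the other benchmarks: swap `TPCH` for `TPCDS` or `ELTBench` and pass the corresponding `mode`.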

182 changes: 182 additions & 0 deletions examples/benchmarks/hdi_spark.ipynb
@@ -0,0 +1,182 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "04aa8c89",
"metadata": {},
"outputs": [],
"source": [
"%%configure -f\n",
"{\n",
" \"conf\": {\n",
" \"spark.jars\": \"abfss://<container>@<storage_account_name>.dfs.core.windows.net/jars/delta-core_2.12-2.1.1.jar\"\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c398c05",
"metadata": {},
"outputs": [],
"source": [
"# build lakebench zip and upload to ADLS Gen2\n",
"sc.addPyFile(\"abfss://<container>@<storage_account_name>.dfs.core.windows.net/libs/lakebench.zip\")\n",
"import lakebench"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ab46f85",
"metadata": {},
"outputs": [],
"source": [
"# Enable arbitrary Delta table properties to prevent failure if LakeBench attempts to set newer properties that are not the HDI compatible version of Delta Lake\n",
"spark.conf.set('spark.databricks.delta.allowArbitraryProperties.enabled', True)"
]
},
{
"cell_type": "markdown",
"id": "24c7f205",
"metadata": {},
"source": [
"## Run ELTBench in `light` mode"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "feb7d1b3",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import HDISpark\n",
"from lakebench.benchmarks import ELTBench\n",
"\n",
"engine = HDISpark(\n",
" schema_name ='spark_eltbench_test',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = ELTBench(\n",
" engine=engine,\n",
" scenario_name=\"SF1\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds_sf1',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/lakebench/results'\n",
" )\n",
"benchmark.run(mode=\"light\")"
]
},
{
"cell_type": "markdown",
"id": "6d1ab723",
"metadata": {},
"source": [
"## Run TPCDS `power_test` (Load tables and run all queries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "feaf7122",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import HDISpark\n",
"from lakebench.benchmarks import TPCDS\n",
"\n",
"engine = HDISpark(\n",
" schema_name = 'spark_tpcds_sf1',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = TPCDS(\n",
" engine=engine,\n",
" scenario_name=\"SF1 - Power Test\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds_sf1',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/dbo/results'\n",
" )\n",
"benchmark.run(mode=\"power_test\")"
]
},
{
"cell_type": "markdown",
"id": "88ac860b",
"metadata": {},
"source": [
"## Run TPCDS `query` test: q1 run 4 times"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cae6db9b",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import HDISpark\n",
"from lakebench.benchmarks import TPCDS\n",
"\n",
"engine = HDISpark(\n",
" schema_name = 'spark_tpcds_sf1',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = TPCDS(\n",
" engine=engine,\n",
" scenario_name=\"SF1 - Q4*4\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds/sf1_parquet',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/dbo/results',\n",
" query_list=['q1'] * 4\n",
" )\n",
"benchmark.run(mode=\"query\")"
]
},
{
"cell_type": "markdown",
"id": "52a01f5b",
"metadata": {},
"source": [
"## Run TPCH Query Test (Run all queries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0768e9b8",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import HDISpark\n",
"from lakebench.benchmarks import TPCH\n",
"\n",
"engine = HDISpark(\n",
" schema_name = 'spark_tpch_sf10',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = TPCH(\n",
" engine=engine,\n",
" scenario_name=\"SF10 - All Queries\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds/sf10_parquet',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/dbo/results'\n",
" )\n",
"benchmark.run(mode=\"query\")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
144 changes: 144 additions & 0 deletions examples/benchmarks/synapse_spark.ipynb
@@ -0,0 +1,144 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "24c7f205",
"metadata": {},
"source": [
"## Run ELTBench in `light` mode"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "feb7d1b3",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import SynapseSpark\n",
"from lakebench.benchmarks import ELTBench\n",
"\n",
"engine = SynapseSpark(\n",
" schema_name ='spark_eltbench_test',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = ELTBench(\n",
" engine=engine,\n",
" scenario_name=\"SF1\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds_sf1',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/lakebench/results'\n",
" )\n",
"benchmark.run(mode=\"light\")"
]
},
{
"cell_type": "markdown",
"id": "6d1ab723",
"metadata": {},
"source": [
"## Run TPCDS `power_test` (Load tables and run all queries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "feaf7122",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import SynapseSpark\n",
"from lakebench.benchmarks import TPCDS\n",
"\n",
"engine = SynapseSpark(\n",
" schema_name = 'spark_tpcds_sf1',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = TPCDS(\n",
" engine=engine,\n",
" scenario_name=\"SF1 - Power Test\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds_sf1',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/dbo/results'\n",
" )\n",
"benchmark.run(mode=\"power_test\")"
]
},
{
"cell_type": "markdown",
"id": "88ac860b",
"metadata": {},
"source": [
"## Run TPCDS `query` test: q1 run 4 times"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cae6db9b",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import SynapseSpark\n",
"from lakebench.benchmarks import TPCDS\n",
"\n",
"engine = SynapseSpark(\n",
" schema_name = 'spark_tpcds_sf1',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = TPCDS(\n",
" engine=engine,\n",
" scenario_name=\"SF1 - Q4*4\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds/sf1_parquet',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/dbo/results',\n",
" query_list=['q1'] * 4\n",
" )\n",
"benchmark.run(mode=\"query\")"
]
},
{
"cell_type": "markdown",
"id": "52a01f5b",
"metadata": {},
"source": [
"## Run TPCH Query Test (Run all queries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0768e9b8",
"metadata": {},
"outputs": [],
"source": [
"from lakebench.engines import SynapseSpark\n",
"from lakebench.benchmarks import TPCH\n",
"\n",
"engine = SynapseSpark(\n",
" schema_name = 'spark_tpch_sf10',\n",
" spark_measure_telemetry = False\n",
")\n",
"\n",
"benchmark = TPCH(\n",
" engine=engine,\n",
" scenario_name=\"SF10 - All Queries\",\n",
" input_parquet_folder_uri='abfss://........./Files/tpcds/sf10_parquet',\n",
" save_results=True,\n",
" result_table_uri='abfss://......../Tables/dbo/results'\n",
" )\n",
"benchmark.run(mode=\"query\")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}