docs: Various documentation improvements #1005

Merged
merged 2 commits on Oct 8, 2024
4 changes: 3 additions & 1 deletion README.md
@@ -30,10 +30,12 @@ under the License.
<img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/>

Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
[Apache DataFusion](https://datafusion.apache.org) query engine. Comet is designed to significantly enhance the
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.

[Apache DataFusion]: https://datafusion.apache.org

# Benefits of Using Comet

## Run Spark Queries at DataFusion Speeds
Binary file not shown.
100 changes: 100 additions & 0 deletions docs/source/_static/images/CometNativeParquetReader.drawio

Large diffs are not rendered by default.

Binary file not shown.
94 changes: 94 additions & 0 deletions docs/source/_static/images/CometOverviewDetailed.drawio

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/source/_static/images/CometOverviewDetailed.drawio.svg
4 changes: 2 additions & 2 deletions docs/source/contributor-guide/plugin_overview.md
@@ -79,10 +79,10 @@ The leaf nodes in the physical plan are always `ScanExec` and these operators co
prepared before the plan is executed. When `CometExecIterator` invokes `Native.executePlan` it passes the memory
addresses of these Arrow arrays to the native code.

![Diagram of Comet Native Execution](../../_static/images/CometNativeExecution.drawio.png)
![Diagram of Comet Native Execution](../../_static/images/CometOverviewDetailed.drawio.svg)

## End to End Flow

The following diagram shows the end-to-end flow.

![Diagram of Comet Native Parquet Scan](../../_static/images/CometNativeParquetScan.drawio.png)
![Diagram of Comet Native Parquet Scan](../../_static/images/CometNativeParquetReader.drawio.svg)
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -42,6 +42,8 @@ as a native runtime to achieve improvement in terms of query efficiency and quer

Comet Overview <user-guide/overview>
Installing Comet <user-guide/installation>
Building From Source <user-guide/source>
Kubernetes Guide <user-guide/kubernetes>
Supported Data Sources <user-guide/datasources>
Supported Data Types <user-guide/datatypes>
Supported Operators <user-guide/operators>
Expand Down
107 changes: 36 additions & 71 deletions docs/source/user-guide/installation.md
@@ -19,73 +19,54 @@

# Installing DataFusion Comet

## Prerequisites

Make sure the following requirements are met and the required software is installed on your machine.

## Supported Platforms
### Supported Operating Systems

- Linux
- Apple OSX (Intel and Apple Silicon)

## Requirements
### Supported Spark Versions

- [Apache Spark supported by Comet](overview.md#supported-apache-spark-versions)
- JDK 8 and up
- GLIBC 2.17 (CentOS 7) and up
Comet currently supports the following versions of Apache Spark:

## Deploying to Kubernetes
- 3.3.x (Java 8/11/17, Scala 2.12/2.13)
- 3.4.x (Java 8/11/17, Scala 2.12/2.13)
- 3.5.x (Java 8/11/17, Scala 2.12/2.13)

See the [Comet Kubernetes Guide](kubernetes.md).

## Using a Published JAR File
Experimental support is provided for the following versions of Apache Spark. This support is intended for development and
testing only and should not yet be used in production.

Pre-built jar files are available in Maven central at https://central.sonatype.com/namespace/org.apache.datafusion
- 4.0.0-preview1 (Java 17/21, Scala 2.13)

## Using a Published Source Release

Official source releases can be downloaded from https://dist.apache.org/repos/dist/release/datafusion/

```console
# Pick the latest version
export COMET_VERSION=0.3.0
# Download the tarball
curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
# Unpack
tar -xzf apache-datafusion-comet-$COMET_VERSION.tar.gz
cd apache-datafusion-comet-$COMET_VERSION
```
Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
Cloud Service Providers.

Build:

```console
make release-nogit PROFILES="-Pspark-3.4"
```

## Building from the GitHub repository
## Using a Published JAR File

Clone the repository:
Comet jar files are available in [Maven Central](https://central.sonatype.com/namespace/org.apache.datafusion).

```console
git clone https://github.com/apache/datafusion-comet.git
```
Here are the direct links for downloading the Comet jar files.

Build Comet for a specific Spark version:
- [Comet plugin for Spark 3.3 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.3_2.12/0.3.0/comet-spark-spark3.3_2.12-0.3.0.jar)
- [Comet plugin for Spark 3.3 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.3_2.13/0.3.0/comet-spark-spark3.3_2.13-0.3.0.jar)
- [Comet plugin for Spark 3.4 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.4_2.12/0.3.0/comet-spark-spark3.4_2.12-0.3.0.jar)
- [Comet plugin for Spark 3.4 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.4_2.13/0.3.0/comet-spark-spark3.4_2.13-0.3.0.jar)
- [Comet plugin for Spark 3.5 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.3.0/comet-spark-spark3.5_2.12-0.3.0.jar)
- [Comet plugin for Spark 3.5 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.13/0.3.0/comet-spark-spark3.5_2.13-0.3.0.jar)
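
For other version combinations, the jar URL follows the standard Maven Central layout. The sketch below assembles the
URL from the Spark, Scala, and Comet versions (the version values shown are examples to substitute):

```console
# Assemble the Maven Central URL for a Comet plugin jar.
# The version values below are examples; substitute the versions you need.
COMET_VERSION="0.3.0"
SPARK_VERSION="3.4"
SCALA_VERSION="2.12"
ARTIFACT="comet-spark-spark${SPARK_VERSION}_${SCALA_VERSION}"
JAR_URL="https://repo1.maven.org/maven2/org/apache/datafusion/${ARTIFACT}/${COMET_VERSION}/${ARTIFACT}-${COMET_VERSION}.jar"
echo "$JAR_URL"
# Download with, for example: curl -LO "$JAR_URL"
```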

```console
cd datafusion-comet
make release PROFILES="-Pspark-3.4"
```
## Building from source

Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:
Refer to the [Building from Source] guide for instructions on building Comet from source, either from an official
source release or from the latest code in the GitHub repository.

```console
make release PROFILES="-Pspark-3.4 -Pscala-2.13"
```
[Building from Source]: source.md

To build Comet from the source distribution in an isolated environment without access to `github.com`, it is necessary to disable `git-commit-id-maven-plugin`; otherwise the build will fail because it cannot access the Git metadata. In that case you may use:
## Deploying to Kubernetes

```console
make release-nogit PROFILES="-Pspark-3.4"
```
See the [Comet Kubernetes Guide](kubernetes.md).

## Run Spark Shell with Comet enabled

@@ -99,11 +80,10 @@ $SPARK_HOME/bin/spark-shell \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.driver.memory=1g \
--conf spark.executor.memory=1g
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
```
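
To avoid retyping these flags, they can also be collected into a small launcher script. This is only a sketch, not an
official wrapper; the jar path is a placeholder assumption and the memory sizes should be adapted to your cluster:

```console
#!/bin/sh
# Sketch: assemble the Comet-related Spark confs in one place.
# COMET_JAR is an assumed placeholder path; point it at your actual jar.
COMET_JAR="${COMET_JAR:-/opt/comet/comet-spark-spark3.4_2.12-0.3.0.jar}"

COMET_CONFS="--conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.comet.explainFallback.enabled=true \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=16g"

# Print the command rather than running it, so it can be reviewed first.
echo "$SPARK_HOME/bin/spark-shell $COMET_CONFS"
```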

### Verify Comet enabled for Spark SQL query
@@ -142,20 +122,9 @@ WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts
- Execute InsertIntoHadoopFsRelationCommand is not supported
```

### Enable Comet shuffle
## Additional Configuration

The Comet shuffle feature is disabled by default. To enable it, add the related configs:

```
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
--conf spark.comet.exec.shuffle.enabled=true
```

The above configs enable Comet native shuffle, which only supports hash partitioning and single partitions.
Comet native shuffle doesn't support complex types yet.

Comet doesn't have official release yet so currently the only way to test it is to build jar and include it in your
Spark application. Depending on your deployment mode you may also need to set the driver & executor class path(s) to
Depending on your deployment mode you may also need to set the driver & executor class path(s) to
explicitly contain Comet otherwise Spark may use a different class-loader for the Comet components than its internal
components which will then fail at runtime. For example:

@@ -165,11 +134,7 @@

Some cluster managers may require additional configuration, see <https://spark.apache.org/docs/latest/cluster-overview.html>

To enable columnar shuffle which supports all partitioning and basic complex types, one more config is required:

```
--conf spark.comet.exec.shuffle.mode=jvm
```

### Memory tuning
In addition to Apache Spark memory configuration parameters the Comet introduces own parameters to configure memory allocation for native execution. More [Comet Memory Tuning](./tuning.md)

In addition to Apache Spark memory configuration parameters, Comet introduces additional parameters to configure memory
allocation for native execution. See [Comet Memory Tuning](./tuning.md) for details.
34 changes: 15 additions & 19 deletions docs/source/user-guide/overview.md
@@ -19,8 +19,14 @@

# Comet Overview

Comet runs Spark SQL queries using the native Apache DataFusion runtime, which is
typically faster and more resource efficient than JVM based runtimes.
Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.

[Apache DataFusion]: https://datafusion.apache.org

The following diagram provides an overview of Comet's architecture.

![Comet Overview](../_static/images/comet-overview.png)

@@ -34,26 +40,10 @@ Comet aims to support:

## Architecture

The following diagram illustrates the architecture of Comet:
The following diagram shows how Comet integrates with Apache Spark.

![Comet System Diagram](../_static/images/comet-system-diagram.png)

## Supported Apache Spark versions

Comet currently supports the following versions of Apache Spark:

- 3.3.x
- 3.4.x
- 3.5.x

Experimental support is provided for the following versions of Apache Spark and is intended for development/testing
use only and should not be used in production yet.

- 4.0.0-preview1

Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
Cloud Service Providers.

## Feature Parity with Apache Spark

The project strives to keep feature parity with Apache Spark, that is,
@@ -65,3 +55,9 @@ features and fallback to Spark engine.
To achieve this, besides unit tests within Comet itself, we also re-use
Spark SQL tests and make sure they all pass with the Comet extension
enabled.

## Getting Started

Refer to the [Comet Installation Guide] to get started.

[Comet Installation Guide]: installation.md
69 changes: 69 additions & 0 deletions docs/source/user-guide/source.md
@@ -0,0 +1,69 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Building Comet From Source

It is sometimes preferable to build from source for a specific platform.

## Using a Published Source Release

Official source releases can be downloaded from https://dist.apache.org/repos/dist/release/datafusion/

```console
# Pick the latest version
export COMET_VERSION=0.3.0
# Download the tarball
curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
# Unpack
tar -xzf apache-datafusion-comet-$COMET_VERSION.tar.gz
cd apache-datafusion-comet-$COMET_VERSION
```
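
Apache release artifacts are normally accompanied by checksum files, so the tarball can be verified before unpacking.
The sketch below assumes a `.sha512` file in the common `HASH  filename` format sits next to the tarball (on macOS,
`shasum -a 512 -c` can replace `sha512sum -c`):

```console
# Verify the source tarball against its SHA-512 checksum file, if present.
TARBALL="apache-datafusion-comet-$COMET_VERSION.tar.gz"
if [ -f "$TARBALL.sha512" ]; then
  sha512sum -c "$TARBALL.sha512" && echo "checksum OK"
else
  echo "no $TARBALL.sha512 found; skipping verification"
fi
```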

Build:

```console
make release-nogit PROFILES="-Pspark-3.4"
```

## Building from the GitHub repository

Clone the repository:

```console
git clone https://github.com/apache/datafusion-comet.git
```

Build Comet for a specific Spark version:

```console
cd datafusion-comet
make release PROFILES="-Pspark-3.4"
```

Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:

```console
make release PROFILES="-Pspark-3.4 -Pscala-2.13"
```

To build Comet from the source distribution in an isolated environment without access to `github.com`, it is necessary to disable `git-commit-id-maven-plugin`; otherwise the build will fail because it cannot access the Git metadata. In that case you may use:

```console
make release-nogit PROFILES="-Pspark-3.4"
```