docs: Various documentation improvements #1005

Merged
merged 2 commits on Oct 8, 2024
4 changes: 3 additions & 1 deletion README.md
@@ -30,10 +30,12 @@ under the License.
<img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/>

Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
[Apache DataFusion](https://datafusion.apache.org) query engine. Comet is designed to significantly enhance the
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.

[Apache DataFusion]: https://datafusion.apache.org

# Benefits of Using Comet

## Run Spark Queries at DataFusion Speeds
Binary file not shown.
100 changes: 100 additions & 0 deletions docs/source/_static/images/CometNativeParquetReader.drawio

Large diffs are not rendered by default.

Binary file not shown.
94 changes: 94 additions & 0 deletions docs/source/_static/images/CometOverviewDetailed.drawio

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/source/_static/images/CometOverviewDetailed.drawio.svg
4 changes: 2 additions & 2 deletions docs/source/contributor-guide/plugin_overview.md
@@ -79,10 +79,10 @@ The leaf nodes in the physical plan are always `ScanExec` and these operators co
prepared before the plan is executed. When `CometExecIterator` invokes `Native.executePlan` it passes the memory
addresses of these Arrow arrays to the native code.

![Diagram of Comet Native Execution](../../_static/images/CometNativeExecution.drawio.png)
![Diagram of Comet Native Execution](../../_static/images/CometOverviewDetailed.drawio.svg)

## End to End Flow

The following diagram shows the end-to-end flow.

![Diagram of Comet Native Parquet Scan](../../_static/images/CometNativeParquetScan.drawio.png)
![Diagram of Comet Native Parquet Scan](../../_static/images/CometNativeParquetReader.drawio.svg)
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -42,6 +42,8 @@ as a native runtime to achieve improvement in terms of query efficiency and quer

Comet Overview <user-guide/overview>
Installing Comet <user-guide/installation>
Building From Source <user-guide/source>
Kubernetes Guide <user-guide/kubernetes>
Supported Data Sources <user-guide/datasources>
Supported Data Types <user-guide/datatypes>
Supported Operators <user-guide/operators>
Expand Down
107 changes: 36 additions & 71 deletions docs/source/user-guide/installation.md
@@ -19,73 +19,54 @@

# Installing DataFusion Comet

## Prerequisites

Make sure the following requirements are met and the required software is installed on your machine.

## Supported Platforms
### Supported Operating Systems

- Linux
- Apple OSX (Intel and Apple Silicon)

## Requirements
### Supported Spark Versions

- [Apache Spark supported by Comet](overview.md#supported-apache-spark-versions)
- JDK 8 and up
- GLIBC 2.17 (CentOS 7) and up
Comet currently supports the following versions of Apache Spark:

## Deploying to Kubernetes
- 3.3.x (Java 8/11/17, Scala 2.12/2.13)
- 3.4.x (Java 8/11/17, Scala 2.12/2.13)
- 3.5.x (Java 8/11/17, Scala 2.12/2.13)

See the [Comet Kubernetes Guide](kubernetes.md).

## Using a Published JAR File
Experimental support is provided for the following versions of Apache Spark. This support is intended for development and
testing only and should not yet be used in production.

Pre-built jar files are available in Maven central at https://central.sonatype.com/namespace/org.apache.datafusion
- 4.0.0-preview1 (Java 17/21, Scala 2.13)

## Using a Published Source Release

Official source releases can be downloaded from https://dist.apache.org/repos/dist/release/datafusion/

```console
# Pick the latest version
export COMET_VERSION=0.3.0
# Download the tarball
curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
# Unpack
tar -xzf apache-datafusion-comet-$COMET_VERSION.tar.gz
cd apache-datafusion-comet-$COMET_VERSION
```
Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
Cloud Service Providers.

Build:

```console
make release-nogit PROFILES="-Pspark-3.4"
```

## Building from the GitHub repository
## Using a Published JAR File

Clone the repository:
Comet jar files are available in [Maven Central](https://central.sonatype.com/namespace/org.apache.datafusion).

```console
git clone https://github.com/apache/datafusion-comet.git
```
Here are the direct links for downloading the Comet jar files.

Build Comet for a specific Spark version:
- [Comet plugin for Spark 3.3 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.3_2.12/0.3.0/comet-spark-spark3.3_2.12-0.3.0.jar)
- [Comet plugin for Spark 3.3 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.3_2.13/0.3.0/comet-spark-spark3.3_2.13-0.3.0.jar)
- [Comet plugin for Spark 3.4 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.4_2.12/0.3.0/comet-spark-spark3.4_2.12-0.3.0.jar)
- [Comet plugin for Spark 3.4 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.4_2.13/0.3.0/comet-spark-spark3.4_2.13-0.3.0.jar)
- [Comet plugin for Spark 3.5 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.3.0/comet-spark-spark3.5_2.12-0.3.0.jar)
- [Comet plugin for Spark 3.5 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.13/0.3.0/comet-spark-spark3.5_2.13-0.3.0.jar)
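
For other version combinations, the jar URL follows the standard Maven Central layout. The sketch below assembles the
URL from the Spark, Scala, and Comet versions (the version values shown are examples to substitute):

```console
# Assemble the Maven Central URL for a Comet plugin jar.
# The version values below are examples; substitute the versions you need.
COMET_VERSION="0.3.0"
SPARK_VERSION="3.4"
SCALA_VERSION="2.12"
ARTIFACT="comet-spark-spark${SPARK_VERSION}_${SCALA_VERSION}"
JAR_URL="https://repo1.maven.org/maven2/org/apache/datafusion/${ARTIFACT}/${COMET_VERSION}/${ARTIFACT}-${COMET_VERSION}.jar"
echo "$JAR_URL"
# Download with, for example: curl -LO "$JAR_URL"
```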

```console
cd datafusion-comet
make release PROFILES="-Pspark-3.4"
```
## Building from source

Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:
Refer to the [Building from Source] guide for instructions on building Comet from source, either from an official
source release or from the latest code in the GitHub repository.

```console
make release PROFILES="-Pspark-3.4 -Pscala-2.13"
```
[Building from Source]: source.md

To build Comet from the source distribution in an isolated environment without access to `github.com`, it is necessary to disable `git-commit-id-maven-plugin`; otherwise the build will fail because it cannot access the Git metadata. In that case you may use:
## Deploying to Kubernetes

```console
make release-nogit PROFILES="-Pspark-3.4"
```
See the [Comet Kubernetes Guide](kubernetes.md).

## Run Spark Shell with Comet enabled

@@ -99,11 +80,10 @@ $SPARK_HOME/bin/spark-shell \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.driver.memory=1g \
--conf spark.executor.memory=1g
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
```
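
To avoid retyping these flags, they can also be collected into a small launcher script. This is only a sketch, not an
official wrapper; the jar path is a placeholder assumption and the memory sizes should be adapted to your cluster:

```console
#!/bin/sh
# Sketch: assemble the Comet-related Spark confs in one place.
# COMET_JAR is an assumed placeholder path; point it at your actual jar.
COMET_JAR="${COMET_JAR:-/opt/comet/comet-spark-spark3.4_2.12-0.3.0.jar}"

COMET_CONFS="--conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.comet.explainFallback.enabled=true \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=16g"

# Print the command rather than running it, so it can be reviewed first.
echo "$SPARK_HOME/bin/spark-shell $COMET_CONFS"
```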

### Verify Comet enabled for Spark SQL query
@@ -142,20 +122,9 @@ WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts
- Execute InsertIntoHadoopFsRelationCommand is not supported
```

### Enable Comet shuffle
## Additional Configuration

The Comet shuffle feature is disabled by default. To enable it, add the related configs:

```
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
--conf spark.comet.exec.shuffle.enabled=true
```

The above configs enable Comet native shuffle, which only supports hash partitioning and single partitions.
Comet native shuffle doesn't support complex types yet.

Comet doesn't have official release yet so currently the only way to test it is to build jar and include it in your
Spark application. Depending on your deployment mode you may also need to set the driver & executor class path(s) to
Depending on your deployment mode you may also need to set the driver & executor class path(s) to
explicitly contain Comet otherwise Spark may use a different class-loader for the Comet components than its internal
components which will then fail at runtime. For example:

@@ -165,11 +134,7 @@

Some cluster managers may require additional configuration, see <https://spark.apache.org/docs/latest/cluster-overview.html>

To enable columnar shuffle which supports all partitioning and basic complex types, one more config is required:

```
--conf spark.comet.exec.shuffle.mode=jvm
```

### Memory tuning
In addition to Apache Spark memory configuration parameters the Comet introduces own parameters to configure memory allocation for native execution. More [Comet Memory Tuning](./tuning.md)

In addition to Apache Spark memory configuration parameters, Comet introduces additional parameters to configure memory
allocation for native execution. See [Comet Memory Tuning](./tuning.md) for details.
34 changes: 15 additions & 19 deletions docs/source/user-guide/overview.md
@@ -19,8 +19,14 @@

# Comet Overview

Comet runs Spark SQL queries using the native Apache DataFusion runtime, which is
typically faster and more resource efficient than JVM based runtimes.
Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.

[Apache DataFusion]: https://datafusion.apache.org

The following diagram provides an overview of Comet's architecture.

![Comet Overview](../_static/images/comet-overview.png)

@@ -34,26 +40,10 @@ Comet aims to support:

## Architecture

The following diagram illustrates the architecture of Comet:
The following diagram shows how Comet integrates with Apache Spark.

![Comet System Diagram](../_static/images/comet-system-diagram.png)

## Supported Apache Spark versions

Comet currently supports the following versions of Apache Spark:

- 3.3.x
- 3.4.x
- 3.5.x

Experimental support is provided for the following versions of Apache Spark and is intended for development/testing
use only and should not be used in production yet.

- 4.0.0-preview1

Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
Cloud Service Providers.

## Feature Parity with Apache Spark

The project strives to keep feature parity with Apache Spark, that is,
@@ -65,3 +55,9 @@ features and fallback to Spark engine.
To achieve this, besides unit tests within Comet itself, we also re-use
Spark SQL tests and make sure they all pass with the Comet extension
enabled.

## Getting Started

Refer to the [Comet Installation Guide] to get started.

[Comet Installation Guide]: installation.md
69 changes: 69 additions & 0 deletions docs/source/user-guide/source.md
@@ -0,0 +1,69 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Building Comet From Source

It is sometimes preferable to build from source for a specific platform.

## Using a Published Source Release

Official source releases can be downloaded from https://dist.apache.org/repos/dist/release/datafusion/

```console
# Pick the latest version
export COMET_VERSION=0.3.0
# Download the tarball
curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
# Unpack
tar -xzf apache-datafusion-comet-$COMET_VERSION.tar.gz
cd apache-datafusion-comet-$COMET_VERSION
```
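
Apache release artifacts are normally accompanied by checksum files, so the tarball can be verified before unpacking.
The sketch below assumes a `.sha512` file in the common `HASH  filename` format sits next to the tarball (on macOS,
`shasum -a 512 -c` can replace `sha512sum -c`):

```console
# Verify the source tarball against its SHA-512 checksum file, if present.
TARBALL="apache-datafusion-comet-$COMET_VERSION.tar.gz"
if [ -f "$TARBALL.sha512" ]; then
  sha512sum -c "$TARBALL.sha512" && echo "checksum OK"
else
  echo "no $TARBALL.sha512 found; skipping verification"
fi
```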

Build:

```console
make release-nogit PROFILES="-Pspark-3.4"
```

## Building from the GitHub repository

Clone the repository:

```console
git clone https://github.com/apache/datafusion-comet.git
```

Build Comet for a specific Spark version:

```console
cd datafusion-comet
make release PROFILES="-Pspark-3.4"
```

Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:

```console
make release PROFILES="-Pspark-3.4 -Pscala-2.13"
```

To build Comet from the source distribution in an isolated environment without access to `github.com`, it is necessary to disable `git-commit-id-maven-plugin`; otherwise the build will fail because it cannot access the Git metadata. In that case you may use:

```console
make release-nogit PROFILES="-Pspark-3.4"
```