Merge branch 'apache:master' into master
jiayuasu authored Nov 5, 2024
2 parents 3e4d8e2 + fcea411 commit 649cafc
Showing 37 changed files with 1,611 additions and 41 deletions.
16 changes: 16 additions & 0 deletions .github/workflows/license-templates/LICENSE.txt
@@ -0,0 +1,16 @@
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
17 changes: 12 additions & 5 deletions .pre-commit-config.yaml
@@ -10,15 +10,22 @@ repos:
hooks:
- id: identity
- id: check-hooks-apply
- repo: https://github.com/Lucas-C/pre-commit-hooks
rev: v1.5.5
hooks:
- id: insert-license
name: Add license for all TOML files
files: \.toml$
args:
- --comment-style
- "|#|"
- --license-filepath
- .github/workflows/license-templates/LICENSE.txt
- --fuzzy-match-generates-todo
- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.10.0
hooks:
- id: black-jupyter
# - repo: https://github.com/pycqa/isort
# rev: 5.13.2
# hooks:
# - id: isort
# name: isort (python)
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v19.1.1
hooks:
4 changes: 2 additions & 2 deletions R/vignettes/articles/apache-sedona.Rmd
@@ -296,7 +296,7 @@ file in a supported geospatial format (`sedona_read_*` functions), or by extract
Spark SQL query.

For example, the following code will import data from
[arealm-small.csv](https://github.com/apache/sedona/blob/master/binder/data/arealm-small.csv)
[arealm-small.csv](https://github.com/apache/sedona/blob/master/docs/usecases/data/arealm-small.csv)
into a `SpatialRDD`:

```{r}
@@ -311,7 +311,7 @@ pt_rdd <- sedona_read_dsv_to_typed_rdd(
```

Records from the example
[arealm-small.csv](https://github.com/apache/sedona/blob/master/binder/data/arealm-small.csv)
[arealm-small.csv](https://github.com/apache/sedona/blob/master/docs/usecases/data/arealm-small.csv)
file look like the following:

testattribute0,-88.331492,32.324142,testattribute1,testattribute2
16 changes: 9 additions & 7 deletions README.md
@@ -34,25 +34,26 @@ Join the Sedona monthly community office hour: [Google Calendar](https://calenda

## What is Apache Sedona?

Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as Apache Spark and Apache Flink. Sedona developers can express their spatial data processing tasks in Spatial SQL, Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.
Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as [Apache Spark](https://spark.apache.org/) and [Apache Flink](https://flink.apache.org/).
Sedona developers can express their spatial data processing tasks in [Spatial SQL](https://carto.com/spatial-sql), Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.

![Sedona Ecosystem](docs/image/sedona-ecosystem.png "Sedona Ecosystem")

### Features

Some of the key features of Apache Sedona include:

* Support for a wide range of geospatial data formats, including GeoJSON, WKT, and ESRI Shapefile.
* Support for a wide range of geospatial data formats, including [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON), [WKT](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry), and [ESRI](https://www.esri.com) [Shapefile](https://en.wikipedia.org/wiki/Shapefile).
* Scalable distributed processing of large vector and raster datasets.
* Tools for spatial indexing, spatial querying, and spatial join operations.
* Integration with popular geospatial python tools such as GeoPandas.
* Integration with popular big data tools, such as Spark, Hadoop, Hive, and Flink for data storage and querying.
* A user-friendly API for working with geospatial data in the SQL, Python, Scala and Java languages.
* Integration with popular geospatial Python tools such as [GeoPandas](https://geopandas.org).
* Integration with popular big data tools, such as Spark, [Hadoop](https://hadoop.apache.org/), [Hive](https://hive.apache.org/), and Flink for data storage and querying.
* A user-friendly API for working with geospatial data in the [SQL](https://en.wikipedia.org/wiki/SQL), [Python](https://www.python.org/), [Scala](https://www.scala-lang.org/) and [Java](https://www.java.com) languages.
* Flexible deployment options, including standalone, local, and cluster modes.

These are some of the key features of Apache Sedona, but it may offer additional capabilities depending on the specific version and configuration.

Click [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/apache/sedona/HEAD?filepath=docs/usecases) and play the interactive Sedona Python Jupyter Notebook immediately!
Click [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/apache/sedona/HEAD?filepath=docs/usecases) and play the interactive Sedona Python [Jupyter](https://jupyter.org/) Notebook immediately!

## When to use Sedona?

@@ -150,5 +151,6 @@ Please visit [Apache Sedona website](http://sedona.apache.org/) for detailed inf
## Powered by

<a href="https://www.apache.org/">
<img alt="The Apache Software Foundation" src="https://www.apache.org/foundation/press/kit/asf_logo_wide.png" width="500" class="center">
<img alt="The Apache Software Foundation" class="center" src="https://www.apache.org/foundation/press/kit/asf_logo_wide.png"
title="The Apache Software Foundation" width="500">
</a>
2 changes: 1 addition & 1 deletion docker/sedona-spark-jupyterlab/requirements.txt
@@ -1,6 +1,6 @@
attrs
descartes
fiona==1.8.22
fiona==1.10.1
geopandas==0.14.4
ipykernel
ipywidgets
6 changes: 2 additions & 4 deletions docker/sedona-spark-jupyterlab/sedona-jupyterlab.dockerfile
@@ -19,7 +19,6 @@ FROM ubuntu:22.04

ARG shared_workspace=/opt/workspace
ARG spark_version=3.4.1
ARG hadoop_version=3
ARG hadoop_s3_version=3.3.4
ARG aws_sdk_version=1.12.402
ARG spark_xml_version=0.16.0
@@ -29,8 +28,7 @@ ARG spark_extension_version=2.11.0

# Set up envs
ENV SHARED_WORKSPACE=${shared_workspace}
ENV SPARK_HOME /opt/spark
RUN mkdir ${SPARK_HOME}
ENV SPARK_HOME /usr/local/lib/python3.10/dist-packages/pyspark
ENV SEDONA_HOME /opt/sedona
RUN mkdir ${SEDONA_HOME}

@@ -44,7 +42,7 @@ COPY ./ ${SEDONA_HOME}/

RUN chmod +x ${SEDONA_HOME}/docker/spark.sh
RUN chmod +x ${SEDONA_HOME}/docker/sedona.sh
RUN ${SEDONA_HOME}/docker/spark.sh ${spark_version} ${hadoop_version} ${hadoop_s3_version} ${aws_sdk_version} ${spark_xml_version}
RUN ${SEDONA_HOME}/docker/spark.sh ${spark_version} ${hadoop_s3_version} ${aws_sdk_version} ${spark_xml_version}

# Install Python dependencies
COPY docker/sedona-spark-jupyterlab/requirements.txt /opt/requirements.txt
13 changes: 3 additions & 10 deletions docker/spark.sh
@@ -19,20 +19,16 @@ set -e

# Define variables
spark_version=$1
hadoop_version=$2
hadoop_s3_version=$3
aws_sdk_version=$4
spark_xml_version=$5
hadoop_s3_version=$2
aws_sdk_version=$3
spark_xml_version=$4

# Set up OS libraries
apt-get update
apt-get install -y openjdk-19-jdk-headless curl python3-pip maven
pip3 install --upgrade pip && pip3 install pipenv

# Download Spark jar and set up PySpark
curl https://archive.apache.org/dist/spark/spark-"${spark_version}"/spark-"${spark_version}"-bin-hadoop"${hadoop_version}".tgz -o spark.tgz
tar -xf spark.tgz && mv spark-"${spark_version}"-bin-hadoop"${hadoop_version}"/* "${SPARK_HOME}"/
rm spark.tgz && rm -rf spark-"${spark_version}"-bin-hadoop"${hadoop_version}"
pip3 install pyspark=="${spark_version}"

# Add S3 jars
@@ -42,9 +38,6 @@ curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/"${aws_sdk
# Add spark-xml jar
curl https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/"${spark_xml_version}"/spark-xml_2.12-"${spark_xml_version}".jar -o "${SPARK_HOME}"/jars/spark-xml_2.12-"${spark_xml_version}".jar

# Set up master IP address and executor memory
cp "${SPARK_HOME}"/conf/spark-defaults.conf.template "${SPARK_HOME}"/conf/spark-defaults.conf

# Install required libraries for GeoPandas on Apple chip mac
apt-get install -y gdal-bin libgdal-dev

67 changes: 67 additions & 0 deletions docs/api/stats/sql.md
@@ -49,3 +49,70 @@ names in parentheses are python variable names
- geometry - name of the geometry column
- handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
- useSpheroid (use_spheroid) - whether to use a Cartesian or spheroidal distance calculation. Default is false

The output is the input DataFrame with the LOF (local outlier factor) value added to each row.
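
Below is a minimal Python sketch of an LOF call on toy data. The import path and the `k` parameter are assumptions (neither appears in this hunk); `handle_ties` and `use_spheroid` follow the Python names listed above.

```python
from pyspark.sql.functions import expr
from sedona.spark import SedonaContext

# Assumed module path; the hunk above only lists the parameters.
from sedona.stats.outlier_detection.local_outlier_factor import local_outlier_factor

# Assumes a Spark session already configured with the Sedona jars.
sedona = SedonaContext.create(SedonaContext.builder().getOrCreate())

# Toy input: 100 random points in a geometry column named "geometry".
points = sedona.range(100).withColumn(
    "geometry", expr("ST_Point(rand() * 10, rand() * 10)")
)

# k (number of neighbors) is an assumed parameter name; handle_ties and
# use_spheroid follow the Python names documented above.
result = local_outlier_factor(points, k=20, handle_ties=False, use_spheroid=False)
result.show()
```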

## Using Getis-Ord Gi(*)

The G Local function is provided at `org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal` in Scala/Java and `sedona.stats.hotspot_detection.getis_ord.g_local` in Python.

It performs the Gi or Gi* statistic on the x column of the dataframe.

The weights column should contain the neighbors of each row. Its members should be structs
containing a value column and a neighbor column. The neighbor column should hold the
neighbor's contents, with the same types as the parent row (minus the neighbors column).
See the _Using the Distance Weighting Function_ section for instructions on generating this
column. To calculate the Gi* statistic, ensure the focal observation is in the neighbors
array (i.e., the row itself appears in its weights column) and set `star=true`. Significance
is calculated with a z-score.

### Parameters

- dataframe - the dataframe to perform the G statistic on
- x - The column name we want to perform hotspot analysis on
- weights - The column name containing the neighbors array. Each neighbor entry should hold the neighbor's contents, with the same types as the parent row (minus the neighbors column). You can use the `Weighting` class functions to generate this column, as shown in the sketch below.
- star - Whether the focal observation is in the neighbors array. If true, this calculates Gi*; otherwise, Gi

The output is the input DataFrame with the following columns added: G, E[G], V[G], Z, P.
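
Below is a hedged Python sketch that ties the pieces together: build a binary distance-band weights column (see the next section), then compute Gi*. The toy data, column names, threshold value, and the snake_case name `add_binary_distance_band_column` are illustrative assumptions.

```python
from pyspark.sql.functions import expr, rand
from sedona.spark import SedonaContext
from sedona.stats.hotspot_detection.getis_ord import g_local

# Python-side name inferred from addBinaryDistanceBandColumn below.
from sedona.stats.weighting import add_binary_distance_band_column

# Assumes a Spark session already configured with the Sedona jars.
sedona = SedonaContext.create(SedonaContext.builder().getOrCreate())

# Toy input: random points plus a numeric column to test for hot spots.
df = (
    sedona.range(100)
    .withColumn("geometry", expr("ST_Point(rand() * 10, rand() * 10)"))
    .withColumn("my_value", rand())
)

# For Gi*, the focal row must appear in its own neighbor list, so include
# self in the weights and set star=True; drop both for plain Gi.
weighted = add_binary_distance_band_column(df, threshold=1.0, include_self=True)
result = g_local(weighted, x="my_value", weights="weights", star=True)
result.show()
```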

## Using the Distance Weighting Function

The Weighting functions are provided at `org.apache.sedona.stats.Weighting` in Scala/Java and `sedona.stats.weighting` in Python.

These functions generate a column containing an array of structs, each holding a value column and a neighbor column.

The generic `addDistanceBandColumn` (`add_distance_band_column` in Python) function annotates a dataframe with a weights column containing the other records within the threshold distance and their weights.

The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
geometry column is present, it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one is present and none is named 'geometry', the
column name must be provided. The new column will be named 'weights'.

### Parameters

#### addDistanceBandColumn

Names in parentheses are Python variable names

- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- binary - whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
- alpha - the exponent to use for inverse distance weights (dist^alpha); ignored when binary is true
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors at distance 0. If zero-distance neighbors are included and binary is false, their weights are infinity, per the floating-point spec (division by zero)
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a Cartesian or spheroidal distance calculation. Default is false

#### addBinaryDistanceBandColumn

Names in parentheses are Python variable names

- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors at distance 0; the divide-by-zero caveat above does not apply to binary weights
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a Cartesian or spheroidal distance calculation. Default is false

In both cases the output is the input DataFrame with the weights column added to each row.
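
As a sketch of the generic variant, the call below requests inverse-distance weights (`binary=False` with `alpha=-1.0`, i.e. weight = 1/dist). The toy data and threshold are assumptions; parameter names follow the list above.

```python
from pyspark.sql.functions import expr
from sedona.spark import SedonaContext
from sedona.stats.weighting import add_distance_band_column

# Assumes a Spark session already configured with the Sedona jars.
sedona = SedonaContext.create(SedonaContext.builder().getOrCreate())

# Toy input: 100 random points; range() provides an "id" column.
df = sedona.range(100).withColumn(
    "geometry", expr("ST_Point(rand() * 10, rand() * 10)")
)

weighted = add_distance_band_column(
    df,
    threshold=2.0,       # hypothetical distance threshold
    binary=False,        # inverse-distance weights instead of 0/1
    alpha=-1.0,          # weight = dist^alpha = 1/dist
    use_spheroid=False,  # Cartesian distance
)
weighted.show(truncate=False)
```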
2 changes: 1 addition & 1 deletion docs/community/contributor.md
@@ -40,7 +40,7 @@ The PMC regularly adds new committers from the active contributors, based on the

* Sustained contributions to Sedona: Committers should have a history of major contributions to Sedona.
* Quality of contributions: Committers more than any other community member should submit simple, well-tested, and well-designed patches. In addition, they should show sufficient expertise to be able to review patches.
* Community involvement: Committers should have a constructive and friendly attitude in all community interactions. They should also be active on the dev mailing list & Gitter, and help mentor newer contributors and users.
* Community involvement: Committers should have a constructive and friendly attitude in all community interactions. They should also be active on the dev mailing list & Discord, and help mentor newer contributors and users.

The PMC also adds new PMC members. PMC members are expected to carry out PMC responsibilities as described in Apache Guidance, including helping vote on releases, enforce Apache project trademarks, take responsibility for legal and license issues, and ensure the project follows Apache project mechanics. The PMC periodically adds committers to the PMC who have shown they understand and can help with these activities.
