Merge branch 'apache:master' into master
jiayuasu authored Nov 5, 2024
2 parents 3e4d8e2 + fcea411 commit 649cafc
Showing 37 changed files with 1,611 additions and 41 deletions.
16 changes: 16 additions & 0 deletions .github/workflows/license-templates/LICENSE.txt
@@ -0,0 +1,16 @@
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
17 changes: 12 additions & 5 deletions .pre-commit-config.yaml
@@ -10,15 +10,22 @@ repos:
hooks:
- id: identity
- id: check-hooks-apply
- repo: https://github.com/Lucas-C/pre-commit-hooks
rev: v1.5.5
hooks:
- id: insert-license
name: Add license for all TOML files
files: \.toml$
args:
- --comment-style
- "|#|"
- --license-filepath
- .github/workflows/license-templates/LICENSE.txt
- --fuzzy-match-generates-todo
- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.10.0
hooks:
- id: black-jupyter
# - repo: https://github.com/pycqa/isort
# rev: 5.13.2
# hooks:
# - id: isort
# name: isort (python)
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v19.1.1
hooks:
4 changes: 2 additions & 2 deletions R/vignettes/articles/apache-sedona.Rmd
@@ -296,7 +296,7 @@ file in a supported geospatial format (`sedona_read_*` functions), or by extract
Spark SQL query.

For example, the following code will import data from
[arealm-small.csv](https://github.com/apache/sedona/blob/master/binder/data/arealm-small.csv)
[arealm-small.csv](https://github.com/apache/sedona/blob/master/docs/usecases/data/arealm-small.csv)
into a `SpatialRDD`:

```{r}
@@ -311,7 +311,7 @@ pt_rdd <- sedona_read_dsv_to_typed_rdd(
```

Records from the example
[arealm-small.csv](https://github.com/apache/sedona/blob/master/binder/data/arealm-small.csv)
[arealm-small.csv](https://github.com/apache/sedona/blob/master/docs/usecases/data/arealm-small.csv)
file look like the following:

testattribute0,-88.331492,32.324142,testattribute1,testattribute2
16 changes: 9 additions & 7 deletions README.md
@@ -34,25 +34,26 @@ Join the Sedona monthly community office hour: [Google Calendar](https://calenda

## What is Apache Sedona?

Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as Apache Spark and Apache Flink. Sedona developers can express their spatial data processing tasks in Spatial SQL, Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.
Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as [Apache Spark](https://spark.apache.org/) and [Apache Flink](https://flink.apache.org/).
Sedona developers can express their spatial data processing tasks in [Spatial SQL](https://carto.com/spatial-sql), Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.

![Sedona Ecosystem](docs/image/sedona-ecosystem.png "Sedona Ecosystem")

### Features

Some of the key features of Apache Sedona include:

* Support for a wide range of geospatial data formats, including GeoJSON, WKT, and ESRI Shapefile.
* Support for a wide range of geospatial data formats, including [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON), [WKT](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry), and [ESRI](https://www.esri.com) [Shapefile](https://en.wikipedia.org/wiki/Shapefile).
* Scalable distributed processing of large vector and raster datasets.
* Tools for spatial indexing, spatial querying, and spatial join operations.
* Integration with popular geospatial python tools such as GeoPandas.
* Integration with popular big data tools, such as Spark, Hadoop, Hive, and Flink for data storage and querying.
* A user-friendly API for working with geospatial data in the SQL, Python, Scala and Java languages.
* Integration with popular geospatial Python tools such as [GeoPandas](https://geopandas.org).
* Integration with popular big data tools, such as Spark, [Hadoop](https://hadoop.apache.org/), [Hive](https://hive.apache.org/), and Flink for data storage and querying.
* A user-friendly API for working with geospatial data in the [SQL](https://en.wikipedia.org/wiki/SQL), [Python](https://www.python.org/), [Scala](https://www.scala-lang.org/) and [Java](https://www.java.com) languages.
* Flexible deployment options, including standalone, local, and cluster modes.

These are some of the key features of Apache Sedona, but it may offer additional capabilities depending on the specific version and configuration.

Click [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/apache/sedona/HEAD?filepath=docs/usecases) and play the interactive Sedona Python Jupyter Notebook immediately!
Click [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/apache/sedona/HEAD?filepath=docs/usecases) and play the interactive Sedona Python [Jupyter](https://jupyter.org/) Notebook immediately!

## When to use Sedona?

@@ -150,5 +151,6 @@ Please visit [Apache Sedona website](http://sedona.apache.org/) for detailed inf
## Powered by

<a href="https://www.apache.org/">
<img alt="The Apache Software Foundation" src="https://www.apache.org/foundation/press/kit/asf_logo_wide.png" width="500" class="center">
<img alt="The Apache Software Foundation" class="center" src="https://www.apache.org/foundation/press/kit/asf_logo_wide.png"
title="The Apache Software Foundation" width="500">
</a>
2 changes: 1 addition & 1 deletion docker/sedona-spark-jupyterlab/requirements.txt
@@ -1,6 +1,6 @@
attrs
descartes
fiona==1.8.22
fiona==1.10.1
geopandas==0.14.4
ipykernel
ipywidgets
6 changes: 2 additions & 4 deletions docker/sedona-spark-jupyterlab/sedona-jupyterlab.dockerfile
@@ -19,7 +19,6 @@ FROM ubuntu:22.04

ARG shared_workspace=/opt/workspace
ARG spark_version=3.4.1
ARG hadoop_version=3
ARG hadoop_s3_version=3.3.4
ARG aws_sdk_version=1.12.402
ARG spark_xml_version=0.16.0
@@ -29,8 +28,7 @@ ARG spark_extension_version=2.11.0

# Set up envs
ENV SHARED_WORKSPACE=${shared_workspace}
ENV SPARK_HOME /opt/spark
RUN mkdir ${SPARK_HOME}
ENV SPARK_HOME /usr/local/lib/python3.10/dist-packages/pyspark
ENV SEDONA_HOME /opt/sedona
RUN mkdir ${SEDONA_HOME}

@@ -44,7 +42,7 @@ COPY ./ ${SEDONA_HOME}/

RUN chmod +x ${SEDONA_HOME}/docker/spark.sh
RUN chmod +x ${SEDONA_HOME}/docker/sedona.sh
RUN ${SEDONA_HOME}/docker/spark.sh ${spark_version} ${hadoop_version} ${hadoop_s3_version} ${aws_sdk_version} ${spark_xml_version}
RUN ${SEDONA_HOME}/docker/spark.sh ${spark_version} ${hadoop_s3_version} ${aws_sdk_version} ${spark_xml_version}

# Install Python dependencies
COPY docker/sedona-spark-jupyterlab/requirements.txt /opt/requirements.txt
13 changes: 3 additions & 10 deletions docker/spark.sh
@@ -19,20 +19,16 @@ set -e

# Define variables
spark_version=$1
hadoop_version=$2
hadoop_s3_version=$3
aws_sdk_version=$4
spark_xml_version=$5
hadoop_s3_version=$2
aws_sdk_version=$3
spark_xml_version=$4

# Set up OS libraries
apt-get update
apt-get install -y openjdk-19-jdk-headless curl python3-pip maven
pip3 install --upgrade pip && pip3 install pipenv

# Download Spark jar and set up PySpark
curl https://archive.apache.org/dist/spark/spark-"${spark_version}"/spark-"${spark_version}"-bin-hadoop"${hadoop_version}".tgz -o spark.tgz
tar -xf spark.tgz && mv spark-"${spark_version}"-bin-hadoop"${hadoop_version}"/* "${SPARK_HOME}"/
rm spark.tgz && rm -rf spark-"${spark_version}"-bin-hadoop"${hadoop_version}"
pip3 install pyspark=="${spark_version}"

# Add S3 jars
@@ -42,9 +38,6 @@ curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/"${aws_sdk
# Add spark-xml jar
curl https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/"${spark_xml_version}"/spark-xml_2.12-"${spark_xml_version}".jar -o "${SPARK_HOME}"/jars/spark-xml_2.12-"${spark_xml_version}".jar

# Set up master IP address and executor memory
cp "${SPARK_HOME}"/conf/spark-defaults.conf.template "${SPARK_HOME}"/conf/spark-defaults.conf

# Install required libraries for GeoPandas on Apple chip mac
apt-get install -y gdal-bin libgdal-dev

67 changes: 67 additions & 0 deletions docs/api/stats/sql.md
@@ -49,3 +49,70 @@ names in parentheses are python variable names
- geometry - name of the geometry column
- handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
- useSpheroid (use_spheroid) - whether to use a Cartesian or spheroidal distance calculation. Default is false

The output is the input DataFrame with the LOF (local outlier factor) value added to each row.
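
Below is a minimal Python sketch of an LOF call on toy data. The import path and the `k` parameter are assumptions (neither appears in this hunk); `handle_ties` and `use_spheroid` follow the Python names listed above.

```python
from pyspark.sql.functions import expr
from sedona.spark import SedonaContext

# Assumed module path; the hunk above only lists the parameters.
from sedona.stats.outlier_detection.local_outlier_factor import local_outlier_factor

# Assumes a Spark session already configured with the Sedona jars.
sedona = SedonaContext.create(SedonaContext.builder().getOrCreate())

# Toy input: 100 random points in a geometry column named "geometry".
points = sedona.range(100).withColumn(
    "geometry", expr("ST_Point(rand() * 10, rand() * 10)")
)

# k (number of neighbors) is an assumed parameter name; handle_ties and
# use_spheroid follow the Python names documented above.
result = local_outlier_factor(points, k=20, handle_ties=False, use_spheroid=False)
result.show()
```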

## Using Getis-Ord Gi(*)

The G Local function is provided at `org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal` in Scala/Java and `sedona.stats.hotspot_detection.getis_ord.g_local` in Python.

It performs the Gi or Gi* statistic on the x column of the dataframe.

The weights column should contain the neighbors of each row. Its members should be structs
containing a value column and a neighbor column. The neighbor column should hold the
neighbor's contents, with the same types as the parent row (minus the neighbors column).
See the _Using the Distance Weighting Function_ section for instructions on generating this
column. To calculate the Gi* statistic, ensure the focal observation is in the neighbors
array (i.e., the row itself appears in its weights column) and set `star=true`. Significance
is calculated with a z-score.

### Parameters

- dataframe - the dataframe to perform the G statistic on
- x - The column name we want to perform hotspot analysis on
- weights - The column name containing the neighbors array. Each neighbor entry should hold the neighbor's contents, with the same types as the parent row (minus the neighbors column). You can use the `Weighting` class functions to generate this column, as shown in the sketch below.
- star - Whether the focal observation is in the neighbors array. If true, this calculates Gi*; otherwise, Gi

The output is the input DataFrame with the following columns added: G, E[G], V[G], Z, P.
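
Below is a hedged Python sketch that ties the pieces together: build a binary distance-band weights column (see the next section), then compute Gi*. The toy data, column names, threshold value, and the snake_case name `add_binary_distance_band_column` are illustrative assumptions.

```python
from pyspark.sql.functions import expr, rand
from sedona.spark import SedonaContext
from sedona.stats.hotspot_detection.getis_ord import g_local

# Python-side name inferred from addBinaryDistanceBandColumn below.
from sedona.stats.weighting import add_binary_distance_band_column

# Assumes a Spark session already configured with the Sedona jars.
sedona = SedonaContext.create(SedonaContext.builder().getOrCreate())

# Toy input: random points plus a numeric column to test for hot spots.
df = (
    sedona.range(100)
    .withColumn("geometry", expr("ST_Point(rand() * 10, rand() * 10)"))
    .withColumn("my_value", rand())
)

# For Gi*, the focal row must appear in its own neighbor list, so include
# self in the weights and set star=True; drop both for plain Gi.
weighted = add_binary_distance_band_column(df, threshold=1.0, include_self=True)
result = g_local(weighted, x="my_value", weights="weights", star=True)
result.show()
```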

## Using the Distance Weighting Function

The Weighting functions are provided at `org.apache.sedona.stats.Weighting` in Scala/Java and `sedona.stats.weighting` in Python.

These functions generate a column containing an array of structs, each holding a value column and a neighbor column.

The generic `addDistanceBandColumn` (`add_distance_band_column` in Python) function annotates a dataframe with a weights column containing the other records within the threshold distance and their weights.

The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
geometry column is present, it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one is present and none is named 'geometry', the
column name must be provided. The new column will be named 'weights'.

### Parameters

#### addDistanceBandColumn

Names in parentheses are Python variable names

- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- binary - whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
- alpha - the exponent to use for inverse distance weights (dist^alpha); ignored when binary is true
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors at distance 0. If zero-distance neighbors are included and binary is false, their weights are infinity, per the floating-point spec (division by zero)
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a Cartesian or spheroidal distance calculation. Default is false

#### addBinaryDistanceBandColumn

Names in parentheses are Python variable names

- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors at distance 0; the divide-by-zero caveat above does not apply to binary weights
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a Cartesian or spheroidal distance calculation. Default is false

In both cases the output is the input DataFrame with the weights column added to each row.
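
As a sketch of the generic variant, the call below requests inverse-distance weights (`binary=False` with `alpha=-1.0`, i.e. weight = 1/dist). The toy data and threshold are assumptions; parameter names follow the list above.

```python
from pyspark.sql.functions import expr
from sedona.spark import SedonaContext
from sedona.stats.weighting import add_distance_band_column

# Assumes a Spark session already configured with the Sedona jars.
sedona = SedonaContext.create(SedonaContext.builder().getOrCreate())

# Toy input: 100 random points; range() provides an "id" column.
df = sedona.range(100).withColumn(
    "geometry", expr("ST_Point(rand() * 10, rand() * 10)")
)

weighted = add_distance_band_column(
    df,
    threshold=2.0,       # hypothetical distance threshold
    binary=False,        # inverse-distance weights instead of 0/1
    alpha=-1.0,          # weight = dist^alpha = 1/dist
    use_spheroid=False,  # Cartesian distance
)
weighted.show(truncate=False)
```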
2 changes: 1 addition & 1 deletion docs/community/contributor.md
@@ -40,7 +40,7 @@ The PMC regularly adds new committers from the active contributors, based on the

* Sustained contributions to Sedona: Committers should have a history of major contributions to Sedona.
* Quality of contributions: Committers more than any other community member should submit simple, well-tested, and well-designed patches. In addition, they should show sufficient expertise to be able to review patches.
* Community involvement: Committers should have a constructive and friendly attitude in all community interactions. They should also be active on the dev mailing list & Gitter, and help mentor newer contributors and users.
* Community involvement: Committers should have a constructive and friendly attitude in all community interactions. They should also be active on the dev mailing list & Discord, and help mentor newer contributors and users.

The PMC also adds new PMC members. PMC members are expected to carry out PMC responsibilities as described in Apache Guidance, including helping vote on releases, enforce Apache project trademarks, take responsibility for legal and license issues, and ensure the project follows Apache project mechanics. The PMC periodically adds committers to the PMC who have shown they understand and can help with these activities.
