sparkgeo: sparklyr extension package providing geospatial analytics capabilities

sparkgeo is a sparklyr extension package providing an integration with Magellan, a Spark package for geospatial analytics on big data.

Version Information

sparkgeo is under active development and has not been released yet to CRAN. You can install the latest version through

devtools::install_github("miraisolutions/sparkgeo", ref = "develop")

Example Usage

require(sparklyr)
require(sparkgeo)
require(dplyr)
require(tidyr)

# Download geojson file containing NYC neighborhood polygon information
# (http://data.beta.nyc/dataset/pediacities-nyc-neighborhoods)
geojson_file <- "nyc_neighborhoods.geojson"
download.file("https://goo.gl/eu1yWN", geojson_file)

# Start local Spark cluster
config <- spark_config()
sc <- spark_connect(master = "local", config = config)

# Register sparkgeo with Spark
sparkgeo_register(sc)

# Import data from GeoJSON file into a Spark DataFrame
neighborhoods <-
  spark_read_geojson(
    sc = sc,
    name = "neighborhoods",
    path = geojson_file
  ) %>%
  mutate(neighborhood = metadata_string(metadata, "neighborhood")) %>%
  select(neighborhood, polygon, index) %>%
  sdf_persist()

# Download and transform locations of New York City museums;
# see https://catalog.data.gov/dataset/new-york-city-museums
museums.df <- 
  read.csv("https://data.cityofnewyork.us/api/views/fn6f-htvy/rows.csv?accessType=DOWNLOAD",
           stringsAsFactors = FALSE) %>%
  extract(the_geom, into = c("longitude", "latitude"),
          regex = "POINT \\((-?\\d+\\.?\\d*) (-?\\d+\\.?\\d*)\\)") %>%
  mutate(longitude = as.numeric(longitude), latitude = as.numeric(latitude)) %>%
  rename(name = NAME) %>%
  select(name, latitude, longitude)
  
# Create Spark DataFrame from local R data.frame
museums <- copy_to(sc, museums.df, "museums")

# Perform a spatial join to associate museum coordinates to
# corresponding neighborhoods, then show the top 5 neighborhoods
# in terms of number of museums
museums %>%
  sdf_spatial_join(neighborhoods, latitude, longitude) %>%
  select(name, neighborhood) %>%
  group_by(neighborhood) %>%
  summarize(num_museums = n()) %>%
  top_n(5) %>%
  collect()

# Stop local Spark cluster
spark_disconnect(sc)

The last dplyr pipeline returns:

# A tibble: 5 x 2
        neighborhood num_museums
               <chr>       <dbl>
1    Upper East Side          13
2            Chelsea          11
3            Midtown           9
4               SoHo           8
5 Financial District           7

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
R		R
inst/java		inst/java
java		java
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
compile.R		compile.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

sparkgeo: sparklyr extension package providing geospatial analytics capabilities

Version Information

Example Usage

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

License

miraisolutions/sparkgeo

Folders and files

Latest commit

History

Repository files navigation

sparkgeo: sparklyr extension package providing geospatial analytics capabilities

Version Information

Example Usage

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages