Skip to content

[SparkR] [Doc] SparkR ML user guides update for 2.0 #14011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 5 additions & 3 deletions R/pkg/R/mllib.R
Original file line number Diff line number Diff line change
Expand Up @@ -55,8 +55,9 @@ setClass("KMeansModel", representation(jobj = "jobj"))

#' Generalized Linear Models
#'
#' Fits generalized linear model against a Spark DataFrame. Users can print, make predictions on the
#' produced model and save the model to the input path.
#' Fits generalized linear model against a Spark DataFrame.
#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
#'
#' @param data SparkDataFrame for training.
#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
Expand Down Expand Up @@ -270,7 +271,8 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
#' K-Means Clustering Model
#'
#' Fits a k-means clustering model against a Spark DataFrame, similarly to R's kmeans().
#' Users can print, make predictions on the produced model and save the model to the input path.
#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
#'
#' @param data SparkDataFrame for training
#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
Expand Down
43 changes: 25 additions & 18 deletions docs/sparkr.md
Original file line number Diff line number Diff line change
Expand Up @@ -355,32 +355,39 @@ head(teenagers)

# Machine Learning

SparkR supports the following Machine Learning algorithms.
SparkR supports the following machine learning algorithms currently: `Generalized Linear Model`, `Accelerated Failure Time (AFT) Survival Regression Model`, `Naive Bayes Model` and `KMeans Model`.
Under the hood, SparkR uses MLlib to train the model.
Users can call `summary` to print a summary of the fitted model, [predict](api/R/predict.html) to make predictions on new data, and [write.ml](api/R/write.ml.html)/[read.ml](api/R/read.ml.html) to save/load fitted models.
SparkR supports a subset of the available R formula operators for model fitting, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘.

* Generalized Linear Regression Model [spark.glm()](api/R/spark.glm.html)
* Naive Bayes [spark.naiveBayes()](api/R/spark.naiveBayes.html)
* KMeans [spark.kmeans()](api/R/spark.kmeans.html)
* AFT Survival Regression [spark.survreg()](api/R/spark.survreg.html)
## Algorithms

[Generalized Linear Regression](api/R/spark.glm.html) can be used to train a model from a specified family. Currently the Gaussian, Binomial, Poisson and Gamma families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.
### Generalized Linear Model

The [summary()](api/R/summary.html) function gives the summary of a model produced by different algorithms listed above.
It produces the similar result compared with R summary function.
[spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame.
Currently "gaussian", "binomial", "poisson" and "gamma" families are supported.
{% include_example glm r/ml.R %}

## Model persistence
### Accelerated Failure Time (AFT) Survival Regression Model

[spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame.
Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently.
{% include_example survreg r/ml.R %}

### Naive Bayes Model

* [write.ml](api/R/write.ml.html) allows users to save a fitted model in a given input path
* [read.ml](api/R/read.ml.html) allows users to read/load the model which was saved using write.ml in a given path
[spark.naiveBayes()](api/R/spark.naiveBayes.html) fits a Bernoulli naive Bayes model against a SparkDataFrame. Only categorical data is supported.
{% include_example naiveBayes r/ml.R %}

Model persistence is supported for all Machine Learning algorithms for all families.
### KMeans Model

The examples below show how to build several models:
* GLM using the Gaussian and Binomial model families
* AFT survival regression model
* Naive Bayes model
* K-Means model
[spark.kmeans()](api/R/spark.kmeans.html) fits a k-means clustering model against a Spark DataFrame, similarly to R's kmeans().
{% include_example kmeans r/ml.R %}

## Model persistence

{% include_example r/ml.R %}
The following example shows how to save/load a MLlib model by SparkR.
{% include_example read_write r/ml.R %}

# R Function Name Conflicts

Expand Down
22 changes: 11 additions & 11 deletions examples/src/main/r/ml.R
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,8 @@ library(SparkR)
# Initialize SparkSession
sparkR.session(appName="SparkR-ML-example")

# $example on$
############################ spark.glm and glm ##############################################

# $example on:glm$
irisDF <- suppressWarnings(createDataFrame(iris))
# Fit a generalized linear model of family "gaussian" with spark.glm
gaussianDF <- irisDF
Expand Down Expand Up @@ -55,8 +54,9 @@ summary(binomialGLM)
# Prediction
binomialPredictions <- predict(binomialGLM, binomialTestDF)
showDF(binomialPredictions)

# $example off:glm$
############################ spark.survreg ##############################################
# $example on:survreg$
# Use the ovarian dataset available in R survival package
library(survival)

Expand All @@ -72,9 +72,9 @@ summary(aftModel)
# Prediction
aftPredictions <- predict(aftModel, aftTestDF)
showDF(aftPredictions)

# $example off:survreg$
############################ spark.naiveBayes ##############################################

# $example on:naiveBayes$
# Fit a Bernoulli naive Bayes model with spark.naiveBayes
titanic <- as.data.frame(Titanic)
titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
Expand All @@ -88,9 +88,9 @@ summary(nbModel)
# Prediction
nbPredictions <- predict(nbModel, nbTestDF)
showDF(nbPredictions)

# $example off:naiveBayes$
############################ spark.kmeans ##############################################

# $example on:kmeans$
# Fit a k-means model with spark.kmeans
irisDF <- suppressWarnings(createDataFrame(iris))
kmeansDF <- irisDF
Expand All @@ -107,9 +107,9 @@ showDF(fitted(kmeansModel))
# Prediction
kmeansPredictions <- predict(kmeansModel, kmeansTestDF)
showDF(kmeansPredictions)

# $example off:kmeans$
############################ model read/write ##############################################

# $example on:read_write$
irisDF <- suppressWarnings(createDataFrame(iris))
# Fit a generalized linear model of family "gaussian" with spark.glm
gaussianDF <- irisDF
Expand All @@ -120,7 +120,7 @@ gaussianGLM <- spark.glm(gaussianDF, Sepal_Length ~ Sepal_Width + Species, famil
modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
write.ml(gaussianGLM, modelPath)
gaussianGLM2 <- read.ml(modelPath)
# $example off$

# Check model summary
summary(gaussianGLM2)

Expand All @@ -129,7 +129,7 @@ gaussianPredictions <- predict(gaussianGLM2, gaussianTestDF)
showDF(gaussianPredictions)

unlink(modelPath)

# $example off:read_write$
############################ fit models with spark.lapply #####################################

# Perform distributed training of multiple models with spark.lapply
Expand Down