#### Naive Bayes

The naive Bayes model assumes independence among the features. `spark.naiveBayes` fits a [Bernoulli naive Bayes model](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Bernoulli_naive_Bayes) against a `SparkDataFrame`. The data should be all categorical. These models are often used for document classification.
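As a quick sketch of the API (assuming an active SparkR session; R's built-in `Titanic` data is used purely for illustration, since all of its columns are categorical):

```{r}
# Convert the built-in Titanic contingency table into categorical rows
titanic <- as.data.frame(Titanic)
# Drop the Freq column (column 5) and keep rows with positive counts
titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
nbModel <- spark.naiveBayes(titanicDF, Survived ~ Class + Sex + Age)
summary(nbModel)
```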
#### Gaussian Mixture Model
`spark.gaussianMixture` fits a multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
We use a simulated example to demonstrate the usage.
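A minimal sketch (the cluster means and sample sizes below are arbitrary): we simulate points from two Gaussian clusters and fit a two-component mixture.

```{r}
# Simulate two clusters drawn from different bivariate Gaussians
X1 <- data.frame(V1 = rnorm(4), V2 = rnorm(4))
X2 <- data.frame(V1 = rnorm(6, 3), V2 = rnorm(6, 4))
df <- createDataFrame(rbind(X1, X2))
# Fit a two-component Gaussian mixture on columns V1 and V2
gmmModel <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
summary(gmmModel)
```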
#### Latent Dirichlet Allocation

`spark.lda` fits a [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) model on a `SparkDataFrame`. It is often used in topic modeling, in which topics are inferred from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:
* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
To use LDA, we need to specify a `features` column in `data` where each entry represents a document. There are two options for the column:

* character string: This can be a string of the whole document. It will be parsed automatically. Additional stop words can be added in `customizedStopWords`.
* libSVM: Each entry is a collection of words and will be processed directly.

There are several parameters that `spark.lda` takes for fitting the model.

* `k`: number of topics (default 10).
* `maxIter`: maximum number of iterations (default 20).
* `optimizer`: optimizer to train an LDA model, "online" (default) uses [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf); "em" uses [expectation-maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm).
* `subsamplingRate`: for `optimizer = "online"`, the fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1] (default 0.05).
* `topicConcentration`: concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms; default -1 to set automatically on the Spark side. Use `summary` to retrieve the effective topicConcentration. Only 1-size numeric is accepted.
* `docConcentration`: concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta); default -1 to set automatically on the Spark side. Use `summary` to retrieve the effective docConcentration. Only 1-size or k-size numeric is accepted.
* `maxVocabSize`: maximum vocabulary size, default 1 << 18.

Two more functions are provided for the fitted model.

* `spark.posterior` returns a `SparkDataFrame` containing a column of posterior probability vectors named "topicDistribution".
* `spark.perplexity` returns the log perplexity of the given data under the fitted model, or the log perplexity of the training data if `data` is missing.
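A minimal sketch of the workflow (the tiny corpus below is made up for illustration):

```{r}
# A toy corpus: each entry of the "features" column is one document
corpus <- createDataFrame(data.frame(features = c(
  "spark provides scalable machine learning",
  "lda infers topics from text documents",
  "topics correspond to cluster centers"),
  stringsAsFactors = FALSE))
ldaModel <- spark.lda(corpus, k = 2, maxIter = 10)
summary(ldaModel)
# Posterior topic distribution ("topicDistribution") for each document
posterior <- spark.posterior(ldaModel, corpus)
head(posterior)
```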
#### Multilayer Perceptron

Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network). MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights $w$ and bias $b$ and applying an activation function. This can be written in matrix form for MLPC with $K+1$ layers as follows:
$$y(x) = f_K(\ldots f_2(w_2^T f_1(w_1^T x + b_1) + b_2) \ldots + b_K).$$

Nodes in intermediate layers use the sigmoid (logistic) function $f(z_i) = \frac{1}{1 + e^{-z_i}}$, and nodes in the output layer use the softmax function $f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$. The number of nodes $N$ in the output layer corresponds to the number of classes.
MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.

`spark.mlp` requires at least two columns in `data`: one named `"label"` and the other one `"features"`. The `"features"` column should be in libSVM-format. There are several additional parameters that can be set:
* `layers`: integer vector containing the number of nodes for each layer.
* `seed`: seed parameter for weights initialization.
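The summary and prediction chunks below assume a fitted model. A sketch of the fitting step is shown first; the sample data path bundled with Spark and the hyper-parameter values are illustrative, and the `"label"` and `"features"` columns are read from `df` as described above.

```{r, warning=FALSE}
# Load a multiclass dataset stored in libSVM format (illustrative path)
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
# 4 input nodes, two hidden layers with 5 and 4 nodes, 3 output classes
model <- spark.mlp(df, layers = c(4, 5, 4, 3), maxIter = 100, seed = 1)
```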
To avoid lengthy display, we only present partial results of the model summary. You can check the full result in your SparkR shell.
```{r, include=FALSE}
ops <- options()
options(max.print=5)
```
```{r}
# check the summary of the fitted model
summary(model)
```
```{r, include=FALSE}
options(ops)
```
```{r}
# make predictions using the fitted model
predictions <- predict(model, df)
head(select(predictions, predictions$prediction))
```
#### Collaborative Filtering
`spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, and `nonnegative`. For a complete list, refer to the help file.
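As a small sketch (the ratings below are made up), we fit a model on (user, item, rating) triples and predict ratings back on the training data:

```{r}
# Toy explicit-feedback ratings: (user, item, rating)
ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0),
                list(1, 2, 4.0), list(2, 1, 1.0), list(2, 2, 5.0))
df <- createDataFrame(ratings, c("user", "item", "rating"))
model <- spark.als(df, "rating", "user", "item", rank = 10, reg = 0.1, nonnegative = TRUE)
predicted <- predict(model, df)
head(predicted)
```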
#### Isotonic Regression Model
`spark.isoreg` fits an [Isotonic Regression](https://en.wikipedia.org/wiki/Isotonic_regression) model against a `SparkDataFrame`. It solves a weighted univariate regression problem under a complete order constraint. Specifically, given a set of real observed responses $y_1, \ldots, y_n$, corresponding real features $x_1, \ldots, x_n$, and optionally positive weights $w_1, \ldots, w_n$, we want to find a monotone (piecewise linear) function $f$ to minimize
$$\ell(f) = \sum_{i=1}^n w_i (y_i - f(x_i))^2.$$
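A small sketch with made-up data, fitting a weighted isotonic fit and predicting on the training points:

```{r}
# Made-up one-dimensional data with unit weights
data <- data.frame(y = c(3.0, 6.0, 8.0, 5.0, 7.0),
                   x = c(1.0, 2.0, 3.5, 3.0, 4.0),
                   w = c(1.0, 1.0, 1.0, 1.0, 1.0))
df <- createDataFrame(data)
isoregModel <- spark.isoreg(df, y ~ x, weightCol = "w")
head(predict(isoregModel, df))
```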
We also expect Decision Tree and Random Forest to come in the next version, 2.1.0.
#### Logistic Regression Model
[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
It supports both binary and multiclass classification with elastic-net regularization and feature standardization, similar to `glmnet`.
We use a simple example to demonstrate `spark.logit` usage. In general, there are three steps of using `spark.logit`: 1) create a DataFrame from a proper data source; 2) fit a logistic regression model using `spark.logit` with a proper parameter setting; and 3) obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`.
Binomial logistic regression
```{r, warning=FALSE}
df <- createDataFrame(iris)
# Create a DataFrame containing two classes
training <- df[df$Species %in% c("versicolor", "virginica"), ]
model <- spark.logit(training, Species ~ ., regParam = 0.00042)
summary(model)
```
Predict values on training data
```{r}
fitted <- predict(model, training)
```
Multinomial logistic regression against three classes
```{r, warning=FALSE}
df <- createDataFrame(iris)
# Note in this case, Spark infers it is multinomial logistic regression, so family = "multinomial" is optional.
model <- spark.logit(df, Species ~ ., regParam = 0.056)
summary(model)
```
#### Kolmogorov-Smirnov Test
`spark.kstest` runs a two-sided, one-sample [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). Given a `SparkDataFrame`, the test compares continuous data in a given column `testCol` with the theoretical distribution specified by the parameter `nullHypothesis`. Users can call `summary` to get a summary of the test results.
In the following example, we test whether the `longley` dataset's `Armed_Forces` column follows a normal distribution. We set the parameters of the normal distribution using the mean and standard deviation of the sample.
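A sketch of the corresponding code, computing the sample statistics with SparkR's `mean` and `sd` column functions:

```{r, warning=FALSE}
df <- createDataFrame(longley)
# Sample mean and standard deviation of the Armed_Forces column
afStats <- head(select(df, mean(df$Armed_Forces), sd(df$Armed_Forces)))
afMean <- afStats[1, 1]
afStd <- afStats[1, 2]
# Test H0: Armed_Forces follows Normal(afMean, afStd)
kSTest <- spark.kstest(df, "Armed_Forces", "norm", c(afMean, afStd))
summary(kSTest)
```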