Commit ee4aa1d

Merge branch 'master' into SPARK-23429.2

2 parents 2897281 + da6fa38
576 files changed: +15451 additions, -6397 deletions


.gitignore (1 addition, 0 deletions)

@@ -77,6 +77,7 @@ target/
 unit-tests.log
 work/
 docs/.jekyll-metadata
+*.crc
 
 # For Hive
 TempStatsStore/

LICENSE-binary (0 additions, 2 deletions)

@@ -228,7 +228,6 @@ org.apache.xbean:xbean-asm5-shaded
 com.squareup.okhttp3:logging-interceptor
 com.squareup.okhttp3:okhttp
 com.squareup.okio:okio
-net.java.dev.jets3t:jets3t
 org.apache.spark:spark-catalyst_2.11
 org.apache.spark:spark-kvstore_2.11
 org.apache.spark:spark-launcher_2.11
@@ -447,7 +446,6 @@ org.slf4j:jul-to-slf4j
 org.slf4j:slf4j-api
 org.slf4j:slf4j-log4j12
 com.github.scopt:scopt_2.11
-org.bouncycastle:bcprov-jdk15on
 
 core/src/main/resources/org/apache/spark/ui/static/dagre-d3.min.js
 core/src/main/resources/org/apache/spark/ui/static/*dataTables*

NOTICE (0 additions, 2 deletions)

@@ -26,5 +26,3 @@ The following provides more details on the included cryptographic software:
 This software uses Apache Commons Crypto (https://commons.apache.org/proper/commons-crypto/) to
 support authentication, and encryption and decryption of data sent across the network between
 services.
-
-This software includes Bouncy Castle (http://bouncycastle.org/) to support the jets3t library.

NOTICE-binary (0 additions, 21 deletions)

@@ -27,8 +27,6 @@ This software uses Apache Commons Crypto (https://commons.apache.org/proper/comm
 support authentication, and encryption and decryption of data sent across the network between
 services.
 
-This software includes Bouncy Castle (http://bouncycastle.org/) to support the jets3t library.
-
 
 // ------------------------------------------------------------------
 // NOTICE file corresponding to the section 4d of The Apache License,
@@ -1162,25 +1160,6 @@ NonlinearMinimizer class in package breeze.optimize.proximal is distributed with
 2015, Debasish Das (Verizon), all rights reserved.
 
 
-=========================================================================
-== NOTICE file corresponding to section 4(d) of the Apache License, ==
-== Version 2.0, in this case for the distribution of jets3t. ==
-=========================================================================
-
-This product includes software developed by:
-
-The Apache Software Foundation (http://www.apache.org/).
-
-The ExoLab Project (http://www.exolab.org/)
-
-Sun Microsystems (http://www.sun.com/)
-
-Codehaus (http://castor.codehaus.org)
-
-Tatu Saloranta (http://wiki.fasterxml.com/TatuSaloranta)
-
-
-
 stream-lib
 Copyright 2016 AddThis
 

R/README.md (1 addition, 1 deletion)

@@ -17,7 +17,7 @@ export R_HOME=/home/username/R
 
 #### Build Spark
 
-Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example to use the default Hadoop versions you can run
+Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#buildmvn) and include the `-Psparkr` profile to build the R package. For example to use the default Hadoop versions you can run
 
 ```bash
 build/mvn -DskipTests -Psparkr package

R/WINDOWS.md (1 addition, 1 deletion)

@@ -14,7 +14,7 @@ directory in Maven in `PATH`.
 
 4. Set `MAVEN_OPTS` as described in [Building Spark](http://spark.apache.org/docs/latest/building-spark.html).
 
-5. Open a command shell (`cmd`) in the Spark directory and build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example to use the default Hadoop versions you can run
+5. Open a command shell (`cmd`) in the Spark directory and build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#buildmvn) and include the `-Psparkr` profile to build the R package. For example to use the default Hadoop versions you can run
 
 ```bash
 mvn.cmd -DskipTests -Psparkr package

R/pkg/NAMESPACE (6 additions, 0 deletions)

@@ -117,6 +117,7 @@ exportMethods("arrange",
               "dropna",
               "dtypes",
               "except",
+              "exceptAll",
               "explain",
               "fillna",
               "filter",
@@ -131,6 +132,7 @@ exportMethods("arrange",
               "hint",
               "insertInto",
               "intersect",
+              "intersectAll",
               "isLocal",
               "isStreaming",
               "join",
@@ -202,6 +204,8 @@ exportMethods("%<=>%",
               "approxQuantile",
               "array_contains",
               "array_distinct",
+              "array_except",
+              "array_intersect",
               "array_join",
               "array_max",
               "array_min",
@@ -210,6 +214,7 @@ exportMethods("%<=>%",
               "array_repeat",
               "array_sort",
               "arrays_overlap",
+              "array_union",
               "arrays_zip",
               "asc",
               "ascii",
@@ -353,6 +358,7 @@ exportMethods("%<=>%",
               "shiftLeft",
               "shiftRight",
               "shiftRightUnsigned",
+              "shuffle",
               "sd",
               "sign",
               "signum",

R/pkg/R/DataFrame.R (58 additions, 1 deletion)

@@ -2848,6 +2848,35 @@ setMethod("intersect",
             dataFrame(intersected)
           })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @note intersectAll since 2.4.0
+setMethod("intersectAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            intersected <- callJMethod(x@sdf, "intersectAll", y@sdf)
+            dataFrame(intersected)
+          })
+
 #' except
 #'
 #' Return a new SparkDataFrame containing rows in this SparkDataFrame
@@ -2867,7 +2896,6 @@ setMethod("intersect",
 #' df2 <- read.json(path2)
 #' exceptDF <- except(df, df2)
 #' }
-#' @rdname except
 #' @note except since 1.4.0
 setMethod("except",
           signature(x = "SparkDataFrame", y = "SparkDataFrame"),
@@ -2876,6 +2904,35 @@ setMethod("except",
             dataFrame(excepted)
           })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @note exceptAll since 2.4.0
+setMethod("exceptAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            excepted <- callJMethod(x@sdf, "exceptAll", y@sdf)
+            dataFrame(excepted)
+          })
+
 #' Save the contents of SparkDataFrame to a data source.
 #'
 #' The data source is specified by the \code{source} and a set of options (...).
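The difference between the new `exceptAll`/`intersectAll` methods and the existing `except`/`intersect` is multiset semantics: duplicates are preserved rather than removed, matching SQL's `EXCEPT ALL` and `INTERSECT ALL`. As a plain-Python illustration of those semantics (not the Spark implementation; the row tuples are made-up stand-ins for DataFrame rows):

```python
def except_all(left, right):
    """EXCEPT ALL semantics: each row in `left` survives only as many
    times as it exceeds its multiplicity in `right`."""
    counts = {}
    for row in right:
        counts[row] = counts.get(row, 0) + 1
    out = []
    for row in left:
        if counts.get(row, 0) > 0:
            counts[row] -= 1  # cancelled by one matching row on the right
        else:
            out.append(row)
    return out

def intersect_all(left, right):
    """INTERSECT ALL semantics: each row appears min(count in left,
    count in right) times."""
    counts = {}
    for row in right:
        counts[row] = counts.get(row, 0) + 1
    out = []
    for row in left:
        if counts.get(row, 0) > 0:
            counts[row] -= 1
            out.append(row)
    return out

left = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
right = [("a", 1), ("b", 2), ("b", 2)]
print(except_all(left, right))     # [('a', 1), ('c', 3)]
print(intersect_all(left, right))  # [('a', 1), ('b', 2)]
```

Note that one of the two `("a", 1)` duplicates survives `except_all`, whereas a deduplicating `except` would drop both.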

R/pkg/R/context.R (4 additions, 5 deletions)

@@ -138,11 +138,10 @@ parallelize <- function(sc, coll, numSlices = 1) {
 
   sizeLimit <- getMaxAllocationLimit(sc)
   objectSize <- object.size(coll)
+  len <- length(coll)
 
   # For large objects we make sure the size of each slice is also smaller than sizeLimit
-  numSerializedSlices <- max(numSlices, ceiling(objectSize / sizeLimit))
-  if (numSerializedSlices > length(coll))
-    numSerializedSlices <- length(coll)
+  numSerializedSlices <- min(len, max(numSlices, ceiling(objectSize / sizeLimit)))
 
   # Generate the slice ids to put each row
   # For instance, for numSerializedSlices of 22, length of 50
@@ -153,8 +152,8 @@ parallelize <- function(sc, coll, numSlices = 1) {
   splits <- if (numSerializedSlices > 0) {
     unlist(lapply(0: (numSerializedSlices - 1), function(x) {
       # nolint start
-      start <- trunc((x * length(coll)) / numSerializedSlices)
-      end <- trunc(((x + 1) * length(coll)) / numSerializedSlices)
+      start <- trunc((as.numeric(x) * len) / numSerializedSlices)
+      end <- trunc(((as.numeric(x) + 1) * len) / numSerializedSlices)
       # nolint end
       rep(start, end - start)
     }))
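The revised `parallelize` logic does two things: it clamps the slice count to the collection length in a single `min(len, max(...))` expression, and it casts `x` to double (`as.numeric`) so that `x * len` cannot overflow R's 32-bit integers for large collections. A plain-Python sketch of the same arithmetic (the object-size and limit values below are made-up; Python's arbitrary-precision integers make the overflow cast unnecessary here):

```python
import math

def num_serialized_slices(num_slices, length, object_size, size_limit):
    # Single-expression clamp replacing the old max()-then-if pattern:
    # at least num_slices, enough slices that each fits in size_limit,
    # but never more slices than there are elements
    return min(length, max(num_slices, math.ceil(object_size / size_limit)))

def slice_ids(length, n_slices):
    # Mirror of the R code: slice x covers indices
    # [trunc(x * len / n), trunc((x + 1) * len / n)), and each element
    # is tagged with its slice's start offset (rep(start, end - start))
    ids = []
    for x in range(n_slices):
        start = (x * length) // n_slices
        end = ((x + 1) * length) // n_slices
        ids.extend([start] * (end - start))
    return ids

# The example from the R comment: numSerializedSlices of 22, length of 50.
# Every element gets a tag and there are 22 distinct slices.
ids = slice_ids(50, 22)
print(len(ids), len(set(ids)))  # 50 22
```

The boundaries partition the collection exactly: the last slice's `end` is `trunc(n * len / n) = len`, so no element is dropped or duplicated regardless of how `len` divides by the slice count.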

R/pkg/R/functions.R (60 additions, 3 deletions)

@@ -208,7 +208,7 @@ NULL
 #' # Dataframe used throughout this doc
 #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
 #' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
-#' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1)))
+#' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1), shuffle(tmp$v1)))
 #' head(select(tmp, array_max(tmp$v1), array_min(tmp$v1), array_distinct(tmp$v1)))
 #' head(select(tmp, array_position(tmp$v1, 21), array_repeat(df$mpg, 3), array_sort(tmp$v1)))
 #' head(select(tmp, flatten(tmp$v1), reverse(tmp$v1), array_remove(tmp$v1, 21)))
@@ -223,6 +223,8 @@ NULL
 #' head(select(tmp3, element_at(tmp3$v3, "Valiant")))
 #' tmp4 <- mutate(df, v4 = create_array(df$mpg, df$cyl), v5 = create_array(df$cyl, df$hp))
 #' head(select(tmp4, concat(tmp4$v4, tmp4$v5), arrays_overlap(tmp4$v4, tmp4$v5)))
+#' head(select(tmp4, array_except(tmp4$v4, tmp4$v5), array_intersect(tmp4$v4, tmp4$v5)))
+#' head(select(tmp4, array_union(tmp4$v4, tmp4$v5)))
 #' head(select(tmp4, arrays_zip(tmp4$v4, tmp4$v5), map_from_arrays(tmp4$v4, tmp4$v5)))
 #' head(select(tmp, concat(df$mpg, df$cyl, df$hp)))
 #' tmp5 <- mutate(df, v6 = create_array(df$model, df$model))
@@ -1697,8 +1699,8 @@ setMethod("to_date",
           })
 
 #' @details
-#' \code{to_json}: Converts a column containing a \code{structType}, array of \code{structType},
-#' a \code{mapType} or array of \code{mapType} into a Column of JSON string.
+#' \code{to_json}: Converts a column containing a \code{structType}, a \code{mapType}
+#' or an \code{arrayType} into a Column of JSON string.
 #' Resolving the Column can fail if an unsupported type is encountered.
 #'
 #' @rdname column_collection_functions
@@ -3024,6 +3026,34 @@ setMethod("array_distinct",
             column(jc)
           })
 
+#' @details
+#' \code{array_except}: Returns an array of the elements in the first array but not in the second
+#' array, without duplicates. The order of elements in the result is not determined.
+#'
+#' @rdname column_collection_functions
+#' @aliases array_except array_except,Column-method
+#' @note array_except since 2.4.0
+setMethod("array_except",
+          signature(x = "Column", y = "Column"),
+          function(x, y) {
+            jc <- callJStatic("org.apache.spark.sql.functions", "array_except", x@jc, y@jc)
+            column(jc)
+          })
+
+#' @details
+#' \code{array_intersect}: Returns an array of the elements in the intersection of the given two
+#' arrays, without duplicates.
+#'
+#' @rdname column_collection_functions
+#' @aliases array_intersect array_intersect,Column-method
+#' @note array_intersect since 2.4.0
+setMethod("array_intersect",
+          signature(x = "Column", y = "Column"),
+          function(x, y) {
+            jc <- callJStatic("org.apache.spark.sql.functions", "array_intersect", x@jc, y@jc)
+            column(jc)
+          })
+
 #' @details
 #' \code{array_join}: Concatenates the elements of column using the delimiter.
 #' Null values are replaced with nullReplacement if set, otherwise they are ignored.
@@ -3149,6 +3179,20 @@ setMethod("arrays_overlap",
             column(jc)
           })
 
+#' @details
+#' \code{array_union}: Returns an array of the elements in the union of the given two arrays,
+#' without duplicates.
+#'
+#' @rdname column_collection_functions
+#' @aliases array_union array_union,Column-method
+#' @note array_union since 2.4.0
+setMethod("array_union",
+          signature(x = "Column", y = "Column"),
+          function(x, y) {
+            jc <- callJStatic("org.apache.spark.sql.functions", "array_union", x@jc, y@jc)
+            column(jc)
+          })
+
 #' @details
 #' \code{arrays_zip}: Returns a merged array of structs in which the N-th struct contains all N-th
 #' values of input arrays.
@@ -3167,6 +3211,19 @@ setMethod("arrays_zip",
             column(jc)
           })
 
+#' @details
+#' \code{shuffle}: Returns a random permutation of the given array.
+#'
+#' @rdname column_collection_functions
+#' @aliases shuffle shuffle,Column-method
+#' @note shuffle since 2.4.0
+setMethod("shuffle",
+          signature(x = "Column"),
+          function(x) {
+            jc <- callJStatic("org.apache.spark.sql.functions", "shuffle", x@jc)
+            column(jc)
+          })
+
 #' @details
 #' \code{flatten}: Creates a single array from an array of arrays.
 #' If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
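The documented semantics of the new array set operations, per the roxygen docs above, are: results contain no duplicates, and for `array_except` the element order is not guaranteed. A plain-Python sketch of those semantics (an illustration only, not the Catalyst implementation; this sketch happens to keep first-seen order, which Spark does not promise):

```python
import random

def _dedup(seq):
    # Drop duplicates while keeping first-seen order
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

def array_except(a, b):
    # Elements of a that do not appear in b, without duplicates
    exclude = set(b)
    return _dedup(x for x in a if x not in exclude)

def array_intersect(a, b):
    # Elements appearing in both arrays, without duplicates
    keep = set(b)
    return _dedup(x for x in a if x in keep)

def array_union(a, b):
    # Elements of either array, without duplicates
    return _dedup(list(a) + list(b))

def shuffle_array(a):
    # Random permutation of the array, as `shuffle` is documented to return
    return random.sample(list(a), len(a))

a, b = [1, 2, 2, 3], [2, 4]
print(array_except(a, b))     # [1, 3]
print(array_intersect(a, b))  # [2]
print(array_union(a, b))      # [1, 2, 3, 4]
```

Unlike `exceptAll`/`intersectAll` on DataFrames, these column functions deduplicate: the repeated `2` in the left array contributes only one element to each result.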

R/pkg/R/generics.R (22 additions, 0 deletions)

@@ -471,6 +471,9 @@ setGeneric("explain", function(x, ...) { standardGeneric("explain") })
 #' @rdname except
 setGeneric("except", function(x, y) { standardGeneric("except") })
 
+#' @rdname exceptAll
+setGeneric("exceptAll", function(x, y) { standardGeneric("exceptAll") })
+
 #' @rdname nafunctions
 setGeneric("fillna", function(x, value, cols = NULL) { standardGeneric("fillna") })
 
@@ -495,6 +498,9 @@ setGeneric("insertInto", function(x, tableName, ...) { standardGeneric("insertIn
 #' @rdname intersect
 setGeneric("intersect", function(x, y) { standardGeneric("intersect") })
 
+#' @rdname intersectAll
+setGeneric("intersectAll", function(x, y) { standardGeneric("intersectAll") })
+
 #' @rdname isLocal
 setGeneric("isLocal", function(x) { standardGeneric("isLocal") })
 
@@ -761,6 +767,14 @@ setGeneric("array_contains", function(x, value) { standardGeneric("array_contain
 #' @name NULL
 setGeneric("array_distinct", function(x) { standardGeneric("array_distinct") })
 
+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("array_except", function(x, y) { standardGeneric("array_except") })
+
+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("array_intersect", function(x, y) { standardGeneric("array_intersect") })
+
 #' @rdname column_collection_functions
 #' @name NULL
 setGeneric("array_join", function(x, delimiter, ...) { standardGeneric("array_join") })
@@ -793,6 +807,10 @@ setGeneric("array_sort", function(x) { standardGeneric("array_sort") })
 #' @name NULL
 setGeneric("arrays_overlap", function(x, y) { standardGeneric("arrays_overlap") })
 
+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("array_union", function(x, y) { standardGeneric("array_union") })
+
 #' @rdname column_collection_functions
 #' @name NULL
 setGeneric("arrays_zip", function(x, ...) { standardGeneric("arrays_zip") })
@@ -1214,6 +1232,10 @@ setGeneric("shiftRight", function(y, x) { standardGeneric("shiftRight") })
 #' @name NULL
 setGeneric("shiftRightUnsigned", function(y, x) { standardGeneric("shiftRightUnsigned") })
 
+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("shuffle", function(x) { standardGeneric("shuffle") })
+
 #' @rdname column_math_functions
 #' @name NULL
 setGeneric("signum", function(x) { standardGeneric("signum") })
