[SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference #28120
Spark SQL provides build-in Aggregate functions defines in dataset API and SQL interface. Aggregate functions |
`defines in dataset API` -> `defined in the dataset API`?
thanks, done.
<tbody> | ||
<tr> | ||
<td> <b>{avg | mean}</b>(<i>e: Column</i>)</td> |
Could you list the functions in alphabetical order?
ok
</tr> | ||
<tr> | ||
<td> <b>approx_count_distinct</b>(<i>e: Column</i>)</td> |
It has the optional `relativeSD`. Change to `approx_count_distinct(expr[, relativeSD])`?
thanks, done
<td> <b>count_if</b>(<i>Predicate</i>)</td> | ||
<td>Expression that will be used for aggregation calculation</td> | ||
<td>Returns the count number from the predicate evaluate to `TRUE` values</td> |
Backticks don't work inside HTML; use `<code>TRUE</code>`?
done
<td> <b>{first | first_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td> | ||
<td>Column name[, True/False(default)]</td> | ||
<td>Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic</td> |
<code>isIgnoreNull</code>
?
done
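For readers following the `isIgnoreNull` discussion, here is a minimal pure-Python sketch of the behavior the quoted description states: return the first value of a group, optionally skipping nulls. It is not Spark's implementation; the Python names `first_value` and `ignore_nulls` are hypothetical stand-ins, and `None` stands in for SQL `NULL`.

```python
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def first_value(values: Iterable[Optional[T]], ignore_nulls: bool = False) -> Optional[T]:
    """Return the first value in iteration order.

    When ignore_nulls is True, None entries are skipped, so the result is the
    first non-null value (or None if there is none).
    """
    for value in values:
        if value is not None or not ignore_nulls:
            return value
    return None  # empty group, or every value was None


group = [None, 5, 20]
print(first_value(group))                     # null is kept by default
print(first_value(group, ignore_nulls=True))  # null is skipped
```

The result depends entirely on the order in which the rows arrive, which is why the description calls the function non-deterministic.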
</tr> | ||
<tr> | ||
<td> <b>{percentile_approx | percentile_approx}</b>(<i>e: Column, percentage [, frequency]</i>)</td> |
Is this a 3.1 function?
* @group agg_funcs
* @since 3.1.0
*/
def percentile_approx(e: Column, percentage: Column, accuracy: Column): Column = {
</table> | ||
|
||
### Example |
Nit: `Example` -> `Examples`?
Sometimes you have a blank line between examples and sometimes you don't. Could you make it consistent and always put a blank line in between?
thanks, done
Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions. | ||
|
||
**Note:** Every below function has another signature which take String as a column name instead of Column. |
`which take String` -> `which takes String`?
done
<td> <b>{last | last_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td> | ||
<td>Column name[, True/False(default)]</td> | ||
<td>Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic</td> |
<code>isIgnoreNull</code>
?
done
ok to test |
Spark SQL provides build-in Aggregate functions defined in the dataset API and SQL interface. Aggregate functions |
nit: `Aggregate` -> `aggregate`, along with the others? e.g., https://github.com/apache/spark/blame/master/docs/sql-ref-syntax-qry-select-having.md#L71
thanks, changed.
operate on a group of rows and return a single value. | ||
|
||
Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions. |
nit: `spark SQL` -> `Spark SQL`
changed.
Test build #120853 has finished for PR 28120 at commit
|
Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions. | ||
|
||
**Note:** Every below function has another signature which takes String as a column name instead of Column. |
`All functions below have another signature...`?
Thanks, will make changes.
<tbody> | ||
<tr> | ||
<td> <b>{any | some | bool_or}</b>(<i>e: Column</i>)</td> |
nit: `e` -> `c` in the argument?
Thanks, changed all the `e` -> `c`.
|count_min_sketch(c1, 0.9, 0.2, 3) | | ||
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06]| |
This feels a bit too long. How about omitting most of the output, e.g., `[00 00 00 01 00...`?
changed
Test build #120858 has finished for PR 28120 at commit
|
|
||
* Table of contents | ||
{:toc} |
Since there are few sections, how about removing `{:toc}`?
sure
<td> <b>{avg | mean}</b>(<i>c: Column</i>)</td> | ||
<td>Column name</td> | ||
<td> Returns the average of values in the input column.</td> |
super nit: could you remove unnecessary spaces? e.g., <td> Returns...
sure
<td> <b>{bool_and | every}</b>(<i>c: Column</i>)</td> | ||
<td>Column name</td> | ||
<td>Returns true if all values are true</td> |
nit: add a period in the end.
ok
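As a rough illustration of the `{bool_and | every}` and `{any | some | bool_or}` entries, the following pure-Python sketch reduces a group of booleans to a single value. The null handling (nulls are ignored, and a group with no non-null values yields `None`) is an assumption made for this sketch and is not stated in the quoted descriptions.

```python
from typing import Iterable, Optional


def bool_and(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """True only if every non-null value is True (assumed null handling)."""
    non_null = [v for v in values if v is not None]
    return all(non_null) if non_null else None


def bool_or(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """True if at least one non-null value is True (assumed null handling)."""
    non_null = [v for v in values if v is not None]
    return any(non_null) if non_null else None


print(bool_and([True, True, None]))
print(bool_or([False, None, True]))
```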
<td> <b>collect_list</b>(<i>c: Column</i>)</td> | ||
<td>Column name</td> | ||
<td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle</td> |
nit: add a period in the end.
sure
<td> <b>corr</b>(<i>c1: Column, c2: Column</i>)</td> | ||
<td>Column name</td> | ||
<td>Returns Pearson coefficient of correlation between a set of number pairs</td> |
nit: add a period in the end.
ok
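The `corr` entry refers to the Pearson correlation coefficient of a set of number pairs. A minimal pure-Python sketch of that formula follows; dropping pairs that contain a null and returning `None` when either side has zero variance are assumptions of this sketch, not behavior confirmed in this thread.

```python
import math
from typing import Iterable, Optional, Tuple

Pair = Tuple[Optional[float], Optional[float]]


def corr(pairs: Iterable[Pair]) -> Optional[float]:
    """Pearson correlation coefficient over (x, y) pairs."""
    # Assumption: a pair with a null on either side is ignored.
    clean = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(clean)
    if n == 0:
        return None
    mean_x = sum(x for x, _ in clean) / n
    mean_y = sum(y for _, y in clean) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in clean)
    var_x = sum((x - mean_x) ** 2 for x, _ in clean)
    var_y = sum((y - mean_y) ** 2 for _, y in clean)
    if var_x == 0 or var_y == 0:
        return None  # assumption: undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


print(corr([(1, 2), (2, 4), (3, 6)]))  # perfectly correlated pairs
```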
</tr> | ||
<tr> | ||
<td> <b>count</b>(<b>DISTINCT</b> <i> c: Column[, c: Column</i>])</td> |
Could we merge the entries for `count` into a single entry?
ok
<td> <b>count_if</b>(<i>Predicate</i>)</td> | ||
<td>Expression that will be used for aggregation calculation</td> | ||
<td>Returns the count number from the predicate evaluate to <code>TRUE</code> values</td> |
It seems no `<code>TRUE</code>` exists in the existing docs, so `<code>TRUE</code>` -> `` `TRUE` ``?
ok
<td> <b>count_min_sketch</b>(<i>c: Column, eps: double, confidence: double, seed integer</i>)</td> | ||
<td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td> | ||
<td>Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space..</td> |
`space..</td>` -> `space.</td>`
done
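To make the `eps`, `confidence`, and `seed` arguments more concrete, here is a small pure-Python count-min sketch. It is an illustration only: the sizing formulas (`width = ceil(2 / eps)`, `depth = ceil(-ln(1 - confidence) / ln 2)`) and the CRC32-based hashing are assumptions of this sketch, and the class is not the `CountMinSketch` that the byte array in the quoted description deserializes to. A count-min sketch answers frequency queries and may overestimate a count, but never underestimates it.

```python
import math
import random
import zlib


class CountMinSketch:
    """Toy count-min sketch; sizing and hashing are illustrative assumptions."""

    def __init__(self, eps: float, confidence: float, seed: int) -> None:
        if not (0.0 < eps < 1.0 and 0.0 < confidence < 1.0):
            raise ValueError("eps and confidence must be between 0.0 and 1.0")
        self.width = math.ceil(2 / eps)
        self.depth = math.ceil(-math.log1p(-confidence) / math.log(2))
        rng = random.Random(seed)  # the seed makes the hash salts reproducible
        self.salts = [rng.getrandbits(32) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _column(self, item, row: int) -> int:
        # Deterministic per-row hash: salt the item's repr and take CRC32.
        data = f"{self.salts[row]}:{item!r}".encode("utf-8")
        return zlib.crc32(data) % self.width

    def add(self, item, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._column(item, row)] += count
        self.total += count

    def estimate(self, item) -> int:
        # The minimum across rows is the least-contaminated counter.
        return min(self.table[row][self._column(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch(eps=0.9, confidence=0.2, seed=3)
for value in [1, 2, 2, 3, 3, 3, 3]:
    sketch.add(value)
print(sketch.width, sketch.depth, sketch.total, sketch.estimate(3))
```

A loose `eps` such as 0.9 gives a very narrow table, so collisions and overestimates are likely; smaller `eps` and higher `confidence` trade memory for accuracy.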
</tr> | ||
<tr> | ||
<td> <b>{first | first_value}</b>(<i>c: Column[, isIgnoreNull]</i>)</td> |
`isIgnoreNull` -> `isIgnoreNull: Boolean`? Btw, what does `Column` mean? I think we need to use concrete SQL types. How about following the PostgreSQL docs? https://www.postgresql.org/docs/current/functions-aggregate.html
I was thinking of `Column` as a type in Spark, but it is not a concrete type. I changed it to `expression`, which may be a better name. I also changed the table format to use concrete SQL types. I used the concrete types that the internal code checks for, although the functions may accept other data types as input (Spark will try to cast them).
Ah, the current one looks better! Thanks for the work!
Btw, I think it's better to use the same type names here as in https://github.com/apache/spark/blob/master/docs/sql-ref-datatypes.md
Do my current type names look OK? `boolean`, `numeric`, `string`
Test build #120893 has finished for PR 28120 at commit
|
Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. Below is the list of functions. | ||
|
||
**Note:** All functions below have another signature which takes String as a expression. |
We don't need this now?
ok
operate on a group of rows and return a single value. | ||
|
||
Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. Below is the list of functions. |
What does this mean? Is this info useful for users?
Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL.
I see, yah, it is internal. I will remove.
<table class="table"> | ||
<thead> | ||
<tr><th style="width:25%">Function</th><th>Parameter Type(s)</th><th>Description</th></tr> |
nit: as the Pg doc does, I like `Argument` better than `Parameter`.
ok
<td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td> | ||
<td>(long, double)</td> | ||
<td>RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td> |
nit: better to wrap `RelativeSD` with backticks?
ok
### Examples | ||
{% highlight sql %} | ||
--base table |
nit: need a space after --
-- A test table used in the following examples
?
sure
Test build #120979 has finished for PR 28120 at commit
|
Could you apply the same cleanup with #28151 ? |
<tr> | ||
<td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td> | ||
<td>(long, double)</td> |
How about using more SQL-like type names? e.g., `long` -> `bigint`
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L547-L558
Also, how about using the same format for optional params? e.g., `(long, double)` -> `(long[, double])`
ok
<tr> | ||
<td><b>{avg | mean}</b>(<i>expression</i>)</td> | ||
<td>short, float, byte, decimal, double, int, long or string</td> |
<td>tinyint, short, int, bigint, float, double, or decimal</td>
?
<td><b>count</b>([<b>DISTINCT</b>] <i>*</i>)</td> | ||
<td>none</td> | ||
<td>If specified <code>DISTINCT</code>, returns the total number of retrieved rows are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.</td> |
nit: `; Otherwise` -> `; otherwise`?
thanks, done
<td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td> | ||
<td>(any, any)</td> | ||
<td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td> |
ditto: `; Otherwise` -> `; otherwise`?
done
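The two `count` descriptions quoted above can be summarized in a short pure-Python sketch. Rows are modeled as tuples of expression values with `None` for SQL `NULL`; the helper names `count_star` and `count_exprs` are hypothetical and only mirror the wording of the descriptions.

```python
from typing import Optional, Sequence, Tuple

Row = Tuple[Optional[object], ...]


def count_star(rows: Sequence[Row]) -> int:
    """count(*): every retrieved row counts, including rows containing nulls."""
    return len(rows)


def count_exprs(rows: Sequence[Row], distinct: bool = False) -> int:
    """count([DISTINCT] expr1[, expr2]): rows whose expressions are all non-null.

    With distinct=True, duplicate combinations are counted once.
    """
    non_null = [row for row in rows if all(v is not None for v in row)]
    return len(set(non_null)) if distinct else len(non_null)


rows = [(1, "a"), (1, "a"), (None, "b"), (2, None)]
print(count_star(rows), count_exprs(rows), count_exprs(rows, distinct=True))
```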
<tr> | ||
<td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td> | ||
<td>(any, any)</td> |
`(any, any)` -> `(any[, any])`
ok
<tr> | ||
<td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td> | ||
<td>(byte, short, int, long, string or binary, double, double, integer)</td> |
nit: double, double,
-> double, double,
done
<tr> | ||
<td><b>count_if</b>(<i>predicate</i>)</td> | ||
<td>expression that will be used for aggregation calculation</td> |
How about `expression that will be used for aggregation calculation` -> `expression that returns a boolean value`?
yah, this is better. Done
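A minimal pure-Python sketch of what "an expression that returns a boolean value" means for `count_if`: only rows for which the predicate evaluates to true are counted. Treating a null predicate result like false is an assumption of this sketch, and the Python `count_if` below is a hypothetical helper, not Spark code.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def count_if(rows: Iterable[T], predicate: Callable[[T], Optional[bool]]) -> int:
    """Count rows whose predicate is exactly True (None counts as not true)."""
    return sum(1 for row in rows if predicate(row) is True)


def is_even(value: Optional[int]) -> Optional[bool]:
    # A comparison against NULL yields NULL, modeled here as None.
    return None if value is None else value % 2 == 0


print(count_if([1, 2, None, 4], is_even))
```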
<tr> | ||
<td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td> | ||
<td>(any, boolean)</td> |
`(any, boolean)` -> `(any[, boolean])`
ok
<td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td> | ||
<td>(any, boolean)</td> | ||
<td>Returns the first value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td> |
<code>isIgnoreNull</code>
-> isIgnoreNull
? Should we use ` or <code> for arguments?
Maybe backticks; it seems we don't have this `isIgnoreNull` in the Spark code.
<tr> | ||
<td><b>{last | last_value}</b>(<i>expression[, isIgnoreNull]</i>)</td> | ||
<td>(any, boolean)</td> |
(any[, boolean])
changed
<tr> | ||
<td><b>max</b>(<i>expression</i>)</td> | ||
<td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td> |
Better to sort these in a consistent order, e.g., `tinyint, short, int, bigint, float, double, date, timestamp, string, or arrays of these types`?
sure
<tr> | ||
<td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td> | ||
<td>short, float, byte, decimal, double, int, or long, double, int</td> |
Ah, I see. How about using different separators? e.g., `(short|float|byte|decimal|double|int|long, double[, int])`
ok
<tr> | ||
<td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td> | ||
<td>short, float, byte, decimal, double, int, or long, double, int</td> |
`(tinyint|short|int|bigint|float|double|date|timestamp, array of double[, int])`?
ok
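For the `percentile(expression, percentage[, frequency])` signature discussed above, a pure-Python sketch may help show how the three arguments interact. It assumes linear interpolation between the two nearest ranks and treats `frequency` as "this value occurs that many times"; both are assumptions of this sketch and are not confirmed anywhere in this thread.

```python
import math
from typing import Optional, Sequence


def percentile(values: Sequence[Optional[float]],
               percentage: float,
               frequencies: Optional[Sequence[int]] = None) -> Optional[float]:
    """Exact percentile with linear interpolation (assumed definition)."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0 and 1")
    if frequencies is None:
        frequencies = [1] * len(values)
    # Repeat each non-null value `frequency` times, then sort.
    expanded = sorted(v for v, f in zip(values, frequencies)
                      if v is not None for _ in range(f))
    if not expanded:
        return None
    position = (len(expanded) - 1) * percentage
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return float(expanded[lower])
    # Weight the two neighbours by their distance from the target position.
    return (upper - position) * expanded[lower] + (position - lower) * expanded[upper]


print(percentile([1, 2, 3, 4], 0.5))
print(percentile([10, 20], 0.5, frequencies=[1, 3]))
```

The array form in the second table entry would simply evaluate this once per requested percentage.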
SELECT COLLECT_LIST(c4) FROM buildin_agg; | ||
+------------------------------------------------------+ | ||
|collect_list(c4) | |
Could you make the output right-aligned along with the others?
sure
+-------+ | ||
|
||
SELECT BOOL_OR(c5) FROM buildin_agg; |
Could you use lowercase except for the SQL keywords? e.g., `SELECT bool_or(c5) FROM buildin_agg;`
`BOOL_OR` is the aggregate function; it is an alias of `any`.
Yea, I know that. What do you mean? Since it is an alias of `any`, you cannot lowercase it?
Test build #121046 has finished for PR 28120 at commit
|
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td> | ||
<td>(short|float|byte|decimal|double|int|bigint, double[, int])</td> | ||
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td> | ||
</tr> | ||
<tr> | ||
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td> | ||
<td>(date|timestamp, double[, int])</td> | ||
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td> | ||
</tr> | ||
<tr> | ||
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td> | ||
<td>(short|float|byte|decimal|double|int|bigint, array of double[, int])</td> | ||
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td> | ||
</tr> | ||
<tr> | ||
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td> | ||
<td>(date|timestamp, array of double[, int])</td> | ||
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td> | ||
</tr> |
Could you check the 4 entries above again? `{percentile_approx | percentile_approx}`? Do we need the 4 entries for `percentile_approx`? Can't we merge them?
<tr> | ||
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td> | ||
<td>(short|float|byte|decimal|double|int|bigint, array of double[, int])</td> |
no tinyint?
</tr> | ||
<tr> | ||
<td><b>{first | first_value}</b>(<i>expression[, `isIgnoreNull`]</i>)</td> |
we need the backquote here in the argument type section?
<tr> | ||
<td><b>max_by</b>(<i>expression1, expression2</i>)</td> | ||
<td>tinyint|short|int|bigint|float|double|date|timestamp|string, or arrays of these types</td> |
Could you check again if all the input types are correct? `max_by`/`min_by` seem to accept null types and a struct of orderable element types?
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/MaxByAndMinBy.scala#L49
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala#L101
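To illustrate why the type of the second argument matters, here is a pure-Python sketch of the `max_by`/`min_by` idea: return the first expression's value from the row where the second (ordering) expression is largest or smallest. Skipping rows whose ordering value is null is an assumption of this sketch, and Python tuples are used as a stand-in for a struct of orderable elements.

```python
from typing import Iterable, Optional, Tuple, TypeVar

V = TypeVar("V")
Row = Tuple[V, Optional[object]]


def max_by(rows: Iterable[Row]) -> Optional[V]:
    """Value paired with the largest ordering key (null keys skipped)."""
    candidates = [(value, key) for value, key in rows if key is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


def min_by(rows: Iterable[Row]) -> Optional[V]:
    """Value paired with the smallest ordering key (null keys skipped)."""
    candidates = [(value, key) for value, key in rows if key is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[1])[0]


rows = [("a", 10), ("b", 50), ("c", 20)]
print(max_by(rows), min_by(rows))
# Tuples compare element by element, much like a struct of orderable fields.
print(max_by([("x", (1, 2)), ("y", (1, 3))]))
```

Only the ordering expression needs to be orderable; the returned expression can be of any type.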
Sorry guys I saw this just now. Can we reuse https://spark.apache.org/docs/latest/api/sql/index.html? I don't think we should duplicate them. We should auto-generate (see also #27459) probably after adding some more fields in |
</tr> | ||
<tr> | ||
<td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td> |
I think we don't need to say `array` here because you said `array of double` in the argument section below.
</tr> | ||
<tr> | ||
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td> |
`frequency` -> `accuracy`?
This issue has been resolved in #28224. Welcome any activity to improve the document. Anyway, thanks for the work! |
What changes were proposed in this pull request?
Document built-in aggregate functions.
Why are the changes needed?
To make SQL Reference complete
Does this PR introduce any user-facing change?
Yes










before:
None
After:
How was this patch tested?
Manually built the docs and checked the output.
Notes:
I listed the aggregate functions based on the aggregate functions section of `FunctionRegistry`. Here are the ones I didn't include; let me know if they are needed in this PR:
- cube
- rollup
- grouping
- grouping_id
- aggregate