[SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference #28120

Closed
kevinyu98 wants to merge 9 commits

kevinyu98
Contributor

What changes were proposed in this pull request?

Document the built-in aggregate functions.

Why are the changes needed?

To make the SQL Reference complete.

Does this PR introduce any user-facing change?

Yes
before:
None
After:
(10 screenshots of the rendered documentation page, taken 2020-04-04)

How was this patch tested?

Built the docs manually and checked the output.

Notes:
I listed the aggregate functions based on the aggregate functions section of the functionRegistry.
Here are the ones I didn't include; let me know if they are needed in this PR.
- cube
- rollup
- grouping
- grouping_id
- aggregate

Spark SQL provides build-in Aggregate functions defines in dataset API and SQL interface. Aggregate functions
Contributor

defines in dataset API -> defined in the dataset API?

Contributor Author

thanks, done.

<tbody>
<tr>
<td> <b>{avg | mean}</b>(<i>e: Column</i>)</td>
Contributor

Could you list the functions in alphabetical order?

Contributor Author

ok

</tr>
<tr>
<td> <b>approx_count_distinct</b>(<i>e: Column</i>)</td>
Contributor

It has the optional relativeSD. Change to approx_count_distinct(expr[, relativeSD])?

Contributor Author

thanks, done

<td> <b>count_if</b>(<i>Predicate</i>)</td>
<td>Expression that will be used for aggregation calculation</td>
<td>Returns the count number from the predicate evaluate to `TRUE` values</td>
Contributor

backtick doesn't work inside html, use <code>TRUE</code>?

Contributor Author

done

<td> <b>{first | first_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
<td>Column name[, True/False(default)]</td>
<td>Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic</td>
Contributor

<code>isIgnoreNull</code>?

Contributor Author

done

</tr>
<tr>
<td> <b>{percentile_approx | percentile_approx}</b>(<i>e: Column, percentage [, frequency]</i>)</td>
Contributor

Is this a 3.1 function?

   * @group agg_funcs
   * @since 3.1.0
   */
  def percentile_approx(e: Column, percentage: Column, accuracy: Column): Column = {

</table>

### Example
Contributor

Nit: Example -> Examples?
Sometimes you have a blank line between examples, sometimes you don't. I guess make it consistent and always have a blank line in between?

Contributor Author

thanks, done

Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions.

**Note:** Every below function has another signature which take String as a column name instead of Column.
Contributor

which take String -> which takes String?

Contributor Author

done

<td> <b>{last | last_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
<td>Column name[, True/False(default)]</td>
<td>Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic</td>
Contributor

<code>isIgnoreNull</code>?

Contributor Author

done

Member

maropu commented Apr 6, 2020

ok to test

Spark SQL provides build-in Aggregate functions defined in the dataset API and SQL interface. Aggregate functions
Member

Contributor Author

thanks, changed.

operate on a group of rows and return a single value.

Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions.
Member

nit: spark SQL -> Spark SQL

Contributor Author

changed.


SparkQA commented Apr 6, 2020

Test build #120853 has finished for PR 28120 at commit f4aadff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions.

**Note:** Every below function has another signature which takes String as a column name instead of Column.
Member

All functions below have another signature...?

Contributor Author

Thanks, will make changes.

maropu changed the title from "[SPARK-31349][SQL][DOCS] Sql ref buildin-aggregate" to "[SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference" on Apr 6, 2020

<tbody>
<tr>
<td> <b>{any | some | bool_or}</b>(<i>e: Column</i>)</td>
Member

nit: e -> c in the argument?

Contributor Author

thanks, changed all the e -> c .

|count_min_sketch(c1, 0.9, 0.2, 3) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06]|
Member

I feel this is a bit too long... how about omitting part of the output, e.g., [00 00 00 01 00...?

Contributor Author

changed


SparkQA commented Apr 6, 2020

Test build #120858 has finished for PR 28120 at commit 5cbecf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


* Table of contents
{:toc}
Member

Since there are few sections, how about removing {:toc}?

Contributor Author

sure

<td> <b>{avg | mean}</b>(<i>c: Column</i>)</td>
<td>Column name</td>
<td> Returns the average of values in the input column.</td>
Member

super nit: could you remove unnecessary spaces? e.g., <td> Returns...

Contributor Author

sure

<td> <b>{bool_and | every}</b>(<i>c: Column</i>)</td>
<td>Column name</td>
<td>Returns true if all values are true</td>
Member

nit: add a period in the end.

Contributor Author

ok
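
For reference, the `bool_and`/`every` row quoted above (and its counterpart `any`/`some`/`bool_or`) can be sketched in plain Python. This is only an illustration of the intended semantics, not Spark's implementation: it assumes NULL inputs (modelled as `None`) are skipped and that a group with no non-NULL values yields NULL. The helper names are hypothetical.

```python
from typing import Iterable, Optional


def bool_and(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """True if every non-NULL value is true; None if nothing is non-NULL."""
    non_null = [v for v in values if v is not None]  # assumption: NULLs are skipped
    return all(non_null) if non_null else None


def bool_or(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """True if at least one non-NULL value is true; None if nothing is non-NULL."""
    non_null = [v for v in values if v is not None]  # assumption: NULLs are skipped
    return any(non_null) if non_null else None


print(bool_and([True, None, True]))   # True
print(bool_or([False, None, False]))  # False
```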

<td> <b>collect_list</b>(<i>c: Column</i>)</td>
<td>Column name</td>
<td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle</td>
Member

nit: add a period in the end.

Contributor Author

sure

<td> <b>corr</b>(<i>c1: Column, c2: Column</i>)</td>
<td>Column name</td>
<td>Returns Pearson coefficient of correlation between a set of number pairs</td>
Member

nit: add a period in the end.

Contributor Author

ok

</tr>
<tr>
<td> <b>count</b>(<b>DISTINCT</b> <i> c: Column[, c: Column</i>])</td>
Member

Could we merge the entries for count into a single entry?

Contributor Author

ok

<td> <b>count_if</b>(<i>Predicate</i>)</td>
<td>Expression that will be used for aggregation calculation</td>
<td>Returns the count number from the predicate evaluate to <code>TRUE</code> values</td>
Member

It seems no <code>TRUE</code> exists in the existing docs, so <code>TRUE</code> -> `TRUE`?

Contributor Author

ok

<td> <b>count_min_sketch</b>(<i>c: Column, eps: double, confidence: double, seed integer</i>)</td>
<td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td>
<td>Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space..</td>
Member

space..</td> -> space.</td>

Contributor Author

done

</tr>
<tr>
<td> <b>{first | first_value}</b>(<i>c: Column[, isIgnoreNull]</i>)</td>
Member

isIgnoreNull -> isIgnoreNull: Boolean? btw, what does Column mean? I think we need to use concrete SQL types. How about following the PostgreSQL docs? https://www.postgresql.org/docs/current/functions-aggregate.html

Contributor Author

I was thinking that Column is a type in Spark, but it is not a concrete type. I changed it to expression, which may be a better name. I also changed the table format to use concrete SQL types. I used the concrete types that the internal code checks for, although the functions may take other data types as input (Spark will try to cast).

Member

Ah, the current one looks better! Thanks for the work!

Member

btw, I think it's better to use the same type names here as in https://github.com/apache/spark/blob/master/docs/sql-ref-datatypes.md

Contributor Author

Do my current type names look OK? boolean, numeric, string


SparkQA commented Apr 7, 2020

Test build #120893 has finished for PR 28120 at commit 85f4181.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. Below is the list of functions.

**Note:** All functions below have another signature which takes String as a expression.
Member

We don't need this now?

Contributor Author

ok

operate on a group of rows and return a single value.

Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. Below is the list of functions.
Member

What does this mean? Is this info useful for users?
Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL.

Contributor Author

I see, yah, it is internal. I will remove.

<table class="table">
<thead>
<tr><th style="width:25%">Function</th><th>Parameter Type(s)</th><th>Description</th></tr>
Member

nit: as the Pg doc does so, I like Argument better than Parameter.

Contributor Author

ok

<td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td>
<td>(long, double)</td>
<td>RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td>
Member

nit: better to wrap RelativeSD with `?

Contributor Author

ok

### Examples
{% highlight sql %}
--base table
Member

nit: need a space after --

Member

-- A test table used in the following examples?

Contributor Author

sure


SparkQA commented Apr 8, 2020

Test build #120979 has finished for PR 28120 at commit 14d303f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

maropu commented Apr 9, 2020

Could you apply the same cleanup as in #28151?

<tr>
<td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td>
<td>(long, double)</td>
Member

Also, how about using the same format for optional params? e.g., (long, double) -> (long[, double])

Contributor Author

ok

<tr>
<td><b>{avg | mean}</b>(<i>expression</i>)</td>
<td>short, float, byte, decimal, double, int, long or string</td>
Member

<td>tinyint, short, int, bigint, float, double, or decimal</td>?

<td><b>count</b>([<b>DISTINCT</b>] <i>*</i>)</td>
<td>none</td>
<td>If specified <code>DISTINCT</code>, returns the total number of retrieved rows are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.</td>
Member

nit: ; Otherwise -> ; otherwise?

Contributor Author

thanks, done

<td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td>
<td>(any, any)</td>
<td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
Member

ditto: ; Otherwise -> ; otherwise?

Contributor Author

done
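
The three `count` behaviours described in the rows quoted above can be sketched in plain Python. This is a rough model of the documented semantics only, with rows as tuples and NULL modelled as `None`; the function names are hypothetical and nothing here reflects how Spark evaluates the aggregate.

```python
from typing import Optional, Sequence, Tuple

Row = Tuple[Optional[object], ...]


def count_star(rows: Sequence[Row]) -> int:
    """count(*): every retrieved row counts, including rows containing NULL."""
    return len(rows)


def count_exprs(rows: Sequence[Row]) -> int:
    """count(expr1[, expr2]): rows where all supplied expressions are non-NULL."""
    return sum(1 for r in rows if all(v is not None for v in r))


def count_distinct(rows: Sequence[Row]) -> int:
    """count(DISTINCT expr1[, expr2]): unique rows among the fully non-NULL ones."""
    return len({r for r in rows if all(v is not None for v in r)})


rows = [(1, "a"), (1, "a"), (None, "b"), (2, None), (2, "c")]
print(count_star(rows), count_exprs(rows), count_distinct(rows))  # 5 3 2
```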

<tr>
<td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td>
<td>(any, any)</td>
Member

(any, any) -> (any[, any])

Contributor Author

ok

<tr>
<td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
<td>(byte, short, int, long, string or binary, double, double, integer)</td>
Member

nit: double, double, -> double, double,

Contributor Author

done

<tr>
<td><b>count_if</b>(<i>predicate</i>)</td>
<td>expression that will be used for aggregation calculation</td>
Member

How about expression that will be used for aggregation calculation -> expression that returns a boolean value?

Contributor Author

yah, this is better. Done
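
To make the "expression that returns a boolean value" wording concrete, here is a minimal plain-Python sketch of `count_if`. It assumes the predicate has already been evaluated per row and that only TRUE results are counted, so FALSE and NULL (modelled as `None`) are both skipped. It is an illustration, not Spark code.

```python
from typing import Iterable, Optional


def count_if(predicate_values: Iterable[Optional[bool]]) -> int:
    """Count the rows whose predicate evaluated to TRUE."""
    return sum(1 for v in predicate_values if v is True)


rows = [1, 5, None, 8, 3]
# Emulate SQL three-valued logic: comparing NULL yields NULL, not FALSE.
predicate = [None if r is None else r > 2 for r in rows]
print(count_if(predicate))  # 3
```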

<tr>
<td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
<td>(any, boolean)</td>
Member

(any, boolean) -> (any[, boolean])

Contributor Author

ok

<td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
<td>(any, boolean)</td>
<td>Returns the first value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
Member

<code>isIgnoreNull</code> -> isIgnoreNull? Should we use ` or <code> for arguments?

Contributor Author (kevinyu98, Apr 9, 2020)

Maybe; it seems we don't have `isIgnoreNull` in the Spark code.
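
As a reference for what the optional boolean argument of `first`/`first_value` and `last`/`last_value` does, here is a small plain-Python sketch. The parameter name `ignore_nulls` is hypothetical, NULL is modelled as `None`, and the input list stands in for whatever row order the group happens to have, which is why the real functions are described as non-deterministic.

```python
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def first_value(values: Iterable[Optional[T]], ignore_nulls: bool = False) -> Optional[T]:
    """Return the first value, or the first non-NULL value when ignore_nulls is set."""
    for v in values:
        if v is not None or not ignore_nulls:
            return v
    return None  # empty group, or only NULLs while ignoring NULLs


def last_value(values: Iterable[Optional[T]], ignore_nulls: bool = False) -> Optional[T]:
    """Same rule as first_value, applied from the end of the group."""
    return first_value(reversed(list(values)), ignore_nulls)


group = [None, 10, 20, None]
print(first_value(group))        # None
print(first_value(group, True))  # 10
print(last_value(group, True))   # 20
```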

<tr>
<td><b>{last | last_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
<td>(any, boolean)</td>
Member

(any[, boolean])

Contributor Author

changed

<tr>
<td><b>max</b>(<i>expression</i>)</td>
<td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
Member

better to sort this in a consistent order, e.g., tinyint, short, int, bigint, float, double, date, timestamp, string, or arrays of these types?

Contributor Author

sure

<tr>
<td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
<td>short, float, byte, decimal, double, int, or long, double, int</td>
Member

Ah, I see. How about using different separators? e.g., (short|float|byte|decimal|double|int|long, double[, int])

Contributor Author

ok

<tr>
<td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
<td>short, float, byte, decimal, double, int, or long, double, int</td>
Member

(tinyint|short|int|bigint|float|double|date|timestamp, array of double[, int])?

Contributor Author

ok

SELECT COLLECT_LIST(c4) FROM buildin_agg;
+------------------------------------------------------+
|collect_list(c4) |
Member (maropu, Apr 9, 2020)

Could you make the output right-aligned along with the others?

Contributor Author

sure

+-------+

SELECT BOOL_OR(c5) FROM buildin_agg;
Member

Could you use lowercases except for the SQL keywords? e.g., SELECT bool_or(c5) FROM buildin_agg;

Contributor Author

BOOL_OR is the agg function, it is the alias of any.

Member

Yea, I know that. What do you mean? Since it is the alias of any, you cannot lowercase it?


SparkQA commented Apr 9, 2020

Test build #121046 has finished for PR 28120 at commit 9e283b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines +142 to +161
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
<td>(short|float|byte|decimal|double|int|bigint, double[, int])</td>
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
</tr>
<tr>
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
<td>(date|timestamp, double[, int])</td>
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
</tr>
<tr>
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
<td>(short|float|byte|decimal|double|int|bigint, array of double[, int])</td>
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
</tr>
<tr>
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
<td>(date|timestamp, array of double[, int])</td>
<td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
</tr>
Member (maropu, Apr 10, 2020)

Could you check the 4 entries above again? {percentile_approx | percentile_approx}? we need the 4 entries for percentile_approx? We cannot merge them?

<tr>
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
<td>(short|float|byte|decimal|double|int|bigint, array of double[, int])</td>
Member

no tinyint?

</tr>
<tr>
<td><b>{first | first_value}</b>(<i>expression[, `isIgnoreNull`]</i>)</td>
Member

we need the backquote here in the argument type section?

<tr>
<td><b>max_by</b>(<i>expression1, expression2</i>)</td>
<td>tinyint|short|int|bigint|float|double|date|timestamp|string, or arrays of these types</td>
Member

@HyukjinKwon
Member

Sorry guys I saw this just now. Can we reuse https://spark.apache.org/docs/latest/api/sql/index.html? I don't think we should duplicate them.

We should auto-generate (see also #27459) probably after adding some more fields in ExpressionDescription (see also #24259)

</tr>
<tr>
<td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
Member

I think we don't need to say array here because you said array of double in the argument section below.

</tr>
<tr>
<td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
Member

frequency -> accuracy?

Member

maropu commented Apr 22, 2020

This issue has been resolved in #28224. Welcome any activity to improve the document. Anyway, thanks for the work!

maropu closed this on Apr 22, 2020

5 participants