Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
195 changes: 195 additions & 0 deletions doc/sql.extensions/README.percentile_disc_cont.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
# PERCENTILE_DISC and PERCENTILE_CONT functions

The `PERCENTILE_CONT` and `PERCENTILE_DISC` functions are known as inverse distribution functions.
These functions operate on an ordered set. Both functions can be used as aggregate or window functions.

## PERCENTILE_DISC

`PERCENTILE_DISC` is an inverse distribution function that assumes a discrete distribution model.
It takes a percentile value and a sort specification and returns an element from the set.
Nulls are ignored in the calculation.

Syntax for the `PERCENTILE_DISC` function as an aggregate function.

```
PERCENTILE_DISC(<percent>) WITHIN GROUP (ORDER BY <expr> [ASC | DESC])

<percent> ::= numeric_literal | <parameter> | <variable>
```

Syntax for the `PERCENTILE_DISC` function as an window function.

```
PERCENTILE_DISC(<percent>) WITHIN GROUP (ORDER BY <expr> [ASC | DESC])
OVER (PARTITION BY <part_expr>)

<percent> ::= numeric_literal | <parameter> | <variable>
```

The first argument `<percent>` must evaluate to a numeric value between 0 and 1, because it is a percentile value.
This expression must be constant within each aggregate group.
The `ORDER BY` clause takes a single expression that can be of any type that can be sorted.

The function `PERCENTILE_DISC` returns a value of the same type as the argument in `ORDER BY`.

For a given percentile value `P`, `PERCENTILE_DISC` sorts the values of the expression in the `ORDER BY` clause and
returns the value with the smallest `CUME_DIST` value (with respect to the same sort specification)
that is greater than or equal to `P`.

### Analytic Example

```sql
SELECT
DEPT_NO,
SALARY,
CUME_DIST() OVER(PARTITION BY DEPT_NO ORDER BY SALARY) AS "CUME_DIST",
PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY SALARY)
OVER(PARTITION BY DEPT_NO) AS MEDIAN_DISC
FROM EMPLOYEE
WHERE DEPT_NO < 600
ORDER BY 1, 2;
```

```
DEPT_NO SALARY CUME_DIST MEDIAN_DISC
======= ===================== ======================= =====================
000 53793.00 0.5000000000000000 53793.00
000 212850.00 1.000000000000000 53793.00
100 44000.00 0.5000000000000000 44000.00
100 111262.50 1.000000000000000 44000.00
110 61637.81 0.5000000000000000 61637.81
110 68805.00 1.000000000000000 61637.81
115 6000000.00 0.5000000000000000 6000000.00
115 7480000.00 1.000000000000000 6000000.00
120 22935.00 0.3333333333333333 33620.63
120 33620.63 0.6666666666666666 33620.63
120 39224.06 1.000000000000000 33620.63
121 110000.00 1.000000000000000 110000.00
123 38500.00 1.000000000000000 38500.00
125 33000.00 1.000000000000000 33000.00
130 86292.94 0.5000000000000000 86292.94
130 102750.00 1.000000000000000 86292.94
140 100914.00 1.000000000000000 100914.00
180 42742.50 0.5000000000000000 42742.50
180 64635.00 1.000000000000000 42742.50
```

## PERCENTILE_CONT

`PERCENTILE_CONT` is an inverse distribution function that assumes a continuous distribution model.
It takes a percentile value and a sort specification and returns an element from the set.
Nulls are ignored in the calculation.

Syntax for the `PERCENTILE_CONT` function as an aggregate function.

```
PERCENTILE_CONT(<percent>) WITHIN GROUP (ORDER BY <expr> [ASC | DESC])

<percent> ::= numeric_literal | <parameter> | <variable>
```

Syntax for the `PERCENTILE_CONT` function as an window function.

```
PERCENTILE_CONT(<percent>) WITHIN GROUP (ORDER BY <expr> [ASC | DESC])
OVER (PARTITION BY <part_expr>)

<percent> ::= numeric_literal | <parameter> | <variable>
```

The first argument `<percent>` must evaluate to a numeric value between 0 and 1, because it is a percentile value.
This expression must be constant within each aggregate group.
The `ORDER BY` clause takes a single expression, which must be of numeric type to perform interpolation.

The `PERCENTILE_CONT` function returns a value of type `DOUBLE PRECISION` or `DECFLOAT(34)` depending on the type
of the argument in the `ORDER BY` clause. A value of type `DECFLOAT(34)` is returned if `ORDER BY` contains
an expression of one of the types `INT128`, `NUMERIC(38, x)` or `DECFLOAT(16 | 34)`, otherwise - `DOUBLE PRECISION`.

The result of `PERCENTILE_CONT` is computed by linear interpolation between values after ordering them.
Using the percentile value (`P`) and the number of rows (`N`) in the aggregation group, you can compute
the row number you are interested in after ordering the rows with respect to the sort specification.
This row number (`RN`) is computed according to the formula `RN = (1 + (P * (N - 1))`.
The final result of the aggregate function is computed by linear interpolation between the values from rows
at row numbers `CRN = CEILING(RN)` and `FRN = FLOOR(RN)`.

```
function f(N) ::= value of expression from row at N

if (CRN = FRN = RN) then
return f(RN)
else
return (CRN - RN) * f(FRN) + (RN - FRN) * f(CRN)
```

### Analytic Example

```sql
SELECT
DEPT_NO,
SALARY,
PERCENT_RANK() OVER(PARTITION BY DEPT_NO ORDER BY SALARY) AS "PERCENT_RANK",
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY SALARY)
OVER(PARTITION BY DEPT_NO) AS MEDIAN_CONT
FROM EMPLOYEE
WHERE DEPT_NO < 600
ORDER BY 1, 2;
```

```
DEPT_NO SALARY PERCENT_RANK MEDIAN_CONT
======= ===================== ======================= =======================
000 53793.00 0.000000000000000 133321.5000000000
000 212850.00 1.000000000000000 133321.5000000000
100 44000.00 0.000000000000000 77631.25000000000
100 111262.50 1.000000000000000 77631.25000000000
110 61637.81 0.000000000000000 65221.40500000000
110 68805.00 1.000000000000000 65221.40500000000
115 6000000.00 0.000000000000000 6740000.000000000
115 7480000.00 1.000000000000000 6740000.000000000
120 22935.00 0.000000000000000 33620.63000000000
120 33620.63 0.5000000000000000 33620.63000000000
120 39224.06 0.2500000000000000 33620.63000000000
121 110000.00 0.000000000000000 110000.0000000000
123 38500.00 0.000000000000000 38500.00000000000
125 33000.00 0.000000000000000 33000.00000000000
130 86292.94 0.000000000000000 94521.47000000000
130 102750.00 1.000000000000000 94521.47000000000
140 100914.00 0.000000000000000 100914.0000000000
180 42742.50 0.000000000000000 53688.75000000000
180 64635.00 1.000000000000000 53688.75000000000
```

## An example of using both aggregate functions

```sql
SELECT
DEPT_NO,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY SALARY) AS MEDIAN_CONT,
PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY SALARY) AS MEDIAN_DISC
FROM EMPLOYEE
GROUP BY DEPT_NO;
```

```
DEPT_NO MEDIAN_CONT MEDIAN_DISC
======= ======================= =====================
000 133321.5000000000 53793.00
100 77631.25000000000 44000.00
110 65221.40500000000 61637.81
115 6740000.000000000 6000000.00
120 33620.63000000000 33620.63
121 110000.0000000000 110000.00
123 38500.00000000000 38500.00
125 33000.00000000000 33000.00
130 94521.47000000000 86292.94
140 100914.0000000000 100914.00
180 53688.75000000000 42742.50
600 66450.00000000000 27000.00
621 71619.75000000000 62550.00
622 53167.50000000000 53167.50
623 60000.00000000000 60000.00
670 71268.75000000000 31275.00
671 81810.19000000000 81810.19
672 45647.50000000000 35000.00
900 92791.31500000000 69482.63
```
2 changes: 2 additions & 0 deletions src/common/ParserTokens.h
Original file line number Diff line number Diff line change
Expand Up @@ -373,6 +373,8 @@ PARSER_TOKEN(TOK_PARAMETER, "PARAMETER", false)
PARSER_TOKEN(TOK_PARTITION, "PARTITION", true)
PARSER_TOKEN(TOK_PASSWORD, "PASSWORD", true)
PARSER_TOKEN(TOK_PERCENT_RANK, "PERCENT_RANK", true)
PARSER_TOKEN(TOK_PERCENTILE_CONT, "PERCENTILE_CONT", true)
PARSER_TOKEN(TOK_PERCENTILE_DISC, "PERCENTILE_DISC", true)
PARSER_TOKEN(TOK_PI, "PI", true)
PARSER_TOKEN(TOK_PKCS_1_5, "PKCS_1_5", true)
PARSER_TOKEN(TOK_PLACING, "PLACING", true)
Expand Down
Loading
Loading