Commit bdd86bd

[YQ-3985]: M_R: update the documentation (#15846)
1 parent 2530f5d commit bdd86bd

6 files changed

+734
-6
lines changed

ydb/docs/en/core/yql/reference/syntax/select/index.md

Lines changed: 3 additions & 1 deletion
@@ -23,7 +23,8 @@ SELECT 2 + 2;
2323
The `SELECT` query result is calculated as follows:
2424

2525
* Determine the set of input tables by evaluating the [FROM](from.md) clauses.
26-
* Apply [SAMPLE](sample.md)/[TABLESAMPLE](sample.md) to input tables.
26+
* Apply [MATCH_RECOGNIZE](match_recognize.md) to input tables.
27+
* Evaluate [SAMPLE](sample.md)/[TABLESAMPLE](sample.md).
2728
* Execute [FLATTEN COLUMNS](../flatten.md#flatten-columns) or [FLATTEN BY](../flatten.md); aliases set in `FLATTEN BY` become visible after this point.
2829

2930
{% if feature_join %}
@@ -129,6 +130,7 @@ If the underlying queries have one of the `ORDER BY/LIMIT/DISCARD/INTO RESULT` o
129130
* [LIMIT OFFSET](limit_offset.md)
130131
* [SAMPLE](sample.md)
131132
* [TABLESAMPLE](sample.md)
133+
* [MATCH_RECOGNIZE](match_recognize.md)
132134

133135
{% if yt %}
134136

Lines changed: 363 additions & 0 deletions
@@ -0,0 +1,363 @@
1+
# MATCH_RECOGNIZE
2+
3+
The `MATCH_RECOGNIZE` expression searches a sequence of rows for a pattern and returns the matches it finds. Pattern recognition of this kind is a core tool of Complex Event Processing (CEP) and is used in areas such as fraud detection, financial pricing analysis, and sensor data processing. For a worked example, see [Example of usage](#example).
4+
5+
## Data processing algorithm
6+
7+
The `MATCH_RECOGNIZE` expression performs the following actions:
8+
9+
1. The input table is divided into non-overlapping groups. Each group consists of a set of rows from the input table with identical values in the columns listed after `PARTITION BY`.
10+
2. Each group is ordered according to the `ORDER BY` clause.
11+
3. The pattern from `PATTERN` is searched for independently in each ordered group.
12+
4. Pattern search in the sequence of rows is a step-by-step process: rows are checked one by one against the pattern. Among all matches starting at the earliest row, the one with the largest number of rows is selected. If no match starts at the earliest row, the search continues from the next row.
13+
5. After a match is found, the columns defined by expressions in the `MEASURES` block are calculated.
14+
6. Depending on the `ROWS PER MATCH` mode, one or all rows for the found match are output.
15+
7. The `AFTER MATCH SKIP` mode determines from which row the pattern recognition will resume.
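
The steps above can be modeled in a few lines of Python. This is only a toy sketch of the described semantics, not how YDB implements `MATCH_RECOGNIZE`. The helpers `longest_match`, `match_recognize`, and `last_ts` are hypothetical names. The pattern is restricted to a plain sequence of quantified variables written as `(variable, min, max)` tuples, only `ONE ROW PER MATCH` is modeled, and an empty match is assumed to count as no match. The sample rows are taken from the [usage example](#example).

```python
from itertools import groupby
from math import inf


def longest_match(rows, start, pattern, define):
    """Longest match beginning at rows[start], as [(variable, row), ...], or None."""
    def walk(pos, item, taken, labeled):
        if item == len(pattern):
            return labeled
        var, lo, hi = pattern[item]
        best = None
        if taken >= lo:  # enough repetitions: try moving on to the next pattern item
            best = walk(pos, item + 1, 0, labeled)
        predicate = define.get(var, lambda row: True)  # undefined variables always match
        if taken < hi and pos < len(rows) and predicate(rows[pos]):
            longer = walk(pos + 1, item, taken + 1, labeled + [(var, rows[pos])])
            if longer is not None and (best is None or len(longer) > len(best)):
                best = longer
        return best

    return walk(start, 0, 0, []) or None  # assumption: an empty match is no match


def match_recognize(table, partition_by, order_by, pattern, define, measures,
                    skip="PAST LAST ROW"):
    results = []
    part_key = lambda row: tuple(row[c] for c in partition_by)
    # Step 1: split the input into groups with equal PARTITION BY values.
    for key, group in groupby(sorted(table, key=part_key), key=part_key):
        rows = sorted(group, key=lambda row: row[order_by])  # step 2: ORDER BY
        start = 0
        while start < len(rows):  # steps 3-4: search from the earliest row onwards
            match = longest_match(rows, start, pattern, define)
            if match is None:
                start += 1
                continue
            out = dict(zip(partition_by, key))
            out.update({name: fn(match) for name, fn in measures.items()})  # step 5
            results.append(out)  # step 6: ONE ROW PER MATCH
            start += 1 if skip == "TO NEXT ROW" else len(match)  # step 7
    return results


def last_ts(var):
    """Model of LAST(var.ts): ts of the last row labeled with `var`."""
    return lambda match: [row["ts"] for v, row in match if v == var][-1]


table = [
    {"ts": 600, "button": 3, "device_id": 17, "zone_id": 3},
    {"ts": 500, "button": 3, "device_id": 4, "zone_id": 2},
    {"ts": 400, "button": 2, "device_id": 17, "zone_id": 3},
    {"ts": 300, "button": 2, "device_id": 4, "zone_id": 2},
    {"ts": 200, "button": 1, "device_id": 17, "zone_id": 3},
    {"ts": 100, "button": 1, "device_id": 4, "zone_id": 2},
]

result = match_recognize(
    table,
    partition_by=["device_id", "zone_id"],
    order_by="ts",
    pattern=[("B1", 1, 1), ("B2", 1, inf), ("B3", 1, 1)],  # PATTERN (B1 B2+ B3)
    define={
        "B1": lambda row: row["button"] == 1,
        "B2": lambda row: row["button"] == 2,
        "B3": lambda row: row["button"] == 3,
    },
    measures={"b1": last_ts("B1"), "b3": last_ts("B3")},
    skip="TO NEXT ROW",
)
print(result)
```

Under these assumptions the model yields one row per partition, with `b1`/`b3` equal to `100`/`500` for device `4` and `200`/`600` for device `17`.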
16+
17+
## Syntax {#syntax}
18+
19+
```sql
20+
MATCH_RECOGNIZE (
21+
[ PARTITION BY <partition_1> [ ... , <partition_N> ] ]
22+
[ ORDER BY <sort_key_1> [ ... , <sort_key_N> ] ]
23+
[ MEASURES <expression_1> AS <column_name_1> [ ... , <expression_N> AS <column_name_N> ] ]
24+
[ ROWS PER MATCH ]
25+
[ AFTER MATCH SKIP ]
26+
PATTERN (<search_pattern>)
27+
DEFINE <variable_1> AS <predicate_1> [ ... , <variable_N> AS <predicate_N> ]
28+
)
29+
```
30+
31+
Here is a brief description of the SQL syntax elements of the `MATCH_RECOGNIZE` expression:
32+
33+
* [`DEFINE`](#define): Block for declaring variables that describe the search pattern and the conditions that rows must meet for each variable.
34+
* [`PATTERN`](#pattern): [Regular expressions](https://en.wikipedia.org/wiki/Regular_expressions) describing the search pattern.
35+
* [`MEASURES`](#measures): Defines the list of columns for the returned data. Each column is specified by an SQL expression for its computation.
36+
* [`ROWS PER MATCH`](#rows_per_match): Determines the structure of the returned data and the number of rows for each match found.
37+
* [`AFTER MATCH SKIP`](#after_match_skip): Defines the method of moving to the point of the next match search.
38+
* [`ORDER BY`](#order_by): Determines sorting of input data. Pattern search is performed within the data sorted according to the list of columns or expressions listed in `<sort_key_1> [ ... , <sort_key_N> ]`.
39+
* [`PARTITION BY`](#partition_by): Divides the input table according to the specified rules in accordance with `<partition_1> [ ... , <partition_N> ]`. Pattern search is performed independently in each part.
40+
41+
### DEFINE {#define}
42+
43+
```sql
44+
DEFINE <variable_1> AS <predicate_1> [ ... , <variable_N> AS <predicate_N> ]
45+
```
46+
47+
`DEFINE` declares the variables used in the pattern described in [`PATTERN`](#pattern). Each variable is a named SQL expression evaluated over the input data. These expressions use the same syntax as a `WHERE` predicate. For example, the expression `button = 1` matches rows with the value `1` in the `button` column. You can use any SQL expression that is valid in a search condition, including the `LAST` and `FIRST` functions, for example `button > 2 AND zone_id < 12` or `LAST(button) > 10`.
48+
49+
In the example below, the SQL statement `A.button = 1` is declared as variable `A`.
50+
51+
```sql
52+
DEFINE
53+
A AS A.button = 1
54+
```
55+
56+
{% note info %}
57+
58+
`DEFINE` does not currently support aggregate functions (such as `AVG`, `MIN`, or `MAX`) or the `PREV` and `NEXT` functions.
59+
60+
{% endnote %}
61+
62+
For each input row, every expression declared in `DEFINE` is evaluated. If the expression for a variable evaluates to `TRUE`, the row is labeled with that variable's name and becomes a candidate for pattern matching.
63+
64+
#### **Example** {#define-example}
65+
66+
When defining variables in SQL expressions, you can reference other variables:
67+
68+
```sql
69+
DEFINE
70+
A AS A.button = 1 AND LAST(A.zone_id) = 12,
71+
B AS B.button = 2 AND FIRST(A.zone_id) = 12
72+
```
73+
74+
An input row is matched to variable `A` if its `button` column has the value `1` and the last row of the set of rows previously matched to `A` has the value `12` in the `zone_id` column. A row is matched to variable `B` if its `button` column has the value `2` and the first row of the set of rows previously matched to `A` has the value `12` in the `zone_id` column.
75+
76+
### PATTERN {#pattern}
77+
78+
```sql
79+
PATTERN (<search_pattern>)
80+
```
81+
82+
The `PATTERN` keyword describes the search pattern in terms of the variables declared in the `DEFINE` section. The `PATTERN` syntax is similar to that of [regular expressions](https://en.wikipedia.org/wiki/Regular_expressions).
83+
84+
{% note warning %}
85+
86+
If a variable used in the `PATTERN` section has not been previously described in the `DEFINE` section, it is assumed that it is always `TRUE`.
87+
88+
{% endnote %}
89+
90+
You can use [quantifiers](https://en.wikipedia.org/wiki/Regular_expression#Quantification) in `PATTERN`. In regular expressions, they determine the number of repetitions of an element or subsequence in the matched pattern. Here is the list of supported quantifiers:
91+
92+
|Quantifier|Description|
93+
|-|-|
94+
|`A+`|One or more occurrences of `A`|
95+
|`A*`|Zero or more occurrences of `A`|
96+
|`A?`|Zero or one occurrence of `A`|
97+
|`B{n}`|Exactly `n` occurrences of `B`|
98+
|`C{n, m}`|From `n` to `m` occurrences of `C`|
99+
|`D{n,}`|At least `n` occurrences of `D`|
100+
|`(A\|B)`|Occurrence of `A` or `B` in the data|
101+
|`(A\|B){,m}`|From zero to `m` occurrences of `A` or `B`|
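
One way to read the table is that every quantifier is shorthand for a minimum and a maximum number of repetitions. The Python sketch below makes that mapping explicit. The helper `quantifier_bounds` is a hypothetical illustration and not part of YQL, and it covers only the quantifier suffixes, not the `|` alternation.

```python
import re
from math import inf


def quantifier_bounds(quantifier):
    """Return (min, max) repetitions for a suffix such as '+', '{2}' or '{,3}'."""
    simple = {"": (1, 1), "+": (1, inf), "*": (0, inf), "?": (0, 1)}
    if quantifier in simple:
        return simple[quantifier]
    m = re.fullmatch(r"\{\s*(\d*)\s*(,?)\s*(\d*)\s*\}", quantifier)
    if m is None or not (m.group(1) or m.group(3)):
        raise ValueError(f"unsupported quantifier: {quantifier!r}")
    low, comma, high = m.groups()
    if not comma:  # {n}: exactly n occurrences
        return int(low), int(low)
    # {n, m}, {n,} and {,m}: a missing bound defaults to 0 or to "unbounded"
    return (int(low) if low else 0, int(high) if high else inf)


for q in ["+", "*", "?", "{3}", "{2, 5}", "{2,}", "{,4}"]:
    print(q, quantifier_bounds(q))
```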
102+
103+
Supported pattern search sequences:
104+
105+
|Supported sequences|Syntax|Description|
106+
|-|-|-|
107+
|Sequence|`A B+ C+ D+`|The system searches for exactly the specified sequence; other variables must not occur within it. The pattern is matched in the order in which the variables are listed.|
108+
|One of|`A \| B \| C`|Variables are listed in any order with a pipe \| between them. The search is performed for any variable from the specified list.|
109+
|Grouping|`(A \| B)+ \| C`|Variables inside round brackets are considered a single group. In this case, quantifiers apply to the entire group.|
110+
|Exclusion from result|`{- A B+ C -}`|Rows matched by the pattern between `{-` and `-}` are excluded from the result in [`ALL ROWS PER MATCH`](#rows_per_match) mode.|
111+
112+
#### **Example** {#pattern-example}
113+
114+
```sql
115+
PATTERN (B1 E* B2+ B3)
116+
DEFINE
117+
B1 AS B1.button = 1,
118+
B2 AS B2.button = 2,
119+
B3 AS B3.button = 3
120+
```
121+
122+
The `DEFINE` section describes the `B1`, `B2`, and `B3` variables but not `E`. An undefined variable is always `TRUE`, so `E` matches any event. The pattern therefore searches for one `button 1` click, one or more `button 2` clicks, and one `button 3` click, with any number of other events allowed between the `button 1` click and the first `button 2` click.
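
To see why `E` behaves like a wildcard, it can help to list, for each row, which pattern variables the row could stand for. The Python sketch below is a hypothetical illustration only: the `events` sample and the `candidates` helper are made up, and the predicates mirror the `DEFINE` block above.

```python
define = {
    "B1": lambda row: row["button"] == 1,
    "B2": lambda row: row["button"] == 2,
    "B3": lambda row: row["button"] == 3,
}
pattern_variables = ["B1", "E", "B2", "B3"]  # variables used in PATTERN (B1 E* B2+ B3)


def candidates(row):
    """Variables whose predicate holds for the row; undefined ones always hold."""
    always_true = lambda row: True
    return [v for v in pattern_variables if define.get(v, always_true)(row)]


events = [{"button": 1}, {"button": 7}, {"button": 2}, {"button": 3}]
for event in events:
    print(event["button"], candidates(event))
```

Every row lists `E` among its candidates, including the `button 7` row that no defined variable accepts.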
123+
124+
### MEASURES {#measures}
125+
126+
```sql
127+
MEASURES <expression_1> AS <column_name_1> [ ... , <expression_N> AS <column_name_N> ]
128+
```
129+
130+
`MEASURES` describes the set of columns returned when a pattern is found. Each returned column is defined by an SQL expression that can use aggregate functions over the variables declared in the [`DEFINE`](#define) section.
131+
132+
#### **Example** {#measures-example}
133+
134+
The input data for the example are:
135+
136+
|ts|button|device_id|zone_id|
137+
|:-:|:-:|:-:|:-:|
138+
|100|1|3|0|
139+
|200|1|3|1|
140+
|300|2|2|0|
141+
|400|3|1|1|
142+
143+
```sql
144+
MEASURES
145+
AGGREGATE_LIST(B1.zone_id * 10 + B1.device_id) AS ids,
146+
COUNT(DISTINCT B1.zone_id) AS count_zones,
147+
LAST(B3.ts) - FIRST(B1.ts) AS time_diff,
148+
42 AS meaning_of_life
149+
PATTERN (B1+ B2 B3)
150+
DEFINE
151+
B1 AS B1.button = 1,
152+
B2 AS B2.button = 2,
153+
B3 AS B3.button = 3
154+
```
155+
156+
Result:
157+
158+
|ids|count_zones|time_diff|meaning_of_life|
159+
|:-:|:-:|:-:|:-:|
160+
|[3,13]|2|300|42|
161+
162+
The `ids` column contains the list of `zone_id * 10 + device_id` values computed over the rows matched to variable `B1`. The `count_zones` column contains the number of distinct `zone_id` values among the rows matched to `B1`. The `time_diff` column contains the difference between `ts` in the last row matched to `B3` and `ts` in the first row matched to `B1`. The `meaning_of_life` column contains the constant `42`. Thus, an expression in `MEASURES` may combine aggregate functions over several variables, but each individual aggregate function must reference only one variable.
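
The arithmetic behind this result row can be replayed in plain Python. The sketch assumes the match has already been found and that rows are grouped by the variable they were matched to. The `matched` dictionary is a hypothetical representation, not a YQL structure.

```python
# Rows of the example input, grouped by the pattern variable they matched.
matched = {
    "B1": [
        {"ts": 100, "button": 1, "device_id": 3, "zone_id": 0},
        {"ts": 200, "button": 1, "device_id": 3, "zone_id": 1},
    ],
    "B2": [{"ts": 300, "button": 2, "device_id": 2, "zone_id": 0}],
    "B3": [{"ts": 400, "button": 3, "device_id": 1, "zone_id": 1}],
}

measures = {
    # AGGREGATE_LIST(B1.zone_id * 10 + B1.device_id)
    "ids": [r["zone_id"] * 10 + r["device_id"] for r in matched["B1"]],
    # COUNT(DISTINCT B1.zone_id)
    "count_zones": len({r["zone_id"] for r in matched["B1"]}),
    # LAST(B3.ts) - FIRST(B1.ts)
    "time_diff": matched["B3"][-1]["ts"] - matched["B1"][0]["ts"],
    # constant expression
    "meaning_of_life": 42,
}
print(measures)
```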
163+
164+
### ROWS PER MATCH {#rows_per_match}
165+
166+
`ROWS PER MATCH` determines the number of result rows for each match found, as well as the number of columns returned. The default mode is `ONE ROW PER MATCH`.
167+
168+
`ONE ROW PER MATCH` sets the `ROWS PER MATCH` mode to output one row for the match found. The structure of the returned data corresponds to the columns listed in [`PARTITION BY`](#partition_by) and [`MEASURES`](#measures).
169+
170+
`ALL ROWS PER MATCH` sets the `ROWS PER MATCH` mode to output all rows of the match found, except those explicitly excluded with `{- ... -}`. In addition to the columns of the source table, the returned data includes the columns listed in [`MEASURES`](#measures).
171+
172+
#### **Examples** {#rows_per_match-examples}
173+
174+
The input data for all examples are:
175+
176+
|ts|button|
177+
|:-:|:-:|
178+
|100|1|
179+
|200|2|
180+
|300|3|
181+
182+
##### **Example 1** {#rows_per_match-example1}
183+
184+
```sql
185+
MEASURES
186+
FIRST(B1.ts) AS first_ts,
187+
FIRST(B2.ts) AS mid_ts,
188+
LAST(B3.ts) AS last_ts
189+
ONE ROW PER MATCH
190+
PATTERN (B1 {- B2 -} B3)
191+
DEFINE
192+
B1 AS B1.button = 1,
193+
B2 AS B2.button = 2,
194+
B3 AS B3.button = 3
195+
```
196+
197+
Result:
198+
199+
|first_ts|mid_ts|last_ts|
200+
|:-:|:-:|:-:|
201+
|100|200|300|
202+
203+
##### **Example 2** {#rows_per_match-example2}
204+
205+
```sql
206+
MEASURES
207+
FIRST(B1.ts) AS first_ts,
208+
FIRST(B2.ts) AS mid_ts,
209+
LAST(B3.ts) AS last_ts
210+
ALL ROWS PER MATCH
211+
PATTERN (B1 {- B2 -} B3)
212+
DEFINE
213+
B1 AS B1.button = 1,
214+
B2 AS B2.button = 2,
215+
B3 AS B3.button = 3
216+
```
217+
218+
Result:
219+
220+
|first_ts|mid_ts|last_ts|button|ts|
221+
|:-:|:-:|:-:|:-:|:-:|
222+
|100|200|300|1|100|
223+
|100|200|300|3|300|
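
The difference between the two modes can be sketched in Python as follows. This is a hypothetical model that assumes the match `B1 {- B2 -} B3` has already been found. Each matched row is stored with its variable name and a flag telling whether it sits inside `{- -}`.

```python
# (variable, row, excluded) for the single match found in the example input.
match = [
    ("B1", {"ts": 100, "button": 1}, False),
    ("B2", {"ts": 200, "button": 2}, True),  # wrapped in {- -} in the pattern
    ("B3", {"ts": 300, "button": 3}, False),
]


def ts_of(var):
    """Timestamps of the rows matched to `var`, in match order."""
    return [row["ts"] for v, row, _ in match if v == var]


# Excluded rows still take part in MEASURES; they are only hidden from the output.
measures = {
    "first_ts": ts_of("B1")[0],
    "mid_ts": ts_of("B2")[0],
    "last_ts": ts_of("B3")[-1],
}

one_row_per_match = [measures]
all_rows_per_match = [
    {**measures, **row} for _, row, excluded in match if not excluded
]

print(one_row_per_match)
print(all_rows_per_match)
```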
224+
225+
### AFTER MATCH SKIP {#after_match_skip}
226+
227+
`AFTER MATCH SKIP` determines the method of transitioning from the found match to searching for the next one. In the `AFTER MATCH SKIP TO NEXT ROW` mode, the search for the next match starts after the first row of the previous one, while in the `AFTER MATCH SKIP PAST LAST ROW` mode it starts after the last row of the previous match. The default mode is `PAST LAST ROW`.
228+
229+
#### Examples {#after_match_skip-examples}
230+
231+
The input data for all examples are:
232+
233+
|ts|button|
234+
|:-:|:-:|
235+
|100|1|
236+
|200|1|
237+
|300|2|
238+
|400|3|
239+
240+
##### **Example 1** {#after_match_skip-example1}
241+
242+
```sql
243+
MEASURES
244+
FIRST(B1.ts) AS first_ts,
245+
LAST(B3.ts) AS last_ts
246+
AFTER MATCH SKIP TO NEXT ROW
247+
PATTERN (B1+ B2 B3)
248+
DEFINE
249+
B1 AS B1.button = 1,
250+
B2 AS B2.button = 2,
251+
B3 AS B3.button = 3
252+
```
253+
254+
Result:
255+
256+
|first_ts|last_ts|
257+
|:-:|:-:|
258+
|100|400|
259+
|200|400|
260+
261+
##### **Example 2** {#after_match_skip-example2}
262+
263+
```sql
264+
MEASURES
265+
FIRST(B1.ts) AS first_ts,
266+
LAST(B3.ts) AS last_ts
267+
AFTER MATCH SKIP PAST LAST ROW
268+
PATTERN (B1+ B2 B3)
269+
DEFINE
270+
B1 AS B1.button = 1,
271+
B2 AS B2.button = 2,
272+
B3 AS B3.button = 3
273+
```
274+
275+
Result:
276+
277+
|first_ts|last_ts|
278+
|:-:|:-:|
279+
|100|400|
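
The two skip modes can be compared with a small Python model built on the standard `re` module. It is a sketch under a simplifying assumption: every row satisfies exactly one variable, so each row can be encoded as a single character (its `button` value) and `PATTERN (B1+ B2 B3)` becomes the regular expression `1+23`. The helper `find_matches` is a hypothetical name.

```python
import re

rows = [(100, 1), (200, 1), (300, 2), (400, 3)]  # (ts, button)
labels = "".join(str(button) for _, button in rows)  # "1123"
pattern = re.compile(r"1+23")  # stands in for PATTERN (B1+ B2 B3)


def find_matches(skip):
    """Return (first_ts, last_ts) for each match under the given skip mode."""
    found, start = [], 0
    while start < len(rows):
        m = pattern.match(labels, start)  # match must begin exactly at `start`
        if m is None:
            start += 1
            continue
        found.append((rows[m.start()][0], rows[m.end() - 1][0]))
        # TO NEXT ROW resumes after the first matched row,
        # PAST LAST ROW resumes after the last matched row.
        start = m.start() + 1 if skip == "TO NEXT ROW" else m.end()
    return found


print(find_matches("TO NEXT ROW"))
print(find_matches("PAST LAST ROW"))
```

With `TO NEXT ROW` the second search starts at the row with `ts = 200` and finds an overlapping match. With `PAST LAST ROW` no rows remain after the first match.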
280+
281+
### ORDER BY {#order_by}
282+
283+
```sql
284+
ORDER BY <sort_key_1> [ ... , <sort_key_N> ]
285+
286+
<sort_key> ::= { <column_names> | <expression> }
287+
```
288+
289+
`ORDER BY` determines sorting of the input data. That is, before all pattern search operations are performed, the data will be pre-sorted according to the specified keys or expressions. The syntax is similar to the `ORDER BY` SQL expression.
290+
291+
#### **Example** {#order_by-example}
292+
293+
```sql
294+
ORDER BY CAST(ts AS Timestamp)
295+
```
296+
297+
### PARTITION BY {#partition_by}
298+
299+
```sql
300+
PARTITION BY <partition_1> [ ... , <partition_N> ]
301+
302+
<partition> ::= { <column_names> | <expression> }
303+
```
304+
305+
`PARTITION BY` partitions the source data into multiple non-overlapping groups, each used for an independent pattern search. If the expression is not specified, all data is processed as a single group. Records with the same values of the columns listed after `PARTITION BY` fall into the same group.
306+
307+
#### **Example** {#partition_by-example}
308+
309+
```sql
310+
PARTITION BY device_id, zone_id
311+
```
312+
313+
## Limitations {#limitations}
314+
315+
Our support for the `MATCH_RECOGNIZE` expression will eventually comply with [SQL-2016](https://ru.wikipedia.org/wiki/SQL:2016); currently, however, the following limitations apply:
316+
317+
- [`MEASURES`](#measures). Functions `PREV`/`NEXT` are not supported.
318+
- [`AFTER MATCH SKIP`](#after_match_skip). Only the `AFTER MATCH SKIP TO NEXT ROW` and `AFTER MATCH SKIP PAST LAST ROW` modes are supported.
319+
- [`PATTERN`](#pattern). Union pattern variables are not implemented.
320+
- [`DEFINE`](#define). Aggregation functions are not supported.
321+
322+
## Example of usage {#example}
323+
324+
Here is a hands-on example of pattern recognition in a data table produced by an IoT device whose button presses trigger events. Suppose you need to find and process the following sequence of button clicks: `button 1`, `button 2`, and `button 3`.
325+
326+
The input data is as follows:
327+
328+
|ts|button|device_id|zone_id|
329+
|:-:|:-:|:-:|:-:|
330+
|600|3|17|3|
331+
|500|3|4|2|
332+
|400|2|17|3|
333+
|300|2|4|2|
334+
|200|1|17|3|
335+
|100|1|4|2|
336+
337+
The body of the SQL query looks like this:
338+
339+
```sql
340+
PRAGMA FeatureR010="prototype"; -- pragma for enabling MATCH_RECOGNIZE
341+
342+
SELECT * FROM input MATCH_RECOGNIZE ( -- Run pattern matching over the input table
343+
PARTITION BY device_id, zone_id -- Split the input into groups by the device_id and zone_id columns
344+
ORDER BY ts -- Process events in ascending order of the ts column
345+
MEASURES
346+
LAST(B1.ts) AS b1, -- Return the latest timestamp of a button 1 click
347+
LAST(B3.ts) AS b3 -- Return the latest timestamp of a button 3 click
348+
ONE ROW PER MATCH -- Return one result row per match
349+
AFTER MATCH SKIP TO NEXT ROW -- Resume the search from the row after the first row of the match
350+
PATTERN (B1 B2+ B3) -- Search for one button 1 click, one or more button 2 clicks, and one button 3 click
351+
DEFINE
352+
B1 AS B1.button = 1, -- B1 is a button 1 click (the button field equals 1)
353+
B2 AS B2.button = 2, -- B2 is a button 2 click (the button field equals 2)
354+
B3 AS B3.button = 3 -- B3 is a button 3 click (the button field equals 3)
355+
);
356+
```
357+
358+
Result:
359+
360+
|b1|b3|device_id|zone_id|
361+
|:-:|:-:|:-:|:-:|
362+
|100|500|4|2|
363+
|200|600|17|3|

ydb/docs/en/core/yql/reference/syntax/select/toc_i.yaml

Lines changed: 1 addition & 0 deletions
@@ -21,3 +21,4 @@ items:
2121
- { name: LIMIT OFFSET, href: limit_offset.md }
2222
- { name: SAMPLE, href: sample.md }
2323
- { name: TABLESAMPLE, href: sample.md }
24+
- { name: MATCH_RECOGNIZE, href: match_recognize.md }
