|
| 1 | +# MATCH_RECOGNIZE |
| 2 | + |
| 3 | +The `MATCH_RECOGNIZE` expression recognizes patterns in a sequence of rows and returns the matches found. This functionality is important in business areas such as fraud detection, pricing analysis in finance, and sensor data processing. This area is known as Complex Event Processing (CEP), and pattern recognition is one of its key tools. For an end-to-end query, see [Example of usage](#example). |
| 4 | + |
| 5 | +## Data processing algorithm |
| 6 | + |
| 7 | +The `MATCH_RECOGNIZE` expression performs the following actions: |
| 8 | + |
| 9 | +1. The input table is divided into non-overlapping groups. Each group consists of a set of rows from the input table with identical values in the columns listed after `PARTITION BY`. |
| 10 | +2. Each group is ordered according to the `ORDER BY` clause. |
| 11 | +3. The pattern specified in `PATTERN` is recognized independently in each ordered group. |
| 12 | +4. Pattern search in the sequence of rows is a step-by-step process: rows are checked one by one to see whether they fit the pattern. Among all matches starting at the earliest row, the one with the largest number of rows is selected. If no match starts at the earliest row, the search continues from the next row. |
| 13 | +5. After a match is found, the columns defined by expressions in the `MEASURES` block are calculated. |
| 14 | +6. Depending on the `ROWS PER MATCH` mode, one or all rows for the found match are output. |
| 15 | +7. The `AFTER MATCH SKIP` mode determines from which row the pattern recognition will resume. |
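| | + |
| | +As a rough illustration of how the steps above map onto the clauses, here is a minimal annotated sketch. The table name `events` and its `ts`, `button`, and `device_id` columns are assumptions made for this illustration only, and the pragma is the one used in the [Example of usage](#example) section. |
| | + |
| | +```sql |
| | +PRAGMA FeatureR010="prototype"; -- enables MATCH_RECOGNIZE, as in the usage example |
| | + |
| | +SELECT * FROM events MATCH_RECOGNIZE ( -- `events` is a hypothetical table |
| | + PARTITION BY device_id -- step 1: split the rows into groups |
| | + ORDER BY ts -- step 2: order each group |
| | + MEASURES |
| | + FIRST(B1.ts) AS first_ts, -- step 5: computed for each match |
| | + LAST(B2.ts) AS last_ts |
| | + ONE ROW PER MATCH -- step 6: one output row per match |
| | + AFTER MATCH SKIP PAST LAST ROW -- step 7: resume after the last matched row |
| | + PATTERN (B1 B2+) -- steps 3-4: searched independently in each group |
| | + DEFINE |
| | + B1 AS B1.button = 1, |
| | + B2 AS B2.button = 2 |
| | +); |
| | +``` |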
| 16 | + |
| 17 | +## Syntax {#syntax} |
| 18 | + |
| 19 | +```sql |
| 20 | +MATCH_RECOGNIZE ( |
| 21 | + [ PARTITION BY <partition_1> [ ... , <partition_N> ] ] |
| 22 | + [ ORDER BY <sort_key_1> [ ... , <sort_key_N> ] ] |
| 23 | + [ MEASURES <expression_1> AS <column_name_1> [ ... , <expression_N> AS <column_name_N> ] ] |
| 24 | + [ ROWS PER MATCH ] |
| 25 | + [ AFTER MATCH SKIP ] |
| 26 | + PATTERN (<search_pattern>) |
| 27 | + DEFINE <variable_1> AS <predicate_1> [ ... , <variable_N> AS <predicate_N> ] |
| 28 | +) |
| 29 | +``` |
| 30 | + |
| 31 | +Here is a brief description of the SQL syntax elements of the `MATCH_RECOGNIZE` expression: |
| 32 | + |
| 33 | +* [`DEFINE`](#define): Block for declaring variables that describe the search pattern and the conditions that rows must meet for each variable. |
| 34 | +* [`PATTERN`](#pattern): [Regular expressions](https://en.wikipedia.org/wiki/Regular_expressions) describing the search pattern. |
| 35 | +* [`MEASURES`](#measures): Defines the list of columns for the returned data. Each column is specified by an SQL expression for its computation. |
| 36 | +* [`ROWS PER MATCH`](#rows_per_match): Determines the structure of the returned data and the number of rows for each match found. |
| 37 | +* [`AFTER MATCH SKIP`](#after_match_skip): Defines the method of moving to the point of the next match search. |
| 38 | +* [`ORDER BY`](#order_by): Determines how the input data is sorted. Pattern search is performed on the data sorted by the columns or expressions listed in `<sort_key_1> [ ... , <sort_key_N> ]`. |
| 39 | +* [`PARTITION BY`](#partition_by): Divides the input table into groups by the values of `<partition_1> [ ... , <partition_N> ]`. Pattern search is performed independently in each group. |
| 40 | + |
| 41 | +### DEFINE {#define} |
| 42 | + |
| 43 | +```sql |
| 44 | +DEFINE <variable_1> AS <predicate_1> [ ... , <variable_N> AS <predicate_N> ] |
| 45 | +``` |
| 46 | + |
| 47 | +`DEFINE` declares the variables used in the pattern described in [`PATTERN`](#pattern). Each variable is a named SQL expression evaluated over the input data. The syntax of these expressions is the same as that of a `WHERE` predicate. For example, the `button = 1` expression matches rows with the value `1` in the `button` column. You can use any SQL expression that is valid in a search condition, including the `LAST` and `FIRST` functions, for example, `button > 2 AND zone_id < 12` or `LAST(button) > 10`. |
| 48 | + |
| 49 | +In the example below, the SQL statement `A.button = 1` is declared as variable `A`. |
| 50 | + |
| 51 | +```sql |
| 52 | +DEFINE |
| 53 | + A AS A.button = 1 |
| 54 | +``` |
| 55 | + |
| 56 | +{% note info %} |
| 57 | + |
| 58 | +`DEFINE` does not currently support aggregation functions (e.g., `AVG`, `MIN`, or `MAX`) or the `PREV` and `NEXT` functions. |
| 59 | + |
| 60 | +{% endnote %} |
| 61 | + |
| 62 | +When each row of data is processed, all expressions describing variables in `DEFINE` are evaluated. If the expression for a variable evaluates to `TRUE`, the row is labeled with that variable's name and added to the list of rows subject to pattern matching. |
| 63 | + |
| 64 | +#### **Example** {#define-example} |
| 65 | + |
| 66 | +When defining variables in SQL expressions, you can reference other variables: |
| 67 | + |
| 68 | +```sql |
| 69 | +DEFINE |
| 70 | + A AS A.button = 1 AND LAST(A.zone_id) = 12, |
| 71 | + B AS B.button = 2 AND FIRST(A.zone_id) = 12 |
| 72 | +``` |
| 73 | + |
| 74 | +An input row is matched to variable `A` if its `button` column has the value `1` and the last row among those previously matched to `A` has the value `12` in its `zone_id` column. A row is matched to variable `B` if its `button` column has the value `2` and the first row among those previously matched to `A` has the value `12` in its `zone_id` column. |
| 75 | + |
| 76 | +### PATTERN {#pattern} |
| 77 | + |
| 78 | +```sql |
| 79 | +PATTERN (<search_pattern>) |
| 80 | +``` |
| 81 | + |
| 82 | +The `PATTERN` keyword describes the search pattern in terms of the variables declared in the `DEFINE` section. The `PATTERN` syntax is similar to that of [regular expressions](https://en.wikipedia.org/wiki/Regular_expressions). |
| 83 | + |
| 84 | +{% note warning %} |
| 85 | + |
| 86 | +If a variable used in the `PATTERN` section has not been previously described in the `DEFINE` section, it is assumed that it is always `TRUE`. |
| 87 | + |
| 88 | +{% endnote %} |
| 89 | + |
| 90 | +You can use [quantifiers](https://en.wikipedia.org/wiki/Regular_expression#Quantification) in `PATTERN`. In regular expressions, they determine the number of repetitions of an element or subsequence in the matched pattern. Here is the list of supported quantifiers: |
| 91 | + |
| 92 | +|Quantifier|Description| |
| 93 | +|-|-| |
| 94 | +|`A+`|One or more occurrences of `A`| |
| 95 | +|`A*`|Zero or more occurrences of `A`| |
| 96 | +|`A?`|Zero or one occurrence of `A`| |
| 97 | +|`B{n}`|Exactly `n` occurrences of `B`| |
| 98 | +|`C{n, m}`|From `n` to `m` occurrences of `C`| |
| 99 | +|`D{n,}`|At least `n` occurrences of `D`| |
| 100 | +|`(A\|B)`|Occurrence of `A` or `B` in the data| |
| 101 | +|`(A\|B){,m}`|From zero to `m` occurrences of `A` or `B`| |
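| | + |
| | +For illustration, the fragment below combines several of these quantifiers in one pattern. It is a sketch rather than a tested query; the `B1`, `B2`, and `B3` variables and the `button` column are borrowed from the other examples in this document. |
| | + |
| | +```sql |
| | +PATTERN (B1 B2{2,} B3?) -- one B1 row, at least two B2 rows, then an optional B3 row |
| | +DEFINE |
| | + B1 AS B1.button = 1, |
| | + B2 AS B2.button = 2, |
| | + B3 AS B3.button = 3 |
| | +``` |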
| 102 | + |
| 103 | +Supported pattern search sequences: |
| 104 | + |
| 105 | +|Supported sequences|Syntax|Description| |
| 106 | +|-|-|-| |
| 107 | +|Sequence|`A B+ C+ D+`|The system searches for exactly the specified sequence; other variables may not occur within it. Variables are matched in the order in which they appear in the pattern.| |
| 108 | +|One of|`A \| B \| C`|Variables are listed in any order with a pipe \| between them. The search is performed for any variable from the specified list.| |
| 109 | +|Grouping|`(A \| B)+ \| C`|Variables inside round brackets are considered a single group. In this case, quantifiers apply to the entire group.| |
| 110 | +|Exclusion from result|`{- A B+ C -}`|Rows matched by the part of the pattern enclosed in `{-` and `-}` are excluded from the result in [`ALL ROWS PER MATCH`](#rows_per_match) mode.| |
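| | + |
| | +As a hedged sketch of how these forms can be combined, the fragment below places a quantified group of alternatives inside a sequence. It is assembled only from the constructs listed in the tables above and has not been run; the `B4` variable is hypothetical and introduced for this illustration. |
| | + |
| | +```sql |
| | +PATTERN (B1 (B2 | B3)+ B4) -- a B1 row, one or more rows that are each B2 or B3, then a B4 row |
| | +DEFINE |
| | + B1 AS B1.button = 1, |
| | + B2 AS B2.button = 2, |
| | + B3 AS B3.button = 3, |
| | + B4 AS B4.button = 4 -- hypothetical fourth button |
| | +``` |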
| 111 | + |
| 112 | +#### **Example** {#pattern-example} |
| 113 | + |
| 114 | +```sql |
| 115 | +PATTERN (B1 E* B2+ B3) |
| 116 | +DEFINE |
| 117 | + B1 AS B1.button = 1, |
| 118 | + B2 AS B2.button = 2, |
| 119 | + B3 AS B3.button = 3 |
| 120 | +``` |
| 121 | + |
| 122 | +The `DEFINE` section describes the `B1`, `B2`, and `B3` variables but not `E`. Because an undefined variable is always `TRUE`, `E` matches any event, so the pattern being searched for is: one `button 1` click, one or more `button 2` clicks, and one `button 3` click, with any number of other events allowed between the `button 1` click and the first `button 2` click. |
| 123 | + |
| 124 | +### MEASURES {#measures} |
| 125 | + |
| 126 | +```sql |
| 127 | +MEASURES <expression_1> AS <column_name_1> [ ... , <expression_N> AS <column_name_N> ] |
| 128 | +``` |
| 129 | + |
| 130 | +`MEASURES` describes the set of columns returned when a match is found. Each returned column is defined by an SQL expression, which can use aggregate functions over the variables declared in the [`DEFINE`](#define) section. |
| 131 | + |
| 132 | +#### **Example** {#measures-example} |
| 133 | + |
| 134 | +The input data for the example are: |
| 135 | + |
| 136 | +|ts|button|device_id|zone_id| |
| 137 | +|:-:|:-:|:-:|:-:| |
| 138 | +|100|1|3|0| |
| 139 | +|200|1|3|1| |
| 140 | +|300|2|2|0| |
| 141 | +|400|3|1|1| |
| 142 | + |
| 143 | +```sql |
| 144 | +MEASURES |
| 145 | + AGGREGATE_LIST(B1.zone_id * 10 + B1.device_id) AS ids, |
| 146 | + COUNT(DISTINCT B1.zone_id) AS count_zones, |
| 147 | + LAST(B3.ts) - FIRST(B1.ts) AS time_diff, |
| 148 | + 42 AS meaning_of_life |
| 149 | +PATTERN (B1+ B2 B3) |
| 150 | +DEFINE |
| 151 | + B1 AS B1.button = 1, |
| 152 | + B2 AS B2.button = 2, |
| 153 | + B3 AS B3.button = 3 |
| 154 | +``` |
| 155 | + |
| 156 | +Result: |
| 157 | + |
| 158 | +|ids|count_zones|time_diff|meaning_of_life| |
| 159 | +|:-:|:-:|:-:|:-:| |
| 160 | +|[3,13]|2|300|42| |
| 161 | + |
| 162 | +The `ids` column contains the list of `zone_id * 10 + device_id` values computed over the rows matched with the `B1` variable. The `count_zones` column contains the number of unique values of the `zone_id` column among the rows matched with the `B1` variable. The `time_diff` column contains the difference between the value of `ts` in the last row matched with variable `B3` and the value of `ts` in the first row matched with variable `B1`. The `meaning_of_life` column contains the constant `42`. Thus, an expression in `MEASURES` may contain aggregate functions over multiple variables, but each individual aggregate function may reference only one variable. |
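| | + |
| | +The one-variable-per-function rule can be illustrated with the sketch below. The first measure is taken from the example above; the commented-out measure is a hypothetical counterexample (the `bad_diff` name is made up) that mixes two variables inside a single function and would therefore not be accepted. |
| | + |
| | +```sql |
| | +MEASURES |
| | + LAST(B3.ts) - FIRST(B1.ts) AS time_diff -- allowed: each function references a single variable |
| | + -- LAST(B3.ts - B1.ts) AS bad_diff -- not allowed: B1 and B3 inside one function |
| | +PATTERN (B1+ B2 B3) |
| | +DEFINE |
| | + B1 AS B1.button = 1, |
| | + B2 AS B2.button = 2, |
| | + B3 AS B3.button = 3 |
| | +``` |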
| 163 | + |
| 164 | +### ROWS PER MATCH {#rows_per_match} |
| 165 | + |
| 166 | +`ROWS PER MATCH` determines the number of result rows for each match found, as well as the number of columns returned. The default mode is `ONE ROW PER MATCH`. |
| 167 | + |
| 168 | +`ONE ROW PER MATCH` sets the `ROWS PER MATCH` mode to output one row for the match found. The structure of the returned data corresponds to the columns listed in [`PARTITION BY`](#partition_by) and [`MEASURES`](#measures). |
| 169 | + |
| 170 | +`ALL ROWS PER MATCH` sets the `ROWS PER MATCH` mode to output all rows of the match found, except those explicitly excluded with `{- ... -}`. In addition to the columns of the source table, the returned data includes the columns listed in [`MEASURES`](#measures). |
| 171 | + |
| 172 | +#### **Examples** {#rows_per_match-examples} |
| 173 | + |
| 174 | +The input data for all examples are: |
| 175 | + |
| 176 | +|ts|button| |
| 177 | +|:-:|:-:| |
| 178 | +|100|1| |
| 179 | +|200|2| |
| 180 | +|300|3| |
| 181 | + |
| 182 | +##### **Example 1** {#rows_per_match-example1} |
| 183 | + |
| 184 | +```sql |
| 185 | +MEASURES |
| 186 | + FIRST(B1.ts) AS first_ts, |
| 187 | + FIRST(B2.ts) AS mid_ts, |
| 188 | + LAST(B3.ts) AS last_ts |
| 189 | +ONE ROW PER MATCH |
| 190 | +PATTERN (B1 {- B2 -} B3) |
| 191 | +DEFINE |
| 192 | + B1 AS B1.button = 1, |
| 193 | + B2 AS B2.button = 2, |
| 194 | + B3 AS B3.button = 3 |
| 195 | +``` |
| 196 | + |
| 197 | +Result: |
| 198 | + |
| 199 | +|first_ts|mid_ts|last_ts| |
| 200 | +|:-:|:-:|:-:| |
| 201 | +|100|200|300| |
| 202 | + |
| 203 | +##### **Example 2** {#rows_per_match-example2} |
| 204 | + |
| 205 | +```sql |
| 206 | +MEASURES |
| 207 | + FIRST(B1.ts) AS first_ts, |
| 208 | + FIRST(B2.ts) AS mid_ts, |
| 209 | + LAST(B3.ts) AS last_ts |
| 210 | +ALL ROWS PER MATCH |
| 211 | +PATTERN (B1 {- B2 -} B3) |
| 212 | +DEFINE |
| 213 | + B1 AS B1.button = 1, |
| 214 | + B2 AS B2.button = 2, |
| 215 | + B3 AS B3.button = 3 |
| 216 | +``` |
| 217 | + |
| 218 | +Result: |
| 219 | + |
| 220 | +|first_ts|mid_ts|last_ts|button|ts| |
| 221 | +|:-:|:-:|:-:|:-:|:-:| |
| 222 | +|100|200|300|1|100| |
| 223 | +|100|200|300|3|300| |
| 224 | + |
| 225 | +### AFTER MATCH SKIP {#after_match_skip} |
| 226 | + |
| 227 | +`AFTER MATCH SKIP` determines where the search for the next match starts once a match has been found. In the `AFTER MATCH SKIP TO NEXT ROW` mode, the search resumes from the row that follows the first row of the previous match, while in the `AFTER MATCH SKIP PAST LAST ROW` mode it resumes from the row that follows the last row of the previous match. The default mode is `AFTER MATCH SKIP PAST LAST ROW`. |
| 228 | + |
| 229 | +#### **Examples** {#after_match_skip-examples} |
| 230 | + |
| 231 | +The input data for all examples are: |
| 232 | + |
| 233 | +|ts|button| |
| 234 | +|:-:|:-:| |
| 235 | +|100|1| |
| 236 | +|200|1| |
| 237 | +|300|2| |
| 238 | +|400|3| |
| 239 | + |
| 240 | +##### **Example 1** {#after_match_skip-example1} |
| 241 | + |
| 242 | +```sql |
| 243 | +MEASURES |
| 244 | + FIRST(B1.ts) AS first_ts, |
| 245 | + LAST(B3.ts) AS last_ts |
| 246 | +AFTER MATCH SKIP TO NEXT ROW |
| 247 | +PATTERN (B1+ B2 B3) |
| 248 | +DEFINE |
| 249 | + B1 AS B1.button = 1, |
| 250 | + B2 AS B2.button = 2, |
| 251 | + B3 AS B3.button = 3 |
| 252 | +``` |
| 253 | + |
| 254 | +Result: |
| 255 | + |
| 256 | +|first_ts|last_ts| |
| 257 | +|:-:|:-:| |
| 258 | +|100|400| |
| 259 | +|200|400| |
| 260 | + |
| 261 | +##### **Example 2** {#after_match_skip-example2} |
| 262 | + |
| 263 | +```sql |
| 264 | +MEASURES |
| 265 | + FIRST(B1.ts) AS first_ts, |
| 266 | + LAST(B3.ts) AS last_ts |
| 267 | +AFTER MATCH SKIP PAST LAST ROW |
| 268 | +PATTERN (B1+ B2 B3) |
| 269 | +DEFINE |
| 270 | + B1 AS B1.button = 1, |
| 271 | + B2 AS B2.button = 2, |
| 272 | + B3 AS B3.button = 3 |
| 273 | +``` |
| 274 | + |
| 275 | +Result: |
| 276 | + |
| 277 | +|first_ts|last_ts| |
| 278 | +|:-:|:-:| |
| 279 | +|100|400| |
| 280 | + |
| 281 | +### ORDER BY {#order_by} |
| 282 | + |
| 283 | +```sql |
| 284 | +ORDER BY <sort_key_1> [ ... , <sort_key_N> ] |
| 285 | + |
| 286 | +<sort_key> ::= { <column_names> | <expression> } |
| 287 | +``` |
| 288 | + |
| 289 | +`ORDER BY` determines how the input data is sorted: before any pattern search is performed, the data is sorted by the specified keys or expressions. The syntax is similar to that of the SQL `ORDER BY` clause. |
| 290 | + |
| 291 | +#### **Example** {#order_by-example} |
| 292 | + |
| 293 | +```sql |
| 294 | +ORDER BY CAST(ts AS Timestamp) |
| 295 | +``` |
| 296 | + |
| 297 | +### PARTITION BY {#partition_by} |
| 298 | + |
| 299 | +```sql |
| 300 | +PARTITION BY <partition_1> [ ... , <partition_N> ] |
| 301 | + |
| 302 | +<partition> ::= { <column_names> | <expression> } |
| 303 | +``` |
| 304 | + |
| 305 | +`PARTITION BY` partitions the source data into multiple non-overlapping groups, each used for an independent pattern search. If the expression is not specified, all data is processed as a single group. Records with the same values of the columns listed after `PARTITION BY` fall into the same group. |
| 306 | + |
| 307 | +#### **Example** {#partition_by-example} |
| 308 | + |
| 309 | +```sql |
| 310 | +PARTITION BY device_id, zone_id |
| 311 | +``` |
| 312 | + |
| 313 | +## Limitations {#limitations} |
| 314 | + |
| 315 | +Our support for the `MATCH_RECOGNIZE` expression will eventually comply with [SQL-2016](https://ru.wikipedia.org/wiki/SQL:2016); currently, however, the following limitations apply: |
| 316 | + |
| 317 | +- [`MEASURES`](#measures). Functions `PREV`/`NEXT` are not supported. |
| 318 | +- [`AFTER MATCH SKIP`](#after_match_skip). Only the `AFTER MATCH SKIP TO NEXT ROW` and `AFTER MATCH SKIP PAST LAST ROW` modes are supported. |
| 319 | +- [`PATTERN`](#pattern). Union pattern variables are not implemented. |
| 320 | +- [`DEFINE`](#define). Aggregation functions (e.g., `AVG`, `MIN`, or `MAX`) and the `PREV`/`NEXT` functions are not supported. |
| 321 | + |
| 322 | +## Example of usage {#example} |
| 323 | + |
| 324 | +Here is a hands-on example of pattern recognition in a data table produced by an IoT device whose button presses trigger events. Let's assume you need to find and process the following sequence of button clicks: `button 1`, `button 2`, and `button 3`. |
| 325 | + |
| 326 | +The input data looks like this: |
| 327 | + |
| 328 | +|ts|button|device_id|zone_id| |
| 329 | +|:-:|:-:|:-:|:-:| |
| 330 | +|600|3|17|3| |
| 331 | +|500|3|4|2| |
| 332 | +|400|2|17|3| |
| 333 | +|300|2|4|2| |
| 334 | +|200|1|17|3| |
| 335 | +|100|1|4|2| |
| 336 | + |
| 337 | +The body of the SQL query looks like this: |
| 338 | + |
| 339 | +```sql |
| 340 | +PRAGMA FeatureR010="prototype"; -- pragma for enabling MATCH_RECOGNIZE |
| 341 | + |
| 342 | +SELECT * FROM input MATCH_RECOGNIZE ( -- Perform pattern matching on the input table |
| 343 | + PARTITION BY device_id, zone_id -- Partition the input data into groups by the device_id and zone_id columns |
| 344 | + ORDER BY ts -- Process events in ascending order of the ts column |
| 345 | + MEASURES |
| 346 | + LAST(B1.ts) AS b1, -- Return the latest timestamp of a button 1 click |
| 347 | + LAST(B3.ts) AS b3 -- Return the latest timestamp of a button 3 click |
| 348 | + ONE ROW PER MATCH -- Return one result row per match |
| 349 | + AFTER MATCH SKIP TO NEXT ROW -- Resume the search from the row after the first row of the match |
| 350 | + PATTERN (B1 B2+ B3) -- Search for one button 1 click, one or more button 2 clicks, and one button 3 click |
| 351 | + DEFINE |
| 352 | + B1 AS B1.button = 1, -- B1 is a button 1 click (the button field equals 1) |
| 353 | + B2 AS B2.button = 2, -- B2 is a button 2 click (the button field equals 2) |
| 354 | + B3 AS B3.button = 3 -- B3 is a button 3 click (the button field equals 3) |
| 355 | +); |
| 356 | +``` |
| 357 | + |
| 358 | +Result: |
| 359 | + |
| 360 | +|b1|b3|device_id|zone_id| |
| 361 | +|:-:|:-:|:-:|:-:| |
| 362 | +|100|500|4|2| |
| 363 | +|200|600|17|3| |