|
| 1 | +# sql: prepared statements |
| 2 | + |
| 3 | +* **Status**: In progress |
| 4 | +* **Start date**: 08-06-2019 |
| 5 | +* **Authors**: Nikita Pettik @korablev77 korablev@tarantool.org |
| 6 | +* **Issues**: #2592, #3292 |
| 7 | + |
| 8 | +# Summary |
| 9 | + |
| 10 | +Currently, there is a single way to execute SQL queries, exposed through a |
| 11 | +local method (box.execute()) and a remote one (nb:connect():execute(), where |
| 12 | +nb is the net-box module). Both functions have the same calling interface: |
| 13 | +they accept a string containing the SQL statement to be executed and, |
| 14 | +optionally, a list of parameters to be bound. For instance: |
| 15 | + |
| 16 | +`box.execute("SELECT * FROM t WHERE a = ?", {12})` |
| 17 | + |
| 18 | +The local version of :execute() invokes the SQL query execution mechanism |
| 19 | +right away. The remote version instead builds an IProto request, sends it |
| 20 | +to the server side (with IPROTO_SQL_TEXT and IPROTO_SQL_BIND keys) and waits |
| 21 | +for the response, which is sent back to the client once execution of the |
| 22 | +statement is finished. |
| 23 | + |
| 24 | +A prepared statement is a feature that allows executing the same |
| 25 | +(parameterized) statement repeatedly without recompilation overhead. The |
| 26 | +typical workflow with a prepared statement is as follows: |
| 27 | + |
| 28 | +1. Preparation stage: first, the statement is prepared (i.e. compiled into |
| 29 | + VDBE byte-code). At this point parameter markers are left unspecified. |
| 30 | + The compiled statement is saved into the cache and re-compiled or |
| 31 | + invalidated only on demand (e.g. after a schema change). |
| 32 | + The :prepare() method returns a handle (an id, an object with an :execute() |
| 33 | + method, or similar) which allows executing the query later. Alongside the |
| 34 | + handle, :prepare() may return meta-information including the types of |
| 35 | + columns in the result set, the number of parameters and so forth. |
| 36 | +2. Execution stage: using the handle, the query can be located in the cache. |
| 37 | + If there are any variables to be bound, they are substituted first. |
| 38 | + Then the byte-code implementing the query is executed in the virtual machine. |
| 39 | + |
| 40 | +Such a two-stage scheme has several advantages: |
| 41 | + |
| 42 | + - It avoids query compilation overhead, which may turn out to be |
| 43 | + significant for queries with a short run time |
| 44 | + (e.g. `INSERT INTO t VALUES (?);`); |
| 45 | + - Prepared statements are resilient against SQL injections; |
| 46 | + - It is required to implement functionality of SQL drivers (e.g. support |
| 47 | + for so-called dry-run execution: prepare allows returning meta-information |
| 48 | + without execution overhead). |
| 49 | + |
| 50 | +# Other vendors' specifications |
| 51 | + |
| 52 | +## MySQL |
| 53 | + |
| 54 | +### SQL syntax [1] |
| 55 | + |
| 56 | +``` |
| 57 | +PREPARE stmt FROM "SELECT SQRT(POW(?,2))"; |
| 58 | +SET @a = 2; |
| 59 | +EXECUTE stmt USING @a; |
| 60 | +``` |
| 61 | + |
| 62 | +As one can see, the PREPARE statement creates a named handle for the prepared |
| 63 | +statement object, which is then passed to the EXECUTE statement along with |
| 64 | +the values to be bound. A prepared statement can be deallocated (i.e. erased |
| 65 | +from the cache) manually with the DEALLOCATE PREPARE statement. |
| 66 | + |
| 67 | +### Protocol support [2] |
| 68 | + |
| 69 | +COM_STMT_PREPARE is a command which creates a prepared statement from |
| 70 | +the passed query string via MySQL's binary protocol. The only argument |
| 71 | +is a string containing the SQL query. On success, the server responds with |
| 72 | +COM_STMT_PREPARE_OK, which consists of the following fields (a few of the |
| 73 | +least important ones are omitted): |
| 74 | +``` |
| 75 | + - statement_id |
| 76 | + - num_columns |
| 77 | + - num_params |
| 78 | + [optional] if num_params > 0 |
| 79 | + [for each parameter] |
| 80 | + - parameter definition |
| 81 | + [optional] if num_columns > 0 |
| 82 | + [for each column] |
| 83 | + - column definition |
| 84 | +``` |
| 85 | +A parameter definition may include the following properties [3]: table name, |
| 86 | +column name, type, character set, length of fixed-length fields, |
| 87 | +default value etc. A column definition consists of the same fields, |
| 88 | +so the two share a single format. |
| 89 | + |
| 90 | +To execute a prepared statement, the protocol declares the COM_STMT_EXECUTE [4] |
| 91 | +command. It takes the id of the statement to be executed and a list of |
| 92 | +parameters to be bound. In case of success it returns OK_Packet [5]. |
| 93 | + |
| 94 | +Note that there is no COM_STMT_PREPARE_AND_EXECUTE, i.e. the protocol always |
| 95 | +requires a preparation step. |
| 96 | + |
| 97 | +### Caching of Prepared Statements [6] |
| 98 | + |
| 99 | +Since prepared statements are supposed to be executed several times, |
| 100 | +the server converts the statement to an internal structure and caches that |
| 101 | +structure to be used during execution. In MySQL the cache is session-local: |
| 102 | +statements cached for one session are not accessible to other sessions. |
| 103 | +When a session is closed, all its statements are discarded. Moreover, |
| 104 | +statements are kept up to date (automatically re-compiled) when internal |
| 105 | +metadata changes as a result of DDL operations. To limit the number of prepared |
| 106 | +statements, MySQL features the `max_prepared_stmt_count` variable; setting it |
| 107 | +to 0 disables prepared statements altogether. |
| 108 | + |
| 109 | +[1] https://dev.mysql.com/doc/refman/8.0/en/sql-syntax-prepared-statements.html |
| 110 | +[2] https://dev.mysql.com/doc/internals/en/com-stmt-prepare-response.html |
| 111 | +[3] https://dev.mysql.com/doc/internals/en/com-query-response.html#packet-Protocol::ColumnDefinition |
| 112 | +[4] https://dev.mysql.com/doc/internals/en/com-stmt-execute.html |
| 113 | +[5] https://dev.mysql.com/doc/internals/en/packet-OK_Packet.html |
| 114 | +[6] https://dev.mysql.com/doc/refman/5.6/en/statement-caching.html |
| 115 | + |
| 116 | +## PostgreSQL |
| 117 | + |
| 118 | +### SQL Syntax [1] |
| 119 | + |
| 120 | +``` |
| 121 | +PREPARE fooplan (int, text, bool, numeric) AS INSERT INTO foo VALUES($1, $2, $3, $4); |
| 122 | +EXECUTE fooplan(1, 'Hunter Valley', 't', 200.00); |
| 123 | +``` |
| 124 | + |
| 125 | +Usage is quite similar to that in MySQL: PREPARE creates a named handle, |
| 126 | +which can later be executed with the EXECUTE statement. |
| 127 | + |
| 128 | +### Protocol support [2] |
| 129 | + |
| 130 | +Each SQL command can be executed via one of two sub-protocols. The first |
| 131 | +(simple) one [3] accepts a string containing the SQL statement on the client |
| 132 | +side. On the server side, this request is parsed and executed "in one |
| 133 | +step", i.e. without preparation, binding parameters etc. This is quite |
| 134 | +similar to our current `:execute()` behaviour. The extended version of the |
| 135 | +protocol [4] allows processing a query in a series of steps: prepare, |
| 136 | +bind and execute. First, the client sends a `Parse` message, which contains |
| 137 | +the SQL string and optionally the name of the statement to be prepared, the |
| 138 | +number of parameters and their types (message formats are described in [5]). |
| 139 | +It is worth noting that specifying types for bindings can be quite |
| 140 | +meaningful in the scope of improving the static type system. If a prepared |
| 141 | +statement is created unnamed (i.e. without a specified name), it lasts only |
| 142 | +until the next `Parse` message creating an unnamed statement. Once a prepared |
| 143 | +statement exists, it can be readied for execution using a `Bind` message. |
| 144 | +A `Bind` request accepts the name of the prepared statement, the name of the |
| 145 | +destination portal (a portal is an entry of the next stage of preparation), |
| 146 | +and the list of values to be bound. At this stage query planning takes |
| 147 | +place, and the query plan can be cached if the query is executed repeatedly. |
| 148 | +Once a portal is created, it can be executed using an `Execute` message. This |
| 149 | +request accepts the name of the portal and a maximum result-row count (which |
| 150 | +allows suspending execution until the next `Execute` call and sending the |
| 151 | +produced rows in batches, which is why it is called a portal). In addition, |
| 152 | +there are several optional request types, for instance `Describe`, which |
| 153 | +returns meta-information about the result set. |
| 154 | + |
| 155 | +### Caching of Prepared Statements [6] |
| 156 | + |
| 157 | +Prepared statements in PostgreSQL are local to a session, which means that |
| 158 | +they last for the duration of the current session and a single prepared |
| 159 | +statement cannot be used by multiple simultaneous database clients. A prepared |
| 160 | +statement does not necessarily get into the cache: |
| 161 | + |
| 162 | +"If a prepared statement is executed enough times, the server may |
| 163 | +eventually decide to save and re-use a generic plan rather than |
| 164 | +re-planning each time." |
| 165 | + |
| 166 | +[1] https://www.postgresql.org/docs/9.3/sql-prepare.html |
| 167 | +[2] https://www.postgresql.org/docs/10/protocol-overview.html |
| 168 | +[3] https://www.postgresql.org/docs/9.3/protocol-flow.html#AEN99807 |
| 169 | +[4] https://www.postgresql.org/docs/9.5/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY |
| 170 | +[5] https://www.postgresql.org/docs/9.3/protocol-message-formats.html |
| 171 | +[6] https://jdbc.postgresql.org/documentation/head/server-prepare.html |
| 172 | + |
| 173 | +## MS SQL Server |
| 174 | + |
| 175 | +It seems that the manual prepare/execute interface is obsolete, since |
| 176 | +MS SQL Server provides automatic caching of queries based on their |
| 177 | +text representation. For details see: |
| 178 | + |
| 179 | +https://dba.stackexchange.com/questions/146092/microsoft-sql-server-prepared-statements |
| 180 | + |
| 181 | +Still, one can use the (no longer necessary) sp_prepare/sp_execute interface: |
| 182 | +https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-prepare-transact-sql?view=sql-server-2017 |
| 183 | + |
| 184 | +# Implementation details |
| 185 | + |
| 186 | +## Interface |
| 187 | + |
| 188 | +First, let's introduce a separate method :prepare() which takes a string |
| 189 | +containing the SQL statement and, optionally, a list of types of the |
| 190 | +parameters to be bound. Local interface: |
| 191 | +`box.prepare("SQL statement", {array_of_bindings})`. |
| 192 | +`box.prepare()` (and its net-box analogue `nb:prepare()`) returns an object |
| 193 | +comprising: the id of the prepared statement (calculated as the value of a |
| 194 | +hash function applied to the original text of the SQL query), the count of |
| 195 | +parameters to be bound, a map containing types and names of the parameters, |
| 196 | +and names and types of the fields forming the result set. The handle created by |
| 197 | +the local `box.prepare` function also features `:execute()` and `:unprepare()` methods. For example: |
| 198 | +``` |
| 199 | +params_def = {} |
| 200 | +params_def[1] = "integer" |
| 201 | +params_def[2] = {} |
| 202 | +params_def[2]['@v'] = "number" |
| 203 | +local stmt = box.prepare("SELECT a, b, c FROM t WHERE a > ? AND a < @v", params_def) |
| 204 | +``` |
| 205 | +`:prepare()` compiles the statement, saves it to the prepared statement |
| 206 | +cache on the server side and returns a handle to the object representing the |
| 207 | +prepared statement on the client side. |
| 208 | +``` |
| 209 | +tarantool> stmt |
| 210 | +--- |
| 211 | +- stmt_id: 1307020572 |
| 212 | +  params_count: 2 |
| 213 | +  params: |
| 214 | +  - name: '?' |
| 215 | +    type: integer |
| 216 | +  - name: '@v' |
| 217 | +    type: number |
| 218 | +  metadata: |
| 219 | +  - name: A |
| 220 | +    type: integer |
| 221 | +  - name: B |
| 222 | +    type: integer |
| 223 | +  - name: C |
| 224 | +    type: integer |
| 225 | +  execute: 'function: 0x010e720450' |
| 226 | +  unprepare: 'function: 0x030e430240' |
| 227 | +... |
| 228 | +``` |
| 229 | +To avoid breaking the current interface, let's assume that all unspecified |
| 230 | +variables have the most general type, ANY: |
| 231 | +``` |
| 232 | +cn:prepare("SELECT ?;") |
| 233 | +--- |
| 234 | +- stmt_id: 1307020572 |
| 235 | +  params_count: 1 |
| 236 | +  params: |
| 237 | +  - name: '?' |
| 238 | +    type: ANY |
| 239 | +  metadata: |
| 240 | +  - name: '?' |
| 241 | +    type: ANY |
| 242 | +... |
| 243 | +``` |
| 244 | +When bindings are about to be substituted (via an `:execute()` call), they |
| 245 | +are first checked against the specified types. In case of a type mismatch an |
| 246 | +error is raised. |
| 247 | + |
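The type check described above can be illustrated with a minimal sketch in plain Lua. The helper name `check_bindings`, the positional-only argument list and the mapping from the declared type names to Lua types are all assumptions made for this illustration; the actual check would live in the SQL binding code.

```lua
-- Hypothetical helper: validates bound values against the types declared
-- at prepare time. `params` mirrors the `params` list of a handle; only
-- positional values are handled in this sketch.
local function check_bindings(params, values)
    if #values ~= #params then
        return false, string.format("expected %d parameters, got %d",
                                    #params, #values)
    end
    for i, param in ipairs(params) do
        local v = values[i]
        local ok
        if param.type == "ANY" then
            ok = true                      -- unspecified type accepts anything
        elseif param.type == "integer" then
            ok = type(v) == "number" and v % 1 == 0
        elseif param.type == "number" then
            ok = type(v) == "number"
        else
            ok = false                     -- type not covered by this sketch
        end
        if not ok then
            return false, string.format("parameter %d (%s): expected %s",
                                        i, param.name, param.type)
        end
    end
    return true
end
```

With the parameter list from the example above, `check_bindings(params, {1, 2.5})` passes, while `{1.5, 2}` is rejected because the first value is not an integer.
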
| 248 | +Remote `:prepare()` and `:unprepare()` build an IProto request with the new |
| 249 | +IPROTO_PREPARE command. If the body contains the IPROTO_SQL_TEXT key, the |
| 250 | +command is treated as a prepare request; if the body contains the |
| 251 | +IPROTO_STMT_ID key, it is treated as an unprepare request. What is more, the |
| 252 | +IPROTO_EXECUTE command is now overloaded in the same way: it can accept either |
| 253 | +the IPROTO_SQL_TEXT or the IPROTO_STMT_ID key. Depending on the request key, the |
| 254 | +execute command results either in the usual compile-and-execute procedure or |
| 255 | +in execution of a prepared statement. |
| 256 | + |
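The key-based overloading can be sketched as a small dispatch function. The command and key names come from the text above; representing a request as a Lua table with `command` and `body` fields, and the returned action names, are assumptions made only for this sketch.

```lua
-- Sketch of server-side dispatch on the keys present in the request body.
-- The request layout and the action names are hypothetical.
local function dispatch(request)
    local body = request.body
    if request.command == "IPROTO_PREPARE" then
        if body.IPROTO_SQL_TEXT ~= nil then
            return "prepare"
        elseif body.IPROTO_STMT_ID ~= nil then
            return "unprepare"
        end
    elseif request.command == "IPROTO_EXECUTE" then
        if body.IPROTO_SQL_TEXT ~= nil then
            return "compile_and_execute"
        elseif body.IPROTO_STMT_ID ~= nil then
            return "execute_prepared"
        end
    end
    return nil, "malformed request"
end
```

One consequence of this design is that no separate unprepare command code is needed: the presence of a statement id instead of SQL text is enough to tell the two requests apart.
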
| 257 | +## Prepared statement handle |
| 258 | + |
| 259 | +Different vendors use different handles to identify prepared statements. |
| 260 | +For instance, MySQL relies on sequential numeric ids; PostgreSQL gives |
| 261 | +unique character names to each prepared statement; Cassandra uses an MD5 |
| 262 | +hash of the original query as a prepared statement id. In the current approach |
| 263 | +it is suggested to use numeric ids (values of a hash function applied to the |
| 264 | +original string of the SQL query) as prepared statement identifiers. So that |
| 265 | +users need not remember ids on the client side, they should operate on the |
| 266 | +opaque `:execute()` method of the prepared statement's handle. For example: |
| 267 | +``` |
| 268 | +local stmt = box.prepare("SELECT ?;") |
| 269 | +stmt:execute({1}) |
| 270 | +--- |
| 271 | +- rows: |
| 272 | + - [1] |
| 273 | +``` |
| 274 | +A call of `:execute(args)` unfolds under the hood to |
| 275 | +`box.execute(stmt.stmt_id, args)`, i.e. it automatically substitutes the |
| 276 | +appropriate id of the prepared statement. The same applies to the `:unprepare()` method. |
| 277 | + |
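How such a handle could hide the id is sketched below in plain Lua. `make_handle`, `execute_fn` and `unprepare_fn` are hypothetical names; the two functions stand in for the real entry points (for instance `box.execute` called with an id), which are replaced here by stubs that only record their arguments.

```lua
-- Builds a handle whose methods forward the stored id to the back end.
-- `execute_fn` and `unprepare_fn` are stand-ins for the real entry points.
local function make_handle(stmt_id, execute_fn, unprepare_fn)
    local handle = { stmt_id = stmt_id }
    function handle:execute(args)
        return execute_fn(self.stmt_id, args or {})
    end
    function handle:unprepare()
        return unprepare_fn(self.stmt_id)
    end
    return handle
end

-- Usage with stub back ends that just record what they were given.
local calls = {}
local stmt = make_handle(1307020572,
    function(id, args) calls[#calls + 1] = { "execute", id, args[1] } end,
    function(id) calls[#calls + 1] = { "unprepare", id } end)
stmt:execute({1})
stmt:unprepare()
```

The caller never touches `stmt_id` directly, which keeps the id format free to change later.
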
| 278 | +## Cache |
| 279 | + |
| 280 | +There are two main approaches to cache implementation. The first one |
| 281 | +assumes that the prepared statement cache is session-local; the second one, |
| 282 | +that there is one cache which is global to all sessions. A session-local |
| 283 | +cache allows specific queries to be kept in different places so that they |
| 284 | +do not evict each other. For instance, DML requests can be executed |
| 285 | +and cached through one session (`INSERT INTO t VALUES (?);`), |
| 286 | +while data selection occurs through another one. However, in this case |
| 287 | +sessions can't share one prepared statement object, which leads to possible |
| 288 | +performance issues. Thus, in Tarantool SQL it is suggested to use a global |
| 289 | +holder for prepared statements. Also, it is worth mentioning that the cache |
| 290 | +is in fact not a 'cache' in terms of invalidation policy: entries are erased |
| 291 | +from it only on explicit unprepare requests or on session disconnect. |
| 292 | +Moreover, after any DDL operation all prepared statements are considered to be |
| 293 | +expired. An expired entry can't be executed without re-preparation. The size of |
| 294 | +the cache is configured by the `box.cfg.sql_cache_size` option. There are also |
| 295 | +statistics available in box.info:sql().cache: |
| 296 | + - `size` is the total amount of memory consumed by prepared statements; |
| 297 | + - `stmt_count` is the number of prepared statements. |
| 298 | + |
| 299 | +When the `:execute()` method is called and the entry is found in the prepared |
| 300 | +statement cache, it should be copied before execution. Cloning is |
| 301 | +required since byte-code can modify itself during execution. |
| 302 | +Moreover, it allows dealing with statement duplicates, like: |
| 303 | +``` |
| 304 | +stmt1 = box.prepare("SELECT 1;") |
| 305 | +stmt2 = box.prepare("SELECT 1;") |
| 306 | + |
| 307 | +stmt1:unprepare() |
| 308 | +stmt2:execute() |
| 309 | +``` |
| 310 | +Instead of adding query duplicates to the prepared statement cache, |
| 311 | +the reference counter of the corresponding prepared statement is incremented. |
| 312 | +Note that duplicates imply that several sessions can share one prepared |
| 313 | +statement. When the reference counter reaches zero, the prepared statement is |
| 314 | +deleted. What is more, copying solves another problem. Without it, if a huge |
| 315 | +SELECT query which may contain yields (for instance, a UDF with sleep()) is |
| 316 | +running and another session attempts to execute the same prepared statement at |
| 317 | +the same time, the latter will fail (or simply fall back to the compile-and-execute |
| 318 | +procedure), since an instance of a prepared statement contains run-time attributes |
| 319 | +(program counter, memory cells state and so on). Finally, another session can |
| 320 | +first invalidate a prepared statement by executing a DDL operation, and then |
| 321 | +re-compile the statement currently being executed. The last scenario may result |
| 322 | +in unpredictable consequences. |
| 323 | + |
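The interplay of reference counting and copy-before-execute can be modelled in a few lines of plain Lua. This is only a toy model under stated assumptions: entries are keyed by the query text rather than by its hash, `compile` is a stand-in for compilation into VDBE byte-code, and a "program" is reduced to a table with a program counter.

```lua
-- Toy model of a global prepared statement holder with reference counts.
local cache = {}

local function compile(sql)
    -- Stand-in for real compilation; `pc` models run-time state.
    return { sql = sql, pc = 0 }
end

local function prepare(sql)
    local entry = cache[sql]
    if entry == nil then
        entry = { program = compile(sql), refs = 0 }
        cache[sql] = entry
    end
    entry.refs = entry.refs + 1       -- a duplicate only bumps the counter
    return sql                        -- the key plays the role of the id
end

local function unprepare(key)
    local entry = cache[key]
    if entry == nil then return false end
    entry.refs = entry.refs - 1
    if entry.refs == 0 then cache[key] = nil end
    return true
end

local function execute(key)
    local entry = cache[key]
    if entry == nil then return nil, "statement is not prepared" end
    -- Run a private copy so run-time state never touches the cached program.
    local copy = { sql = entry.program.sql, pc = entry.program.pc }
    copy.pc = copy.pc + 1             -- pretend the virtual machine advanced
    return copy
end
```

In this model the scenario above behaves as intended: after two `prepare("SELECT 1;")` calls and one `unprepare`, the statement can still be executed, and the cached program's `pc` stays at 0 because only the copy was run.
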
| 324 | +It is supposed that there's no auto-caching (at least for now). It means |
| 325 | +that a query can get into the prepared statement cache only if an explicit |
| 326 | +`:prepare()` invocation has taken place, and it is removed from the cache only |
| 327 | +by user request (or at the end of the session). |
| 328 | + |
| 329 | +It is worth mentioning that each prepared statement is also tagged with |
| 330 | +the schema version at the moment of its creation. If the current schema version |
| 331 | +differs from the prepared statement's one, an error is raised saying that the |
| 332 | +prepared statement is expired and requires re-compilation. |
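
The expiration check can be sketched as follows. The variable `schema_version`, the helper `run_ddl` and the error text are assumptions made for this illustration; they only model the comparison described above.

```lua
-- Stand-in for the current schema version of the instance.
local schema_version = 1

local function prepare(sql)
    -- Remember the schema version the statement was compiled against.
    return { sql = sql, schema_version = schema_version }
end

local function execute(stmt)
    if stmt.schema_version ~= schema_version then
        return nil, "prepared statement is expired and requires re-compilation"
    end
    return "executed: " .. stmt.sql
end

local function run_ddl()
    -- Any DDL operation bumps the schema version, expiring older statements.
    schema_version = schema_version + 1
end
```

A statement prepared before `run_ddl()` executes normally until the DDL happens and is rejected afterwards; preparing it again yields a statement tagged with the new version.
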