Commit 9887ca8

sql: rfc on prepared statements
1 parent 4137134 commit 9887ca8

1 file changed, +332 -0 lines changed

doc/rfc/2592-prepared-statement.md

@@ -0,0 +1,332 @@
# sql: prepared statements

* **Status**: In progress
* **Start date**: 08-06-2019
* **Authors**: Nikita Pettik @korablev77 korablev@tarantool.org
* **Issues**: #2592, #3292

# Summary

Currently there is a single, unified way to execute SQL queries, available as a
local method (`box.execute()`) and a remote one (`nb:connect():execute()`, where
`nb` is the net-box module). Both have the same calling interface: they accept
a string containing the SQL statement to execute and, optionally, a list of
parameters to bind. For instance:

`box.execute("SELECT * FROM t WHERE a = ?", {12})`

The local version of `:execute()` runs the SQL query execution mechanism right
after invocation. The remote version builds an IProto request (with the
IPROTO_SQL_TEXT and IPROTO_SQL_BIND keys), sends it to the server and waits
for the response, which is sent back to the client once execution of the
statement has finished.

A prepared statement is a feature that allows the same (parameterized)
statement to be executed repeatedly without recompilation overhead. The typical
workflow with a prepared statement is as follows:

1. Preparation stage: the statement is first prepared (i.e. compiled into
   VDBE byte-code). At this point parameter markers are left unspecified.
   The compiled statement is saved in a cache and is re-compiled or invalidated
   only on demand (e.g. after a schema change).
   The `:prepare()` method returns a handle (an id, an object with an
   `:execute()` method, or whatever) that allows the query to be executed
   later. Along with the handle, `:prepare()` may return meta-information,
   including the types of the columns in the result set, the number of
   parameters and so forth.
2. Execution stage: using the handle, the query is located in the cache.
   If there are any variables to be bound, they are substituted first.
   Then the byte-code implementing the query is executed in the virtual machine.

Such a two-stage scheme has several advantages:

- It avoids query compilation overhead, which may turn out to be
  significant for queries with a short run time
  (e.g. `INSERT INTO t VALUES (?);`);
- Prepared statements are resilient against SQL injections;
- It is required to implement the functionality of SQL drivers (e.g. to support
  so-called dry-run execution: prepare can return meta-information
  without execution overhead).

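As a rough illustration of the two-stage workflow and of why bound parameters resist injection, here is a minimal sketch that uses Python's standard `sqlite3` module as a stand-in engine. This is an assumption made purely for illustration and is not Tarantool's API. `sqlite3` keeps compiled statements in an internal statement cache keyed by the SQL text, so repeating the same parameterized text only changes the bound values. The table `t` and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT)")

# One parameterized statement text; only the bound values differ per run.
insert_sql = "INSERT INTO t VALUES (?)"
values = ["x", "y", "'); DROP TABLE t; --"]
for v in values:
    conn.execute(insert_sql, (v,))

# The hostile-looking string is stored as plain data, never parsed as SQL.
rows = [r[0] for r in conn.execute("SELECT a FROM t ORDER BY rowid")]
row_count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
```

The third value would terminate the statement if it were spliced into the query text; as a bound parameter it is just another string in the table.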
# Other vendors' specifications

## MySQL

### SQL syntax [1]

```
PREPARE stmt FROM "SELECT SQRT(POW(?,2))";
SET @a = 2;
EXECUTE stmt USING @a;
```

As one can see, the PREPARE statement creates a named handle for the prepared
statement object, which is then passed to the EXECUTE statement along with the
values to be bound. A prepared statement can be deallocated (i.e. erased from
the cache) manually with the DEALLOCATE PREPARE statement.

### Protocol support [2]

COM_STMT_PREPARE is a command that creates a prepared statement from
the passed query string via MySQL's binary protocol. Its only argument
is a string containing the SQL query. On success, the server sends a
COM_STMT_PREPARE_OK response, which consists of the following fields (a few
of the least important ones are omitted):
```
- statement_id
- num_columns
- num_params
[optional] if num_params > 0
  [for each parameter]
  - parameter definition
[optional] if num_columns > 0
  [for each column]
  - column definition
```
A parameter definition may include the following properties [3]: table name,
column name, type, character set, length of fixed-length fields,
default value, etc. A column definition consists of the same fields,
so the two share the same format.

To execute a prepared statement, the protocol declares the COM_STMT_EXECUTE [4]
command. It takes the id of the statement to be executed and the list of
parameters to be bound. On success it returns an OK_Packet [5] or, for
statements that produce rows, a result set.

Note that there is no COM_STMT_PREPARE_AND_EXECUTE, i.e. the protocol always
requires a preparation step.

### Caching of Prepared Statements [6]

Since prepared statements are meant to be executed several times,
the server converts the statement to an internal structure and caches that
structure for use during execution. In MySQL the cache is session-local:
statements cached for one session are not accessible to other sessions.
When a session is closed, all of its statements are discarded. Moreover,
statements are kept up to date (automatically re-compiled) when internal
metadata changes because of DDL operations. To limit the number of prepared
statements, MySQL provides the `max_prepared_stmt_count` variable; setting it
to 0 disables prepared statements altogether.

[1] https://dev.mysql.com/doc/refman/8.0/en/sql-syntax-prepared-statements.html
[2] https://dev.mysql.com/doc/internals/en/com-stmt-prepare-response.html
[3] https://dev.mysql.com/doc/internals/en/com-query-response.html#packet-Protocol::ColumnDefinition
[4] https://dev.mysql.com/doc/internals/en/com-stmt-execute.html
[5] https://dev.mysql.com/doc/internals/en/packet-OK_Packet.html
[6] https://dev.mysql.com/doc/refman/5.6/en/statement-caching.html

## PostgreSQL

### SQL Syntax [1]

```
PREPARE fooplan (int, text, bool, numeric) AS INSERT INTO foo VALUES($1, $2, $3, $4);
EXECUTE fooplan(1, 'Hunter Valley', 't', 200.00);
```

Usage is quite similar to that in MySQL: PREPARE creates a named handle,
which can later be executed with the EXECUTE statement.

### Protocol support [2]

Each SQL command can be executed via one of two sub-protocols. The first
(simple) one [3] accepts a string containing the SQL statement from the
client. On the server side, the request is parsed and executed "in one
step", i.e. without separate preparation, parameter binding, etc. This is
quite similar to our current `:execute()` behaviour. The extended version of
the protocol [4] allows a query to be processed in a series of steps: prepare,
bind and execute.

First, the client sends a `Parse` message, which contains the SQL string and,
optionally, the name of the statement to be prepared, the number of
parameters and their types (the message formats are described in [5]).
It is worth noting that specifying types for bindings can be quite
useful for improving the static type system. If a prepared statement is
created unnamed (i.e. without a specified name), it lasts only until
the next `Parse` message that creates an unnamed statement.

Once a prepared statement exists, it can be readied for execution using a
`Bind` message. A `Bind` request accepts the name of the prepared statement,
the name of the destination portal (a portal is the entity produced by this
next stage of preparation), and the list of values to be bound. At this stage
query planning takes place, and the query plan can be cached if the query is
executed repeatedly.

Once a portal is created, it can be executed using an `Execute` message. This
request accepts the name of the portal and a maximum result-row count (which
allows execution to be suspended until the next `Execute` call, so that the
produced rows are sent in batches; that's why it is called a portal). In
addition, there are several optional request types, for instance `Describe`,
which returns meta-information about the result set.

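The Parse, Bind and Execute sequence described above can be sketched with a toy in-memory model. This is only an illustration of the message flow, not PostgreSQL's wire format: the `Session` and `Portal` classes and the `run` callback (which stands in for planning and execution) are hypothetical names. A row limit of zero is treated as "no limit", as in the PostgreSQL protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Portal:
    rows: List[Tuple]  # rows produced for this portal
    pos: int = 0       # index of the next row to send


class Session:
    """Toy model of the Parse -> Bind -> Execute message sequence."""

    def __init__(self, run):
        self.run = run          # hypothetical callback: (sql, values) -> list of rows
        self.statements = {}    # statement name -> SQL text
        self.portals = {}       # portal name -> Portal

    def parse(self, sql, name=""):
        # The unnamed statement ("") is overwritten by the next unnamed Parse.
        self.statements[name] = sql

    def bind(self, statement, portal, values):
        sql = self.statements[statement]
        # Planning happens at this stage; the toy version produces rows eagerly.
        self.portals[portal] = Portal(rows=self.run(sql, values))

    def execute(self, portal, max_rows=0):
        p = self.portals[portal]
        end = len(p.rows) if max_rows == 0 else min(p.pos + max_rows, len(p.rows))
        batch = p.rows[p.pos:end]
        p.pos = end
        suspended = p.pos < len(p.rows)  # True if more rows remain
        return batch, suspended


session = Session(run=lambda sql, values: [(i,) for i in range(values[0])])
session.parse("SELECT generate_series(0, $1 - 1)", name="gen")
session.bind("gen", "p1", [5])
first = session.execute("p1", max_rows=2)
second = session.execute("p1", max_rows=2)
rest = session.execute("p1")
```

Each `execute` call resumes where the previous one stopped, which is the batching behaviour that the row-count argument enables.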
### Caching of Prepared Statements [6]

Prepared statements in PostgreSQL are local to a session: they last for the
duration of the current session, and a single prepared statement cannot be
used by multiple simultaneous database clients. A prepared statement does
not necessarily get into the cache:

"If a prepared statement is executed enough times, the server may
eventually decide to save and re-use a generic plan rather than
re-planning each time."

[1] https://www.postgresql.org/docs/9.3/sql-prepare.html
[2] https://www.postgresql.org/docs/10/protocol-overview.html
[3] https://www.postgresql.org/docs/9.3/protocol-flow.html#AEN99807
[4] https://www.postgresql.org/docs/9.5/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY
[5] https://www.postgresql.org/docs/9.3/protocol-message-formats.html
[6] https://jdbc.postgresql.org/documentation/head/server-prepare.html

## MS SQL Server

It seems that a manual prepare/execute interface is obsolete, since
MS SQL Server provides automatic caching of queries based on their
text representation. For details see:

https://dba.stackexchange.com/questions/146092/microsoft-sql-server-prepared-statements

Still, one can use the sp_prepare/sp_execute interface, even though it is unnecessary:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-prepare-transact-sql?view=sql-server-2017

# Implementation details

## Interface

First, let's introduce a separate method `:prepare()`, which takes a string
containing the SQL statement and, optionally, a list containing the types of
the parameters to be bound. Local interface:
`box.prepare("SQL statement", {array_of_bindings})`.
`box.prepare()` (and its net-box analogue `nb:prepare()`) returns an object
comprising: the id of the prepared statement (calculated as the value of a hash
function applied to the original text of the SQL query), the count of
parameters to be bound, a map containing the types and names of the parameters,
and the names and types of the fields forming the result set.
The handle created by the local `box.prepare()` function also features
`:execute()` and `:unprepare()` methods. For example:
```
params_def = {}
-- positional parameter '?'
params_def[1] = "integer"
-- named parameter '@v'
params_def[2] = {}
params_def[2]['@v'] = "number"
local stmt = box.prepare("SELECT a, b, c FROM t WHERE a > ? AND a < @v", params_def)
```
`:prepare()` compiles the statement, saves it to the prepared statement
cache on the server side and returns to the client a handle to the object
representing the prepared statement.
```
tarantool> stmt
---
- stmt_id: 1307020572
  params_count: 2
  params:
  - name: '?'
    type: integer
  - name: '@v'
    type: number
  metadata:
  - name: A
    type: integer
  - name: B
    type: integer
  - name: C
    type: integer
  execute: 'function: 0x010e720450'
  unprepare: 'function: 0x030e430240'
...
```
To avoid breaking the current interface, let's assume that all unspecified
variables have the most general type, ANY:
```
cn:prepare("SELECT ?;")
---
- stmt_id: 1307020572
  params_count: 1
  params:
  - name: '?'
    type: ANY
  metadata:
  - name: '?'
    type: ANY
...
```
When bindings are about to be substituted (via an `:execute()` call), they
are first checked against the specified types. In case of a type mismatch, an
error is raised.

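The type check performed before substitution might look roughly like the sketch below. The `TYPE_CHECKS` table and the `check_bindings` helper are hypothetical; the mapping from SQL type names to Python types is an assumption made for illustration, with `ANY` accepting every value as described above.

```python
# Hypothetical mapping from declared type names to acceptance predicates.
TYPE_CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "ANY": lambda v: True,
}


def check_bindings(declared, values):
    """Raise TypeError if a bound value does not match its declared type."""
    if len(declared) != len(values):
        raise ValueError(f"expected {len(declared)} parameters, got {len(values)}")
    for pos, (type_name, value) in enumerate(zip(declared, values), start=1):
        if not TYPE_CHECKS[type_name](value):
            raise TypeError(
                f"parameter {pos}: expected {type_name}, got {type(value).__name__}"
            )


check_bindings(["integer", "number"], [1, 2.5])  # matches the declared types
check_bindings(["ANY"], ["anything"])            # an unspecified type accepts any value
```

Calling `check_bindings(["integer"], ["oops"])` raises `TypeError`, which corresponds to the error raised on a type mismatch.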
Remote `:prepare()` and `:unprepare()` build an IProto request with the new
IPROTO_PREPARE command. If the body contains the IPROTO_SQL_TEXT key, the
command is treated as a prepare request; if the body contains the
IPROTO_STMT_ID key, it is treated as an unprepare request. The IPROTO_EXECUTE
command is now overloaded in the same way: it accepts either the
IPROTO_SQL_TEXT or the IPROTO_STMT_ID key. Depending on the key in the
request, the execute command results either in the usual compile-and-execute
procedure or in execution of a prepared statement.

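The dispatch rule described above can be sketched as a small function. The string keys below are placeholders (the real IProto keys are numeric constants), the `classify` helper is hypothetical, and rejecting a body that carries both keys or neither is an assumption the RFC does not spell out.

```python
# Placeholder keys for illustration; real IProto keys are numeric constants.
IPROTO_SQL_TEXT = "IPROTO_SQL_TEXT"
IPROTO_STMT_ID = "IPROTO_STMT_ID"


def classify(command, body):
    """Decide what a request means from its command and body keys."""
    has_text = IPROTO_SQL_TEXT in body
    has_id = IPROTO_STMT_ID in body
    if has_text == has_id:
        # Assumption: exactly one of the two keys must be present.
        raise ValueError("body must contain exactly one of SQL text or statement id")
    if command == "IPROTO_PREPARE":
        return "prepare" if has_text else "unprepare"
    if command == "IPROTO_EXECUTE":
        return "compile-and-execute" if has_text else "execute-prepared"
    raise ValueError(f"unexpected command: {command}")


kind = classify("IPROTO_EXECUTE", {IPROTO_STMT_ID: 1307020572})
```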
## Prepared statement handle

Different vendors use different handles to identify prepared statements.
For instance, MySQL relies on sequential numeric ids; PostgreSQL gives a
unique character name to each prepared statement; Cassandra uses an MD5
hash of the original query as the prepared statement id. In the current
approach it is suggested to use numeric ids (values of a hash function applied
to the original string of the SQL query) as prepared statement identifiers.
So that users do not have to remember ids on the client side, they should
operate on the opaque `:execute()` method of the prepared statement's handle.
For example:
```
local stmt = box.prepare("SELECT ?;")
stmt:execute({1})
---
- rows:
  - [1]
```
Under the hood, a call to `:execute(args)` unfolds to
`box.execute(stmt.stmt_id, args)`, i.e. it automatically substitutes the
appropriate id of the prepared statement. The same applies to the
`:unprepare()` method.

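A minimal sketch of such an opaque handle follows. The `Engine` and `Handle` classes are hypothetical stand-ins for the server side and for the object returned by `:prepare()`. The RFC does not name the hash function, so `zlib.crc32` is used here purely as an assumption, and the "result" of execution is a placeholder tuple.

```python
import zlib


class Engine:
    """Stand-in for the server side: stores query text by statement id."""

    def __init__(self):
        self.prepared = {}

    def prepare(self, sql):
        # Assumed hash; any function of the query text would do for the sketch.
        stmt_id = zlib.crc32(sql.encode("utf-8"))
        self.prepared[stmt_id] = sql
        return Handle(self, stmt_id)

    def execute(self, stmt_id, args):
        sql = self.prepared[stmt_id]   # KeyError if the statement was unprepared
        return (sql, tuple(args))      # placeholder for a real result set

    def unprepare(self, stmt_id):
        del self.prepared[stmt_id]


class Handle:
    """Opaque handle: the user never passes the id around explicitly."""

    def __init__(self, engine, stmt_id):
        self._engine = engine
        self.stmt_id = stmt_id

    def execute(self, args=()):
        return self._engine.execute(self.stmt_id, args)

    def unprepare(self):
        self._engine.unprepare(self.stmt_id)


engine = Engine()
stmt = engine.prepare("SELECT ?;")
result = stmt.execute([1])
```

Because the id is derived from the query text, preparing the same text twice yields the same id, which is what makes sharing a cache entry possible.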
## Cache

There are two main approaches to cache implementation. The first one
assumes that the prepared statement cache is session-local; the second one,
that there is one cache which is global to all sessions. A session-local
cache allows specific queries to be kept in different places so that they
do not evict each other. For instance, DML requests can be executed
and cached through one session (`INSERT INTO t VALUES (?);`),
while data selection occurs through another one. However, in this case
sessions can't share a prepared statement object, which may lead to
performance issues. Thus, in Tarantool SQL it is suggested to use a global
holder for prepared statements. It is also worth mentioning that the cache is
in fact not a 'cache' in terms of invalidation policy: entries are erased
from it only on explicit unprepare requests or on session disconnect.
Moreover, after any DDL operation all prepared statements are considered
expired. An expired entry can't be executed without re-preparation. The size of
the cache is configured by the `box.cfg.sql_cache_size` option. There are also
statistics available in `box.info:sql().cache`:
- `size` is the total amount of memory consumed by prepared statements;
- `stmt_count` is the number of prepared statements.

When the `:execute()` method is called and the entry is found in the prepared
statement cache, the entry should be copied before execution. Cloning is
required since byte-code can modify itself during execution.
Moreover, it allows dealing with statement duplicates, like:
```
stmt1 = box.prepare("SELECT 1;")
stmt2 = box.prepare("SELECT 1;")

stmt1:unprepare()
stmt2:execute()
```
Instead of adding query duplicates to the prepared statement cache, the
reference counter of the corresponding prepared statement is incremented.
Note that duplicates imply that several sessions can share one prepared
statement. When the reference counter reaches zero, the prepared statement is
deleted. What is more, copying solves another problem. Suppose a huge SELECT
query that may yield (for instance, one calling a UDF with sleep()) is being
executed, and at the same time another session attempts to execute the same
prepared statement. Without copying, the second execution would fail (or
simply fall back to the compile-and-execute procedure), since an instance of a
prepared statement contains run-time attributes (program counter, memory cell
state and so on). Finally, another session could first invalidate the prepared
statement by executing a DDL operation and then re-compile the statement that
is currently being executed. This last scenario may have unpredictable
consequences.

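The reference counting and copy-before-execute behaviour described above can be modelled with a short sketch. The `CompiledStmt` and `StmtCache` classes are hypothetical, entries are keyed by query text for brevity (the RFC keys them by a hash-based id), and advancing `pc` stands in for the virtual machine running byte-code.

```python
import copy


class CompiledStmt:
    def __init__(self, sql):
        self.sql = sql
        self.pc = 0     # run-time state: program counter
        self.refs = 0   # number of prepare() calls holding this entry


class StmtCache:
    def __init__(self):
        self.entries = {}  # query text -> CompiledStmt

    def prepare(self, sql):
        entry = self.entries.get(sql)
        if entry is None:
            entry = self.entries[sql] = CompiledStmt(sql)
        entry.refs += 1    # a duplicate only bumps the counter
        return sql

    def unprepare(self, sql):
        entry = self.entries[sql]
        entry.refs -= 1
        if entry.refs == 0:
            del self.entries[sql]

    def execute(self, sql):
        # Run a private copy so the cached entry keeps its pristine state.
        running = copy.copy(self.entries[sql])
        running.pc += 3    # stand-in for executing byte-code
        return running


cache = StmtCache()
stmt1 = cache.prepare("SELECT 1;")
stmt2 = cache.prepare("SELECT 1;")
cache.unprepare(stmt1)
run = cache.execute(stmt2)
```

After `stmt1` is unprepared the entry survives with one reference left, so `stmt2` still executes, and the cached entry's program counter is unaffected by the run.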
It is assumed that there is no auto-caching (at least for now). This means
that a query gets into the prepared statement cache only if an explicit
`:prepare()` invocation has taken place, and it is removed only by
user request (or at the end of the session).

It is worth mentioning that each prepared statement is also assigned
the schema version that is current at the moment of its creation. If the
current schema version differs from the prepared statement's one, an error is
raised saying that the prepared statement has expired and requires
re-compilation.
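
The schema version check can be illustrated with a small sketch. The `Schema`, `Prepared` and `StatementExpired` names are hypothetical, and a single integer counter bumped on every DDL operation is an assumption about how the version is tracked.

```python
class StatementExpired(Exception):
    """Raised when a statement was prepared against an older schema."""


class Schema:
    def __init__(self):
        self.version = 0

    def ddl(self):
        self.version += 1  # any DDL operation bumps the schema version


class Prepared:
    def __init__(self, sql, schema):
        self.sql = sql
        self.schema = schema
        self.schema_version = schema.version  # remembered at creation time

    def execute(self):
        if self.schema_version != self.schema.version:
            raise StatementExpired("prepared statement is expired; re-prepare it")
        return "ok"


schema = Schema()
stmt = Prepared("SELECT a FROM t;", schema)
before_ddl = stmt.execute()
schema.ddl()  # after this, stmt.execute() raises StatementExpired
```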
