
Do we care about big comments in our SQL queries? #14192

Closed
MadLittleMods opened this issue Oct 15, 2022 · 3 comments

Comments

@MadLittleMods
Contributor

MadLittleMods commented Oct 15, 2022

As brought up by @erikjohnston in #13816 (comment), it may not be totally ideal to have these big SQL comments in the query we send over the wire to Postgres/SQLite.

I have been in the habit of writing some pretty verbose comments in the SQL queries I work on, which provide context and answer the "why" behind what we are doing. I find these comments extremely useful for breaking down a SQL query into something understandable and for explaining intentions in case of bugs down the road. For example, the following explains the long-lost context around why we care about looking at non-state edges.

/* We only care about non-state edges because we used to use
 * `event_edges` for two different sorts of "edges" (the current
 * event DAG, but also a link to the previous state, for state
 * events). These legacy state event edges can be distinguished by
 * `is_state` and are removed from the codebase and schema but
 * because the schema change is in a background update, it's not
 * necessarily safe to assume that it will have been completed.
 */
AND edge.is_state is ? /* False */

These comments could sit above the SQL query instead, but I feel like they make a lot more sense mixed into the query itself. Some of the comments would be hard to tie back to specific query statements if they were all in one big block of paragraphs above. See the full SQL query from the example below; that's a lot of comments to mentally piece back to the query if they were up top:

sql = f"""
    SELECT backward_extrem.event_id, event.depth FROM events AS event
    /**
     * Get the edge connections from the event_edges table
     * so we can see whether this event's prev_events points
     * to a backward extremity in the next join.
     */
    INNER JOIN event_edges AS edge
    ON edge.event_id = event.event_id
    /**
     * We find the "oldest" events in the room by looking for
     * events connected to backwards extremities (oldest events
     * in the room that we know of so far).
     */
    INNER JOIN event_backward_extremities AS backward_extrem
    ON edge.prev_event_id = backward_extrem.event_id
    /**
     * We use this info to make sure we don't retry to use a backfill point
     * if we've already attempted to backfill from it recently.
     */
    LEFT JOIN event_failed_pull_attempts AS failed_backfill_attempt_info
    ON
        failed_backfill_attempt_info.room_id = backward_extrem.room_id
        AND failed_backfill_attempt_info.event_id = backward_extrem.event_id
    WHERE
        backward_extrem.room_id = ?
        /* We only care about non-state edges because we used to use
         * `event_edges` for two different sorts of "edges" (the current
         * event DAG, but also a link to the previous state, for state
         * events). These legacy state event edges can be distinguished by
         * `is_state` and are removed from the codebase and schema but
         * because the schema change is in a background update, it's not
         * necessarily safe to assume that it will have been completed.
         */
        AND edge.is_state is ? /* False */
        /**
         * We only want backwards extremities that are older than or at
         * the same position of the given `current_depth` (where older
         * means less than the given depth) because we're looking backwards
         * from the `current_depth` when backfilling.
         *
         *   current_depth (ignore events that come after this, ignore 2-4)
         *   |
         *   ▼
         * <oldest-in-time> [0]<--[1]<--[2]<--[3]<--[4] <newest-in-time>
         */
        AND event.depth <= ? /* current_depth */
        /**
         * Exponential back-off (up to the upper bound) so we don't retry the
         * same backfill point over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
         *
         * We use `1 << n` as a power of 2 equivalent for compatibility
         * with older SQLites. The left shift equivalent only works with
         * powers of 2 because left shift is a binary operation (base-2).
         * Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
         */
        AND (
            failed_backfill_attempt_info.event_id IS NULL
            OR ? /* current_time */ >= failed_backfill_attempt_info.last_attempt_ts + (
                (1 << {least_function}(failed_backfill_attempt_info.num_attempts, ? /* max doubling steps */))
                * ? /* step */
            )
        )
    /**
     * Sort from highest (closest to the `current_depth`) to the lowest depth
     * because the closest are most relevant to backfill from first.
     * Then tie-break on alphabetical order of the event_ids so we get a
     * consistent ordering which is nice when asserting things in tests.
     */
    ORDER BY event.depth DESC, backward_extrem.event_id DESC
    LIMIT ?
    """

Potential solutions

1. Does it even matter?

Do the extra comments matter at all? Do they really affect performance in any meaningful way, whether over the wire, inside the SQL engine, etc.?

Surely, the extra indentation spaces from the Python """ multi-line strings add just as many bytes if we're worried about that.

2. Strip comments out of the query

Can we get away with a quick and greedy find/replace of everything between `/* ... */`? Probably not totally safe in general, but it would likely work for our queries 🤷‍♀️

Running the find/replace on every call doesn't sound ideal, though there are ways we could memoize and cache the stripped result.

We already strip out the newlines and trim the whitespace from each line in _make_sql_one_line, but that's only used for logging and doesn't look particularly optimized by eye.
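For illustration, a sketch of what a memoized stripper could look like (the helper is hypothetical, not existing code, and it inherits the "not totally safe" caveat above, e.g. it would mangle a /* appearing inside a string literal):

import re
from functools import lru_cache

_SQL_COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)

@lru_cache(maxsize=None)
def strip_sql_comments(sql: str) -> str:
    # Drop each /* ... */ block (non-greedy, so we stop at the first */),
    # then collapse the leftover whitespace so the cached result is
    # what actually goes over the wire.
    without_comments = _SQL_COMMENT_RE.sub(" ", sql)
    return re.sub(r"\s+", " ", without_comments).strip()

Since our query strings are effectively constants, the lru_cache would mean the regex work happens once per distinct query rather than once per call.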

3. Use adjacent string literals to mimic the same multiline string

Adjacent string literals are concatenated together when inside parentheses, which mimics the """ multiline string pretty well. This gives us the benefit of spreading an SQL query over multiple lines while putting the comments in between each line as before.

This solution kinda seems the best if we're worried about it. It's just slightly more boilerplate faff to wrangle the " characters and to remember the leading space for when the pieces are concatenated together.

sql = (
    "SELECT e.stream_ordering, e.event_id, e.room_id, e.type,"
    " se.state_key, redacts, relates_to_id, membership, rejections.reason IS NOT NULL,"
    " e.outlier"
    " FROM events AS e"
    " LEFT JOIN redactions USING (event_id)"
    " LEFT JOIN state_events AS se USING (event_id)"
    " LEFT JOIN event_relations USING (event_id)"
    " LEFT JOIN room_memberships USING (event_id)"
    " LEFT JOIN rejections USING (event_id)"
    " WHERE ? < stream_ordering AND stream_ordering <= ?"
    " AND instance_name = ?"
    " ORDER BY stream_ordering ASC"
    " LIMIT ?"
)

@MadLittleMods added the A-Database (DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db) label Oct 15, 2022
@clokep
Member

clokep commented Oct 17, 2022

Surely, the extra indentation spaces from the Python """ multi-line strings add just as many bytes if we're worried about that.

I think these get de-dented by the engine before being sent over the wire.
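A minimal sketch of such a de-dent step, assuming plain textwrap.dedent semantics (I'm not claiming this is exactly what the engine does):

import textwrap

sql = """
    SELECT event_id
    FROM events
    WHERE room_id = ?
"""
# Removes the common leading indentation that the Python source layout
# added, so only the query's own whitespace would go over the wire.
sql = textwrap.dedent(sql)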

Adjacent string literals are concatenated together when inside parentheses, which mimics the """ multiline string pretty well. This gives us the benefit of spreading an SQL query over multiple lines while putting the comments in between each line as before.

Personally I find these much harder to read and reason about. You need to consider multiple "languages" at the same time (both the Python string syntax + SQL).

I think you're missing an option 4, which I would be in favor of: move the comments out of the SQL statement. I find the verbose comments hide the query itself, making it hard to see what is actually happening. I prefer having a long explanation above and then being able to see the query itself clearly.

@erikjohnston
Member

Drive-by comment:

  3. Use adjacent string literals to mimic the same multiline string

This makes it really difficult to actually test the queries as you can no longer easily copy and paste them. (I do it surprisingly frequently)

@DMRobertson added the T-Other (Questions, user support, anything else.) label Oct 17, 2022
@MadLittleMods
Contributor Author

I think you're missing an option 4, which I would be in favor of: move the comments out of the SQL statement. I find the verbose comments hide the query itself, making it hard to see what is actually happening. I prefer having a long explanation above and then being able to see the query itself clearly.

It's mentioned in the issue description, but its downsides are listed there as well, so I didn't include it as a potential solution since it isn't my preference.

I see that your preference is the opposite though.

Personally I find these much harder to read and reason about. You need to consider multiple "languages" at the same time (both the Python string syntax + SQL).

This makes it really difficult to actually test the queries as you can no longer easily copy and paste them. (I do it surprisingly frequently)

I hate the adjacent string literals too. It just seemed the closest to what we currently have while keeping the comments out of the query itself.



We discussed this in the backend team sync this morning and people weren't really worried about it. @DMRobertson added that the extra bytes probably aren't the bottleneck, and if they are, @richvdh mentioned that we should be using prepared statements anyway.
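For context, a rough sketch of the prepared-statement idea (psycopg2 and these statement/table names are illustrative assumptions, not how Synapse is actually wired up). With a server-side prepared statement the query text, comments and all, crosses the wire once; each later call sends only the statement name plus parameters:

import psycopg2

conn = psycopg2.connect("dbname=synapse")  # hypothetical DSN
cur = conn.cursor()

# Send the full (commented) query text once.
cur.execute("PREPARE backfill_points AS SELECT event_id FROM events WHERE room_id = $1")

# Subsequent executions only transmit the statement name and parameters.
cur.execute("EXECUTE backfill_points (%s)", ("!room:example.com",))
rows = cur.fetchall()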

Since this was more of a theoretical point with no real smoke to chase in the first place, I am going to close this issue.
