
Extended query pipeline to execute multiple independent queries in a single batch #2068

Closed
wants to merge 5 commits

Conversation

DXist (Contributor) commented Aug 24, 2022

This PR adds support for pipelined query execution: it uses the extended query protocol to run multiple independent queries and finishes the pipeline with a Sync message.

When there is no explicit transaction, the pipelined queries run within an implicit transaction. In the case of CockroachDB, the transaction can be automatically retried on conflict as long as the database server is able to fully buffer the response data.

This image illustrates pipelined execution, but only for a single query. The implemented pipeline prepares all batched queries outside of the implicit transaction.

The implementation could be extended with

  • ExecutePgPipeline trait, implemented for transaction, connection and pool instances
  • FetchPgPipeline trait, implemented for transaction and connection

I decided not to use traits with async methods. Instead I've implemented (a rough usage sketch follows the list):

  • execute_pipeline for PgPool and PgConnection
  • fetch_pipeline for PgConnection
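
A rough sketch of how the proposed API might be used (the PgPipeline type, its constructor and push() are assumed names for illustration; only execute_pipeline comes from the list above):

// Speculative usage sketch - not released sqlx API.
use sqlx::postgres::PgPool;

async fn create_user_and_log(pool: &PgPool, user_id: i64) -> sqlx::Result<()> {
    // Collect independent queries into one pipeline (hypothetical type and methods).
    let mut pipeline = sqlx::postgres::PgPipeline::new();
    pipeline.push(sqlx::query("INSERT INTO users (id) VALUES ($1)").bind(user_id));
    pipeline.push(sqlx::query("INSERT INTO audit_log (user_id) VALUES ($1)").bind(user_id));
    // All queries are sent in a single batch ending with one Sync message,
    // so they run in one implicit transaction and one network round trip.
    pool.execute_pipeline(&mut pipeline).await?;
    Ok(())
}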

abonander (Collaborator)

Do you have benchmarks showing if pipelining produces a tangible improvement? I'm not convinced it's worth the extra complexity.

DXist (Contributor, Author) commented Aug 26, 2022

@abonander, I don't understand what kind of complexity you mean.

I think pipelining doesn't need explicit 'BEGIN TRANSACTION' and 'COMMIT' statements as long as responses can be fully buffered by the database server (at least that's true for running a few INSERTs of related records atomically). So from my perspective pipelining is actually the simpler approach, since fewer Postgres protocol commands are needed.

Now I'll analyze end-to-end transaction time.

I assume the cumulative query processing time for multiple single queries and for a single pipeline is equal. What about communication overhead?

Pipelining needs only a single round trip to the database for all N queries in the pipeline.

Running multiple queries in an explicit transaction requires N + 2 round trips (1 round trip for 'BEGIN TRANSACTION', N round trips for the batched bind+execute+sync of each query, and 1 round trip for 'COMMIT'). The speed of light puts a lower bound on the RTT, so N + 2 round trips will always take longer than a single round trip. For example, with a 1 ms RTT and N = 3 queries that is 5 ms of network latency versus 1 ms.

What about transaction retries on SERIALIZABLE transaction isolation level?

In the pipeline case the database server can buffer responses. If they fit in the buffer, it can postpone sending them until the end of the pipeline. In case of conflict it can automatically retry the pipeline without involving the client. This is how CockroachDB behaves, and that's the DBMS I'm mainly interested in working with.

In the conversation-like approach, with an explicit transaction and queries issued one by one, the transaction stays open longer than in the pipeline case. A longer transaction has a higher conflict probability under contention and will require retries more often. Once a client has received the response for the first query in the transaction, the database server can no longer retry the transaction automatically. Client-side retries would require more round trips - N for each full retry.

Does this convince you? If not, I could prepare a benchmark for the happy case without retries. I'll use the existing docker-compose based configuration. With a local network between the client and db containers, the pipeline approach will show less improvement (if any) than over a real network between two machines and network devices.

abonander (Collaborator) commented Aug 26, 2022

Consider that you can get most of the benefits of pipelining by adjusting your query structure.

With Common Table Expressions (CTEs), you can execute multiple statements at once in the same round trip; they act as if they were independently executed, but can reference each other's outputs and have more consistent all-or-nothing behavior:

WITH inserted_foos AS (
    INSERT INTO foo(bar_id, baz) SELECT * FROM UNNEST($1::int8[], $2::text[])
    RETURNING *
), updated_bars AS (
    -- Side-effecting queries don't need to produce a result, they will still be executed
    UPDATE bar SET foo_id = inserted_foos.id FROM inserted_foos WHERE bar.id = inserted_foos.bar_id
)
-- This query cannot actually see the changes in the `foo` table itself as it's executed against the same snapshot.
SELECT * FROM inserted_foos

If you're selecting two independent records, you can use a single query with a lateral or full outer join:

SELECT foo.*, bar.*
FROM foo
FULL JOIN bar ON bar_id = $2
WHERE foo_id = $1

If you're doing multiple, large, independent queries you're going to get better throughput from using multiple connections.

The only situation that uniquely benefits from pipelining is executing multiple independent, medium-sized (>1 row but short enough that the backend is finished before the next request arrives) queries all at once, which doesn't really come up a lot in practice.

DXist (Contributor, Author) commented Aug 27, 2022

I'll illustrate my use case.

There are User and Organization entities linked via UserOrgLink.

My service provides the following API for User operations:

  • create User. Optionally, the User can be linked to a given Organization after it is created.
  • link User. Just link an existing User to an existing Organization.

I implemented these operations in a modular way. There are the following API structures and methods:

  • User::insert - it runs a simple INSERT against the Users table.
  • LinkUserTo::insert - it runs a simple INSERT against the UserOrgLinks table.
  • MaybeLinkedUser::insert - it calls User::insert and optionally (depending on the maybe_linked_user.link attribute, which is an Option<LinkUserTo>) calls LinkUserTo::insert.

With pipelining I could keep this modular structure and the simple queries, reuse the simple operations in multiple contexts, and fully express the needed control flow, which can depend on parameters or state outside of the database (rough sketch below).
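
A minimal sketch of that modular structure with plain sqlx calls (table and column names are made up, and sqlx's uuid feature is assumed); today each helper call is its own round trip, which is exactly what a pipeline would batch:

use sqlx::PgConnection;
use uuid::Uuid;

// User::insert - a simple INSERT against the Users table.
async fn insert_user(conn: &mut PgConnection, user_id: Uuid, name: &str) -> sqlx::Result<()> {
    sqlx::query("INSERT INTO Users (id, name) VALUES ($1, $2)")
        .bind(user_id)
        .bind(name)
        .execute(&mut *conn)
        .await?;
    Ok(())
}
// LinkUserTo::insert - a simple INSERT against the UserOrgLinks table.
async fn insert_link(conn: &mut PgConnection, user_id: Uuid, org_id: Uuid) -> sqlx::Result<()> {
    sqlx::query("INSERT INTO UserOrgLinks (user_id, org_id) VALUES ($1, $2)")
        .bind(user_id)
        .bind(org_id)
        .execute(&mut *conn)
        .await?;
    Ok(())
}
// MaybeLinkedUser::insert - composes the two helpers; the link is optional.
async fn insert_maybe_linked_user(conn: &mut PgConnection, user_id: Uuid, name: &str, link_to_org: Option<Uuid>) -> sqlx::Result<()> {
    insert_user(conn, user_id, name).await?;
    if let Some(org_id) = link_to_org {
        insert_link(conn, user_id, org_id).await?;
    }
    Ok(())
}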

I think Common Table Expressions are a really good option for dependent queries that don't depend on state outside of the database.

Regarding your example with large queries: if they have to run in the same transaction to get data from a single snapshot, then multiple connections are not an option.

Personally I'm not interested in the large-query case. I've mentioned in the PR that the result set is relatively small. I work with CockroachDB and expect it to stay under 16 KiB.

abonander (Collaborator) commented Aug 30, 2022

You can use a CTE for your use-case like so:

WITH inserted_user AS (
    INSERT INTO Users(username, password_hash) 
    VALUES ($1, $2) 
    RETURNING user_id
)
INSERT INTO UserOrgLinks(org_id, user_id)
SELECT $3, user_id
FROM inserted_user
RETURNING user_id -- if you want to get the created user ID out of the query

DXist (Contributor, Author) commented Aug 31, 2022

@abonander I generate ids client-side (UUIDv7).

The pipeline also contains a query that is not directly relevant to the User entity but saves the state needed for background processing by a separate application worker. The processing is done outside of the API request handler.

With the CTE approach I have to copy the INSERTs for each scenario (in my case, for "link present" and "link absent").

This will result

  • in monolithic code - the User insert logic has to include the INSERT that saves the additional request state
  • and in duplicated pieces of SQL - INSERT INTO Users is placed into both the "link present" CTE and the "link absent" CTE, and more changes to the SQL statements will be required when new columns are added.

In my opinion, a query pipeline is a different tool that can be used instead of complex queries with CTEs. Users could choose the preferable tool for each case.

abonander (Collaborator)

If you want to keep the queries separate then there's nothing stopping you. Pipelining is just an optimization, which again I'm not entirely convinced is necessary, nor worth the tradeoff of extra cognitive load on the user (having to think about "can these queries be pipelined? If they are pipelined what happens when one of the queries errors? What order should they execute in to maintain data consistency?" etc). I'd really rather see benchmarks showing that it can produce a tangible improvement before moving forward.

And we haven't even talked about the API design in this PR, which I'm honestly not really that impressed with. There's not really any consideration given to how the user is supposed to handle queries that return data, as all the results are combined into a single stream. Yeah, that's how Executor::fetch_many() works too, but that's not really designed for general use; it's the core primitive that other methods are built on.

So while I'm not giving a hard "no" to the idea of pipelining, I'm going to close this PR because of the above reasons and because I don't have the energy to continue debating, sorry. I'd recommend joining the existing discussion in #408, which I'm guessing you haven't seen since you didn't mention it at all. If you still want to work on those benchmarks I'd love to see your results there.

And if you're really dead-set on needing a SQL client that implements pipelining, tokio-postgres has it.

abonander closed this Aug 31, 2022
DXist (Contributor, Author) commented Sep 1, 2022

@abonander, my current needs, with multiple related INSERTs and client-side generated ids, are fully covered by pipeline.execute(). I'll try to find some time to prepare a benchmark for this method.

I agree that the fetch_pipeline method is not friendly for beginners - stream processing definitely requires some cognitive load and familiarity with combinator methods.

DXist (Contributor, Author) commented Sep 2, 2022

@abonander I generalized the transaction pipeline discussion to cover both explicit and implicit pipelines - #2082.

The idea is to protect application developers from dealing with concurrency issues and working with stale data. As a bonus, the approach also collapses queries into implicit pipelines, which optimizes the number of communication rounds with the database and unlocks the server-side auto-retry mechanism in the single implicit transaction case.

DXist (Contributor, Author) commented Sep 4, 2022

@abonander, I've rebased my branch and added the 'Close' command to the pipeline implementation.

Then I added a pg_pipeline benchmark that runs 3 INSERT queries both using the pipeline and issuing them one by one.

The benchmark uses the pipeline.execute() method. I tried two versions of one-query-at-a-time execution - non-transactional and transactional, the latter with explicit BEGIN / COMMIT statements. The transactional version is the apples-to-apples comparison.

I've got the following results (MacBook Pro 2019, local Postgres 14 in a container, Docker Desktop with the default hypervisor):

Non-transactional:

bench_pipeline/user_post_comment_3
                        time:   [2.6775 ms 2.8063 ms 2.9559 ms]
                        change: [+0.0675% +5.5617% +11.665%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

bench_multiple_inserts/user_post_comment_3
                        time:   [4.5276 ms 4.6382 ms 4.7709 ms]
                        change: [-7.1142% -3.1412% +1.0713%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

The pipelined version completes in about 60.5% of the time spent by the one-query-at-a-time approach.

Transactional:

bench_pipeline/user_post_comment_3
                        time:   [2.5346 ms 2.6107 ms 2.7074 ms]
                        change: [-12.401% -6.9702% -1.5245%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

bench_multiple_inserts/user_post_comment_3
                        time:   [5.1436 ms 5.2735 ms 5.4387 ms]
                        change: [+9.5682% +13.698% +18.412%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

In this case the pipelined version takes only ~49.5% of the time of the explicit multi-statement transaction - a 2x speedup.

mplanchard commented Aug 16, 2023

Hey FWIW we'd also really like to see this feature. We're doing really performance-sensitive DB operations, and we're adding in some use of the outbox table pattern, where you insert and delete a record from a table in the same transaction and capture it with CDC.

As a result, an operation that used to look like:

  • insert some record into the DB

Now looks like:

  • begin txn
  • insert some record into the DB
  • insert outbox record into outbox table
  • delete outbox record from the outbox table
  • commit txn

There's no way we can do this using a CTE, b/c you can't start a transaction as part of a CTE, and you also can't modify the same row twice in a CTE, so doing something like `with ins as (insert ...) delete from some_table using ins where some_table.id = ins.id` doesn't work.

Even assuming a super optimistic 500 µs network hop to the DB, that increases the overhead for this operation to 2.5 ms (5 round trips). We can optimize a bit by putting the delete into a background thread, but at minimum the record insert and the outbox table insert have to occur within the same transaction, and we of course also need to begin and commit the transaction.
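
For reference, a minimal sqlx sketch of that operation as it stands today (table and column names are made up); every statement below costs its own round trip, which is what pipelining would collapse:

use sqlx::PgPool;
use uuid::Uuid;

async fn insert_with_outbox(pool: &PgPool, id: Uuid, payload: &str) -> sqlx::Result<()> {
    // begin txn - round trip 1
    let mut tx = pool.begin().await?;
    // insert the record - round trip 2
    sqlx::query("INSERT INTO records (id, payload) VALUES ($1, $2)")
        .bind(id)
        .bind(payload)
        .execute(&mut *tx)
        .await?;
    // insert the outbox record (CDC picks it up from the WAL) - round trip 3
    sqlx::query("INSERT INTO outbox (record_id, payload) VALUES ($1, $2)")
        .bind(id)
        .bind(payload)
        .execute(&mut *tx)
        .await?;
    // delete the outbox record again to keep the table small - round trip 4
    sqlx::query("DELETE FROM outbox WHERE record_id = $1")
        .bind(id)
        .execute(&mut *tx)
        .await?;
    // commit txn - round trip 5
    tx.commit().await?;
    Ok(())
}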

We can definitely add in an entirely separate DB driver and connection pool and use tokio-postgres to accomplish this, but it would be great to use our existing sqlx queries for model insertion and so on. I played around a fair bit with trying to get something working with sqlx::Executor::execute_many() for this, but I can't get that to work at all in postgres whatsoever, and it seems to be totally undocumented.

mplanchard

So it would be nice to either see pipelining implemented in some fashion, or execute_many() made workable, or otherwise any way of executing n queries without paying the cost of n network round trips.

richardhenry

+1 for this feature
