
Extended query pipeline to execute multiple independent queries in a single batch #2068

Closed
wants to merge 5 commits

Conversation

DXist (Contributor) commented Aug 24, 2022

This PR adds support for pipelined query execution: it uses the extended query protocol to run multiple independent queries and finishes the pipeline with a Sync message.

When there is no explicit transaction, the pipelined queries run within an implicit transaction. In the case of CockroachDB, the transaction can be automatically retried on conflict as long as the database server is able to fully buffer the response data.

This image illustrates pipelined execution, but only for a single query. The implemented pipeline prepares all batched queries outside of the implicit transaction.

The implementation could be extended with

  • ExecutePgPipeline trait, implemented for transaction, connection and pool instances
  • FetchPgPipeline trait, implemented for transaction and connection

I decided not to use traits with async methods. Instead I've implemented (a rough usage sketch follows the list):

  • execute_pipeline for PgPool and PgConnection
  • fetch_pipeline for PgConnection
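
A rough sketch of how the proposed API might be used (the PgPipeline type, its constructor and push() are assumed names for illustration; only execute_pipeline comes from the list above):

// Speculative usage sketch - not released sqlx API.
use sqlx::postgres::PgPool;

async fn create_user_and_log(pool: &PgPool, user_id: i64) -> sqlx::Result<()> {
    // Collect independent queries into one pipeline (hypothetical type and methods).
    let mut pipeline = sqlx::postgres::PgPipeline::new();
    pipeline.push(sqlx::query("INSERT INTO users (id) VALUES ($1)").bind(user_id));
    pipeline.push(sqlx::query("INSERT INTO audit_log (user_id) VALUES ($1)").bind(user_id));
    // All queries are sent in a single batch ending with one Sync message,
    // so they run in one implicit transaction and one network round trip.
    pool.execute_pipeline(&mut pipeline).await?;
    Ok(())
}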

abonander (Collaborator)

Do you have benchmarks showing if pipelining produces a tangible improvement? I'm not convinced it's worth the extra complexity.

DXist (Contributor, Author) commented Aug 26, 2022

@abonander, I don't understand what kind of complexity you mean.

I think pipelining doesn't need explicit 'BEGIN TRANSACTION' and 'COMMIT' statements as long as responses can be fully buffered by the database server (at least that's true for running a few INSERTs of related records atomically). So from my perspective pipelining is actually the simpler approach, since fewer Postgres protocol commands are needed.

Now I'll analyze end-to-end transaction time.

I assume the cumulative query processing time for multiple single queries and for a single pipeline is equal. What about communication overhead?

Pipelining needs only a single round trip to the database for all N queries in the pipeline.

Running multiple queries in an explicit transaction requires N + 2 round trips (1 round trip for 'BEGIN TRANSACTION', N round trips for the batched bind+execute+sync of each query, and 1 round trip for 'COMMIT'). The speed of light puts a lower bound on the RTT, so N + 2 round trips will always take longer than a single round trip. For example, with a 1 ms RTT and N = 3 queries that is 5 ms of network latency versus 1 ms.

What about transaction retries on SERIALIZABLE transaction isolation level?

In the pipeline case the database server can buffer responses. If they fit in the buffer, it can postpone sending them until the end of the pipeline. In case of conflict it can automatically retry the pipeline without involving the client. This is how CockroachDB behaves, and that's the DBMS I'm mainly interested in working with.

In the conversation-like approach, with an explicit transaction and queries issued one by one, the transaction stays open longer than in the pipeline case. A longer transaction has a higher conflict probability under contention and will require retries more often. Once a client has received the response for the first query in the transaction, the database server can no longer retry the transaction automatically. Client-side retries would require more round trips - N for each full retry.

Does this convince you? If not, I could prepare a benchmark for the happy case without retries. I'll use the existing docker-compose based configuration. With a local network between the client and db containers, the pipeline approach will show less improvement (if any) than over a real network between two machines and network devices.

abonander (Collaborator) commented Aug 26, 2022

Consider that you can get most of the benefits of pipelining by adjusting your query structure.

With Common Table Expressions (CTEs), you can execute multiple statements at once in the same round trip; they act as if they were independently executed, but can reference each other's outputs and have more consistent all-or-nothing behavior:

WITH inserted_foos AS (
    INSERT INTO foo(bar_id, baz) SELECT * FROM UNNEST($1::int8[], $2::text[])
    RETURNING *
), updated_bars AS (
    -- Side-effecting queries don't need to produce a result, they will still be executed
    UPDATE bar SET foo_id = inserted_foos.id FROM inserted_foos WHERE bar.id = inserted_foos.bar_id
)
-- This query cannot actually see the changes in the `foo` table itself as it's executed against the same snapshot.
SELECT * FROM inserted_foos

If you're selecting two independent records, you can use a single query with a lateral or full outer join:

SELECT foo.*, bar.*
FROM foo
FULL JOIN bar ON bar_id = $2
WHERE foo_id = $1

If you're doing multiple, large, independent queries you're going to get better throughput from using multiple connections.

The only situation that uniquely benefits from pipelining is executing multiple independent, medium-sized (>1 row but short enough that the backend is finished before the next request arrives) queries all at once, which doesn't really come up a lot in practice.

DXist (Contributor, Author) commented Aug 27, 2022

I'll illustrate my use case.

There are User and Organization entities linked via UserOrgLink.

My service provides the following API for User operations:

  • create User. Optionally, the User can be linked to a given Organization after it is created.
  • link User. Just link an existing User to an existing Organization.

I implemented these operations in a modular way. There are the following API structures and methods:

  • User::insert - it runs a simple INSERT against the Users table.
  • LinkUserTo::insert - it runs a simple INSERT against the UserOrgLinks table.
  • MaybeLinkedUser::insert - it calls User::insert and optionally (depending on the maybe_linked_user.link attribute, which is an Option<LinkUserTo>) calls LinkUserTo::insert.

With pipelining I could keep this modular structure and the simple queries, reuse the simple operations in multiple contexts, and fully express the needed control flow, which can depend on parameters or state outside of the database (rough sketch below).
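
A minimal sketch of that modular structure with plain sqlx calls (table and column names are made up, and sqlx's uuid feature is assumed); today each helper call is its own round trip, which is exactly what a pipeline would batch:

use sqlx::PgConnection;
use uuid::Uuid;

// User::insert - a simple INSERT against the Users table.
async fn insert_user(conn: &mut PgConnection, user_id: Uuid, name: &str) -> sqlx::Result<()> {
    sqlx::query("INSERT INTO Users (id, name) VALUES ($1, $2)")
        .bind(user_id)
        .bind(name)
        .execute(&mut *conn)
        .await?;
    Ok(())
}
// LinkUserTo::insert - a simple INSERT against the UserOrgLinks table.
async fn insert_link(conn: &mut PgConnection, user_id: Uuid, org_id: Uuid) -> sqlx::Result<()> {
    sqlx::query("INSERT INTO UserOrgLinks (user_id, org_id) VALUES ($1, $2)")
        .bind(user_id)
        .bind(org_id)
        .execute(&mut *conn)
        .await?;
    Ok(())
}
// MaybeLinkedUser::insert - composes the two helpers; the link is optional.
async fn insert_maybe_linked_user(conn: &mut PgConnection, user_id: Uuid, name: &str, link_to_org: Option<Uuid>) -> sqlx::Result<()> {
    insert_user(conn, user_id, name).await?;
    if let Some(org_id) = link_to_org {
        insert_link(conn, user_id, org_id).await?;
    }
    Ok(())
}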

I think Common Table Expressions are a really good option for dependent queries that don't depend on state outside of the database.

Regarding your example with large queries: if they have to run in the same transaction to get data from a single snapshot, then multiple connections are not an option.

Personally I'm not interested in the large-query case. I've mentioned in the PR that the result set is relatively small. I work with CockroachDB and expect it to stay under 16 KiB.

abonander (Collaborator) commented Aug 30, 2022

You can use a CTE for your use-case like so:

WITH inserted_user AS (
    INSERT INTO Users(username, password_hash) 
    VALUES ($1, $2) 
    RETURNING user_id
)
INSERT INTO UserOrgLinks(org_id, user_id)
SELECT $3, user_id
FROM inserted_user
RETURNING user_id -- if you want to get the created user ID out of the query

DXist (Contributor, Author) commented Aug 31, 2022

@abonander I generate ids client-side (UUIDv7).

The pipeline also contains a query that is not directly relevant to the User entity but saves the state needed for background processing by a separate application worker. The processing is done outside of the API request handler.

With the CTE approach I have to copy the INSERTs for each scenario (in my case, for "link present" and "link absent").

This will result

  • in monolithic code - the User insert logic has to include the INSERT that saves the additional request state
  • and in duplicated pieces of SQL - INSERT INTO Users is placed into both the "link present" CTE and the "link absent" CTE, and more changes to the SQL statements will be required when new columns are added.

In my opinion, a query pipeline is a different tool that can be used instead of complex queries with CTEs. Users could choose the preferable tool for each case.

abonander (Collaborator)

If you want to keep the queries separate then there's nothing stopping you. Pipelining is just an optimization, which again I'm not entirely convinced is necessary, nor worth the tradeoff of extra cognitive load on the user (having to think about "can these queries be pipelined? If they are pipelined what happens when one of the queries errors? What order should they execute in to maintain data consistency?" etc). I'd really rather see benchmarks showing that it can produce a tangible improvement before moving forward.

And we haven't even talked about the API design in this PR, which I'm honestly not really that impressed with. There's not really any consideration given to how the user is supposed to handle queries that return data, as all the results are combined into a single stream. Yeah, that's how Executor::fetch_many() works too, but that's not really designed for general use; it's the core primitive that other methods are built on.

So while I'm not giving a hard "no" to the idea of pipelining, I'm going to close this PR because of the above reasons and because I don't have the energy to continue debating, sorry. I'd recommend joining the existing discussion in #408, which I'm guessing you haven't seen since you didn't mention it at all. If you still want to work on those benchmarks I'd love to see your results there.

And if you're really dead-set on needing a SQL client that implements pipelining, tokio-postgres has it.

abonander closed this Aug 31, 2022
DXist (Contributor, Author) commented Sep 1, 2022

@abonander, my current needs, with multiple related INSERTs and client-side generated ids, are fully covered by pipeline.execute(). I'll try to find some time to prepare a benchmark for this method.

I agree that the fetch_pipeline method is not friendly for beginners - stream processing definitely requires some cognitive load and familiarity with combinator methods.

DXist (Contributor, Author) commented Sep 2, 2022

@abonander I generalized the transaction pipeline discussion to cover both explicit and implicit pipelines - #2082.

The idea is to protect application developers from dealing with concurrency issues and working with stale data. As a bonus, the approach also collapses queries into implicit pipelines, which optimizes the number of communication rounds with the database and unlocks the server-side auto-retry mechanism in the single implicit transaction case.

DXist (Contributor, Author) commented Sep 4, 2022

@abonander, I've rebased my branch and added the 'Close' command to the pipeline implementation.

Then I added a pg_pipeline benchmark that runs 3 INSERT queries both using the pipeline and issuing them one by one.

The benchmark uses the pipeline.execute() method. I tried two versions of one-query-at-a-time execution - non-transactional and transactional, the latter with explicit BEGIN / COMMIT statements. The transactional version is the apples-to-apples comparison.

I've got the following results (MacBook Pro 2019, local Postgres 14 in a container, Docker Desktop with the default hypervisor):

Non-transactional:

bench_pipeline/user_post_comment_3
                        time:   [2.6775 ms 2.8063 ms 2.9559 ms]
                        change: [+0.0675% +5.5617% +11.665%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

bench_multiple_inserts/user_post_comment_3
                        time:   [4.5276 ms 4.6382 ms 4.7709 ms]
                        change: [-7.1142% -3.1412% +1.0713%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

The pipelined version completes in about 60.5% of the time spent by the one-query-at-a-time approach.

Transactional:

bench_pipeline/user_post_comment_3
                        time:   [2.5346 ms 2.6107 ms 2.7074 ms]
                        change: [-12.401% -6.9702% -1.5245%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

bench_multiple_inserts/user_post_comment_3
                        time:   [5.1436 ms 5.2735 ms 5.4387 ms]
                        change: [+9.5682% +13.698% +18.412%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

In this case the pipelined version takes only ~49.5% of the time of the explicit multi-statement transaction - a 2x speedup.

mplanchard commented Aug 16, 2023

Hey FWIW we'd also really like to see this feature. We're doing really performance-sensitive DB operations, and we're adding in some use of the outbox table pattern, where you insert and delete a record from a table in the same transaction and capture it with CDC.

As a result, an operation that used to look like:

  • insert some record into the DB

Now looks like:

  • begin txn
  • insert some record into the DB
  • insert outbox record into outbox table
  • delete outbox record from the outbox table
  • commit txn

There's no way we can do this using a CTE, b/c you can't start a transaction as part of a CTE, and you also can't modify the same row twice in a CTE, so doing something like `with ins as (insert ...) delete from some_table using ins where some_table.id = ins.id` doesn't work.

Even assuming a super optimistic 500 µs network hop to the DB, that increases the overhead for this operation to 2.5 ms (5 round trips). We can optimize a bit by putting the delete into a background thread, but at minimum the record insert and the outbox table insert have to occur within the same transaction, and we of course also need to begin and commit the transaction.
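
For reference, a minimal sqlx sketch of that operation as it stands today (table and column names are made up); every statement below costs its own round trip, which is what pipelining would collapse:

use sqlx::PgPool;
use uuid::Uuid;

async fn insert_with_outbox(pool: &PgPool, id: Uuid, payload: &str) -> sqlx::Result<()> {
    // begin txn - round trip 1
    let mut tx = pool.begin().await?;
    // insert the record - round trip 2
    sqlx::query("INSERT INTO records (id, payload) VALUES ($1, $2)")
        .bind(id)
        .bind(payload)
        .execute(&mut *tx)
        .await?;
    // insert the outbox record (CDC picks it up from the WAL) - round trip 3
    sqlx::query("INSERT INTO outbox (record_id, payload) VALUES ($1, $2)")
        .bind(id)
        .bind(payload)
        .execute(&mut *tx)
        .await?;
    // delete the outbox record again to keep the table small - round trip 4
    sqlx::query("DELETE FROM outbox WHERE record_id = $1")
        .bind(id)
        .execute(&mut *tx)
        .await?;
    // commit txn - round trip 5
    tx.commit().await?;
    Ok(())
}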

We can definitely add in an entirely separate DB driver and connection pool and use tokio-postgres to accomplish this, but it would be great to use our existing sqlx queries for model insertion and so on. I played around a fair bit with trying to get something working with sqlx::Executor::execute_many() for this, but I can't get that to work at all in postgres whatsoever, and it seems to be totally undocumented.

mplanchard

So it would be nice to either see pipelining implemented in some fashion, or execute_many() made workable, or otherwise any way of executing n queries without paying the cost of n network round trips.

richardhenry

+1 for this feature
