Extended query pipeline to execute multiple independent queries in a single batch #2068
Conversation
Pipeline-related functionality is isolated in a dedicated module.
Do you have benchmarks showing if pipelining produces a tangible improvement? I'm not convinced it's worth the extra complexity.
@abonander , I don't understand what kind of complexity you mean. I think pipelining doesn't need explicit 'BEGIN TRANSACTION' and 'COMMIT' statements as long as the responses can be fully buffered by the database server (at least that's true for running a few INSERTs of related records atomically). So from my perspective pipelining is actually the simpler approach, since fewer Postgres protocol commands are needed.

Now let's analyze end-to-end transaction time. I assume the cumulative query processing time for multiple single queries and for a single pipeline is equal. What about communication overhead? Pipelining needs a single round trip to the database for N queries in the pipeline. Running multiple queries in an explicit transaction requires N + 2 round trips (1 'BEGIN TRANSACTION' round trip, N round trips for the batched bind+execute+sync of each query, 1 'COMMIT' round trip). The speed of light (the lower bound on RTT) is finite, so N + 2 round trips will take longer than a single round trip.

What about transaction retries at the SERIALIZABLE transaction isolation level? In the pipeline case the database server can buffer responses. If they fit in the buffer, it can postpone the responses until the end of the pipeline, and in case of conflict it can automatically retry the pipeline without involving the client. This is how CockroachDB behaves, and that's the DBMS I'm mainly interested in working with. With the conversation-like approach, where queries are issued one by one inside an explicit transaction, the transaction is longer than in the pipeline case. A longer transaction has a higher conflict probability under contention and will require retries more often. And once the client has received a response for the first query in a transaction, the database server can no longer retry the transaction automatically; client-side retries cost more round trips, N for each full retry.

Does that convince you? If not, I could prepare a benchmark for the happy case without retries. I'll use the existing docker-compose based configuration. With a local network between the client and db containers the pipeline approach will show less improvement (if any) than with a real network between two machines and network devices.
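To make the round-trip arithmetic concrete, here is a minimal sketch of the explicit-transaction baseline using sqlx's existing API (table names and values are invented for illustration); a pipeline would send the same statements plus a single Sync in one round trip:

```rust
use sqlx::PgPool;

// N = 2 queries issued inside an explicit transaction: N + 2 round trips.
async fn insert_related(pool: &PgPool) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?; // round trip 1: BEGIN

    sqlx::query("INSERT INTO parent(id, name) VALUES ($1, $2)")
        .bind(1_i64)
        .bind("example")
        .execute(&mut *tx)
        .await?; // round trip 2

    sqlx::query("INSERT INTO child(id, parent_id) VALUES ($1, $2)")
        .bind(10_i64)
        .bind(1_i64)
        .execute(&mut *tx)
        .await?; // round trip 3

    tx.commit().await?; // round trip 4: COMMIT
    Ok(())
}
```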
Consider that you can get most of the benefits of pipelining by adjusting your query structure. With Common Table Expressions (CTEs), you can execute multiple statements at once in the same round trip, which act as if they were independently executed, but can reference each others' outputs and have a more consistent all-or-nothing behavior:

```sql
WITH inserted_foos AS (
    INSERT INTO foo(bar_id, baz) SELECT * FROM UNNEST($1::int8[], $2::text[])
    RETURNING *
), updated_bars AS (
    -- Side-effecting queries don't need to produce a result, they will still be executed
    UPDATE bar SET foo_id = inserted_foos.id FROM inserted_foos WHERE bar.id = inserted_foos.bar_id
)
-- This query cannot actually see the changes in the `foo` table itself as it's executed against the same snapshot.
SELECT * FROM inserted_foos
```

If you're selecting two independent records, you can use a single query with a lateral or full outer join:

```sql
SELECT foo.*, bar.*
FROM foo
FULL JOIN bar ON bar_id = $2
WHERE foo_id = $1
```

If you're doing multiple, large, independent queries you're going to get better throughput from using multiple connections. The only situation that uniquely benefits from pipelining is executing multiple independent, medium-sized (>1 row but short enough that the backend is finished before the next request arrives) queries all at once, which doesn't really come up a lot in practice.
I'll illustrate my use case. There are User and Organization entities linked via UserOrgLink. My service provides the following API for User operations:
I implemented these operations in a modular way. There are the following API structures and methods:
With pipelining I could keep this modular structure and these simple queries. I could reuse simple operations in multiple contexts, fully express the needed control flow, and depend on parameters or state that live outside of the database. I think Common Table Expressions are a really good option for dependent queries that are independent of state outside of the database.

Regarding your example with large queries: if they are required to run in the same transaction to get data from a single snapshot, then multiple connections are not an option. Personally I'm not interested in the large-query case. I've mentioned in the MR that the result set is relatively small. I work with CockroachDB and expect it to stay under 16 KiB.
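A minimal sketch of what such a modular structure can look like against the current API, with each operation written over a generic executor so it can run on a pool, a connection, or a transaction (names are illustrative; the sqlx `uuid` feature is assumed):

```rust
use sqlx::PgExecutor;
use uuid::Uuid;

// Each operation is a small, reusable query; the caller decides the context:
// a single statement, an explicit transaction, or (with this PR) a pipeline.
async fn insert_user(
    exec: impl PgExecutor<'_>,
    id: Uuid,
    name: &str,
) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT INTO users(id, name) VALUES ($1, $2)")
        .bind(id)
        .bind(name)
        .execute(exec)
        .await?;
    Ok(())
}

async fn link_user_to_org(
    exec: impl PgExecutor<'_>,
    user_id: Uuid,
    org_id: Uuid,
) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT INTO user_org_links(user_id, org_id) VALUES ($1, $2)")
        .bind(user_id)
        .bind(org_id)
        .execute(exec)
        .await?;
    Ok(())
}
```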
You can use a CTE for your use-case like so:

```sql
WITH inserted_user AS (
    INSERT INTO Users(username, password_hash)
    VALUES ($1, $2)
    RETURNING user_id
)
INSERT INTO UserOrgLinks(org_id, user_id)
SELECT $3, user_id
FROM inserted_user
RETURNING user_id -- if you want to get the created user ID out of the query
```
@abonander I generate ids client-side (UUIDv7). The pipeline also contains a query that is not related to the User entity directly but saves the state needed for background processing by a separate application worker. The processing is done outside of the API request handler. With the CTE approach I would have to copy the INSERTs for each scenario (in my case, link present and link absent). This will result in duplicated queries.

In my opinion, a query pipeline is a different tool that can be used instead of complex queries with CTEs. Users could choose the preferable tool for each case.
If you want to keep the queries separate then there's nothing stopping you. Pipelining is just an optimization, which again I'm not entirely convinced is necessary, nor worth the tradeoff of extra cognitive load on the user (having to think about "can these queries be pipelined? If they are pipelined, what happens when one of the queries errors? What order should they execute in to maintain data consistency?", etc.). I'd really rather see benchmarks showing that it can produce a tangible improvement before moving forward.

And we haven't even talked about the API design in this PR, which I'm honestly not really that impressed with. There's not really any consideration given to how the user is supposed to handle queries that return data, as all the results are combined into a single stream.

So while I'm not giving a hard "no" to the idea of pipelining, I'm going to close this PR for the above reasons and because I don't have the energy to continue debating, sorry. I'd recommend joining the existing discussion in #408, which I'm guessing you haven't seen since you didn't mention it at all. If you still want to work on those benchmarks I'd love to see your results there. And if you're really dead-set on needing a SQL client that implements pipelining, tokio-postgres has it.
@abonander , my current needs with multiple related INSERTs and client-side generated ids are fully covered by I agree that
@abonander I generalized the transaction pipeline discussion for both explicit and implicit pipelines in #2082. The idea is to protect application developers from dealing with concurrency issues and from working with stale data. As a bonus, the approach also collapses queries into implicit pipelines, which optimizes the number of communication rounds with the database and unlocks the benefit of the server-side autoretry mechanism in the single implicit transaction case.
@abonander , I've rebased my branch and added a 'Close' command to the pipeline implementation. I then added a pg_pipeline benchmark that runs 3 INSERT queries using a pipeline and, for comparison, issues them one by one. I've got the following results (MacBook Pro 2019, local Postgres 14 in a container, Docker Desktop with the default hypervisor):

Non-transactional: the pipelined version completes in about 60.5% of the time spent by the one-query-at-a-time approach.

Transactional: the pipelined version takes only about 49.5% of the time of the explicit multi-statement transaction, a 2x speedup.
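For context, a rough sketch of the shape of such a comparison (this is not the actual benchmark from the branch; the `bench` table is invented, and the pipelined counterpart would go through the `execute_pipeline` method proposed in this PR):

```rust
use std::time::Instant;
use sqlx::PgPool;

// Baseline: three INSERTs issued one at a time, each paying a full round trip.
async fn one_by_one(pool: &PgPool) -> Result<(), sqlx::Error> {
    for i in 0..3_i64 {
        sqlx::query("INSERT INTO bench(id) VALUES ($1)")
            .bind(i)
            .execute(pool)
            .await?;
    }
    Ok(())
}

async fn measure(pool: &PgPool) -> Result<(), sqlx::Error> {
    let start = Instant::now();
    one_by_one(pool).await?;
    println!("one by one: {:?}", start.elapsed());
    // The pipelined variant would submit the same three INSERTs through the
    // `execute_pipeline` method added by this PR, followed by a single Sync.
    Ok(())
}
```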
Hey FWIW we'd also really like to see this feature. We're doing really performance-sensitive DB operations, and we're adding in some use of the outbox table pattern, where you insert and delete a record from a table in the same transaction and capture it with CDC. As a result, an operation that used to look like:
Now looks like:
There's no way we can do this using a CTE, because you can't start a transaction as part of a CTE, and you also can't modify the same row twice in a CTE. Even assuming a super-optimistic network hop to the DB of 500 us, that increases the overhead for this operation to 2.5 ms. We can optimize a bit by putting the delete into a background thread, but at minimum the record insert and the outbox table insert have to occur within the same transaction, and we of course also need to begin and commit the transaction. We can definitely add in an entirely separate DB driver and connection pool.

So it would be nice to see pipelining implemented in some fashion.
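For concreteness, a sketch of the outbox sequence described above using sqlx's existing API (table and column names are hypothetical); each statement is its own round trip today, which is exactly what pipelining would collapse:

```rust
use sqlx::PgPool;

async fn create_record_with_outbox(
    pool: &PgPool,
    id: i64,
    payload: &str,
) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?; // BEGIN: round trip 1

    sqlx::query("INSERT INTO records(id, payload) VALUES ($1, $2)")
        .bind(id)
        .bind(payload)
        .execute(&mut *tx)
        .await?; // round trip 2

    sqlx::query("INSERT INTO outbox(record_id, payload) VALUES ($1, $2)")
        .bind(id)
        .bind(payload)
        .execute(&mut *tx)
        .await?; // round trip 3

    sqlx::query("DELETE FROM outbox WHERE record_id = $1")
        .bind(id)
        .execute(&mut *tx)
        .await?; // round trip 4: CDC captures the insert + delete

    tx.commit().await?; // COMMIT: round trip 5
    Ok(())
}
```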
+1 for this feature
This PR adds support for pipelined query execution, which uses the extended query protocol to run multiple independent queries and finishes the pipeline with a `Sync` message.

When there is no explicit transaction, the pipelined queries run within an implicit transaction. In the case of CockroachDB, the transaction can be automatically retried on conflict as long as the database server is able to fully buffer the response data.
This image illustrates pipelined execution, but only for a single query. The implemented pipeline prepares all batched queries outside of the implicit transaction.

The implementation could be extended with:

- an `ExecutePgPipeline` trait, implemented for transaction, connection and pool instances
- a `FetchPgPipeline` trait, implemented for transaction and connection

I decided not to use traits with async methods. Instead I've implemented:

- `execute_pipeline` for `PgPool` and `PgConnection`
- `fetch_pipeline` for `PgConnection`
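A rough usage sketch of that surface; the `PgPipeline` builder and its methods below are assumptions for illustration only, since only the `execute_pipeline` and `fetch_pipeline` method names come from this PR:

```rust
// Hypothetical illustration of the proposed API; the exact builder type and
// method signatures may differ from the actual implementation in this PR.
async fn create_user_pipelined(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    let mut pipeline = PgPipeline::new(); // assumed builder type
    pipeline.push(
        sqlx::query("INSERT INTO users(id, name) VALUES ($1, $2)")
            .bind(1_i64)
            .bind("alice"),
    );
    pipeline.push(
        sqlx::query("INSERT INTO user_org_links(user_id, org_id) VALUES ($1, $2)")
            .bind(1_i64)
            .bind(7_i64),
    );

    // All queries plus a single Sync are sent in one round trip; without an
    // explicit transaction they execute inside one implicit transaction.
    pool.execute_pipeline(pipeline).await?;
    Ok(())
}
```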