
Conversation

@MilanTyagi2004

Fixes #37404

Changes proposed in this pull request:

  • In Apache ShardingSphere Proxy using the MySQL protocol, cancelling a running SQL statement does not correctly interrupt
    the underlying JDBC Statement in a cluster environment.

  • Although a Process instance is created for the execution, the MySQL thread ID is not bound to the connection session. As a
    result, KILL QUERY / Statement.cancel() cannot locate or terminate the actual running MySQL query.

  • This pull request binds the MySQL thread ID to the connection session after authentication, ensuring that cancellation signals
    are correctly propagated to the JDBC layer.
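
For illustration only, a minimal sketch of the binding idea; the names below (such as SessionThreadIdRegistry) are hypothetical and are not the actual classes touched by this PR:

```java
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry, not an existing ShardingSphere class: maps the MySQL thread ID
// issued during the handshake to the JDBC Statement currently running for that session.
public final class SessionThreadIdRegistry {
    
    private static final Map<Integer, Statement> RUNNING_STATEMENTS = new ConcurrentHashMap<>();
    
    // Called after authentication succeeds: bind the handshake thread ID to the session's statement.
    public static void bind(final int mysqlThreadId, final Statement statement) {
        RUNNING_STATEMENTS.put(mysqlThreadId, statement);
    }
    
    // Called by KILL QUERY handling: cancel the statement bound to the given thread ID, if any.
    public static void cancel(final int mysqlThreadId) throws SQLException {
        Statement statement = RUNNING_STATEMENTS.get(mysqlThreadId);
        if (null != statement) {
            statement.cancel();
        }
    }
    
    // Called when the statement finishes or the connection closes.
    public static void unbind(final int mysqlThreadId) {
        RUNNING_STATEMENTS.remove(mysqlThreadId);
    }
}
```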

Expected behavior

  • Statement.cancel() or KILL QUERY should terminate the running MySQL query.

Actual behavior

  • The query continues running even after cancellation is requested.

Environment

  • ShardingSphere version: 5.5.3-SNAPSHOT

  • OS: Windows 11

  • Database: MySQL

  • Mode: Cluster mode


Before committing this PR, I'm sure that I have checked the following options:

  • My code follows the code of conduct of this project.
  • I have self-reviewed the commit code.
  • I have added corresponding labels for the pull request (or requested them in a comment).
  • I have verified the changes locally with Maven (module-level and targeted builds).
  • I have made corresponding changes to the documentation.
  • I have added corresponding unit tests for my changes.
  • I have updated the Release Notes of the current development version. For more details, see Update Release Note

@MilanTyagi2004
Author

MilanTyagi2004 commented Dec 23, 2025

@terrymanu Sorry, the previous PR was accidentally closed.
Please approve the workflow and review this when you have time.
Thank you for your understanding.

@MilanTyagi2004
Author

@terrymanu Please review this.
Core changes are in place; a small part of the implementation is still left and will be added next.

Member

@terrymanu left a comment

The current PR has blocking issues and shouldn’t be merged:

  • Root cause not fixed: The handshake exposes a 32-bit connection ID while Process stores a 64-bit one. On MySQL, ProcessRegistry matches the full 64-bit ID and MySQLKillProcessExecutor relies on node bits, so the client’s 32-bit ID never matches and may be treated as
    “remote,” causing cancel/KILL to fail. On PostgreSQL, BackendKeyData only carries a 32-bit processId, but Process stores only a 64-bit connectionId and never sets postgresqlProcessId; CancelRequest looks up by 32-bit, so in a cluster it can’t find the target.
  • Cluster forwarding gap: KillProcessHandler parses IDs with Integer.parseInt; 64-bit IDs with node bits are dropped and never forwarded to the right node.
  • Persistence gap: YamlProcess/swapper only store mysqlThreadId/postgresqlProcessId and omit the 64-bit connection ID and secretKey, so cross-node cancel/kill lacks critical data.
  • Change surface too large: 71 files touched, with many unrelated format/alignment tweaks (e.g., Firebird/OpenGauss frontend), which inflates the diff and review cost.

Suggested direction: (1) Align the handshake identifier with what Process stores—either make the handshake ID reconstruct node bits or map the 32-bit handshake ID back to the 64-bit stored value. (2) PostgreSQL should store both 32-bit postgresqlProcessId and 64-bit
postgresqlConnectionId; Cancel should match on 32-bit, and both the 64-bit ID and secretKey should be persisted. (3) KillProcessHandler must accept and forward 64-bit connection IDs. (4) Extend YamlProcess/swapper to include 64-bit connection IDs and secretKey. (5) Trim
this PR to only the necessary changes, remove unrelated formatting tweaks, and add tests covering handshake → process bind → cancel/KILL for local and cross-node paths (32/64-bit, secretKey mismatch).

In its current form, the PR should not be merged.
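
For illustration, a minimal sketch of suggestion (1) — mapping the 32-bit handshake ID back to the stored 64-bit value — with hypothetical names; ConnectionIdMapper is not existing ShardingSphere code, and the real node-bit layout may allow reconstructing the ID directly instead of keeping a map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mapper: records which 64-bit process connection ID a 32-bit handshake ID
// was derived from, so KILL/cancel can resolve the value that ProcessRegistry stores.
public final class ConnectionIdMapper {
    
    private static final Map<Integer, Long> HANDSHAKE_TO_PROCESS_ID = new ConcurrentHashMap<>();
    
    // Called when the handshake ID is handed to the client.
    public static void register(final int handshakeId, final long processConnectionId) {
        HANDSHAKE_TO_PROCESS_ID.put(handshakeId, processConnectionId);
    }
    
    // Called by KILL/cancel handling: resolve the client-supplied 32-bit ID, or null if unknown.
    public static Long resolve(final int handshakeId) {
        return HANDSHAKE_TO_PROCESS_ID.get(handshakeId);
    }
    
    // Called when the connection closes.
    public static void unregister(final int handshakeId) {
        HANDSHAKE_TO_PROCESS_ID.remove(handshakeId);
    }
}
```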

@MilanTyagi2004
Author

Thank you for the review; I will certainly look into this.

@MilanTyagi2004
Author

Hi @terrymanu, I have a doubt: the connection ID is passed through the Firebird/OpenGauss frontends, and the functions there take an int, while we need the 64-bit ID. Narrowing it with an (int) cast would touch many files.
Any guidance or suggestions?

@ShenFeng312

Currently, with an auto-increment ID scheme, there might be a scenario where some clients use connection pools while others connect directly. Suppose the auto-increment ID overflows but some original connections haven’t been released yet. In that case, duplicate IDs could occur.
I think we can manage this by maintaining a bitmap based on the maximum number of connections. This would allow us to quickly find the next available ID. Do you agree? @terrymanu @MilanTyagi2004

@ShenFeng312

I will open a new issue to discuss this problem. In the meantime, I think we should pause any further work on this PR.

@MilanTyagi2004
Author

Please give me some time; I am working on this issue.

@MilanTyagi2004
Author

MilanTyagi2004 commented Dec 24, 2025

Thanks for pointing this out. I agree this scenario can lead to duplicate IDs when auto-increment overflows and some connections are still active (especially with mixed pooled and direct connections).

To handle this safely, I used a BitSet to track allocated connection IDs up to the maximum allowed connections. On allocation, we pick the next clear bit, and on release, we clear it. This ensures an ID is reused only after the original connection is fully released, even after overflow.

This avoids collisions, works well with connection pools, and keeps the behavior deterministic.
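
A minimal sketch of that approach, assuming a hypothetical ConnectionIdAllocator class and a fixed maximum connection count:

```java
import java.util.BitSet;

// Hypothetical allocator: reuses a connection ID only after it has been explicitly released,
// avoiding duplicates when an auto-increment counter overflows while old connections are alive.
public final class ConnectionIdAllocator {
    
    private final BitSet allocated;
    
    private final int maxConnections;
    
    public ConnectionIdAllocator(final int maxConnections) {
        this.maxConnections = maxConnections;
        allocated = new BitSet(maxConnections);
    }
    
    // Returns the next free ID (starting from 1), or throws if all IDs are in use.
    public synchronized int allocate() {
        int id = allocated.nextClearBit(1);
        if (id > maxConnections) {
            throw new IllegalStateException("No free connection ID available");
        }
        allocated.set(id);
        return id;
    }
    
    // Marks an ID as free so it can be handed out again.
    public synchronized void release(final int id) {
        allocated.clear(id);
    }
}
```

nextClearBit finds the next available ID quickly, and the synchronized allocate/release pair keeps the bitmap consistent when connections open and close concurrently.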

@MilanTyagi2004
Author

@terrymanu What are your thoughts on this?

@terrymanu
Member

terrymanu commented Dec 24, 2025

We’ve overcomplicated this. The real goal is just to make Kill/Cancel hit the running process. The chatter exploded because the scope ballooned: we mixed in 64-bit global IDs, cross-node forwarding, persistence, and even Firebird/OpenGauss while the client protocol still exposes 32-bit IDs, and we never nailed a clear mapping. On top of that, we touched 70+ files before agreeing on the design, so formatting/compat issues drowned out the core fix.

Let’s simplify and get back on target:

  • Pick one rule and make “handshake ID == lookup ID” work end to end. Either stay with 32-bit and align MySQL/PostgreSQL handshake, ProcessRegistry, and Kill/Cancel, or if we keep 64-bit, define an exact mapping so the 32-bit client ID can resolve to the stored 64-bit.
  • Fix the main MySQL/PostgreSQL path only: handshake → process bind → registry lookup → Kill/Cancel. Don’t fan out to other protocols yet.
  • Pause unrelated formatting/refactors and shrink the diff.
  • Add a few focused tests: local/(if needed) cross-node, secret key checks, and the ID mapping.

First make the core path reliable; discuss global IDs/bitset allocation separately afterward.

@MilanTyagi2004
Author

MilanTyagi2004 commented Dec 24, 2025

Thanks for the clarification. I agree the scope grew too large and diluted the core fix.

I’ll reset this PR to focus only on the essential path:
handshake → process bind → registry lookup → Kill/Cancel, limited to MySQL and PostgreSQL.

Plan:

  1. Pick a single rule so that handshake ID == lookup ID end to end. I'll stay with 32-bit IDs for now and align the
     handshake, ProcessRegistry, and Kill/Cancel accordingly.

  2. Remove all unrelated changes (Firebird/OpenGauss, formatting, refactors) and shrink the diff.

  3. Add focused tests covering (a rough sketch follows at the end of this comment):

  • handshake → bind → cancel/KILL (local)

  • PostgreSQL secretKey validation

  4. Defer 64-bit global IDs, cross-node forwarding, persistence, and allocation/bitset discussions to a separate issue.

I’ll update the PR with a minimal, clean diff following this approach.
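
For illustration, a minimal Mockito-style sketch of the cancel check against the hypothetical SessionThreadIdRegistry sketched earlier; a real test should drive the actual handshake → execute → KILL path rather than binding manually:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import java.sql.Statement;
import org.junit.jupiter.api.Test;

// Hypothetical test: checks only that a bound statement is cancelled; it does not replace
// an end-to-end test over the real handshake, execution, and KILL/cancel flow.
class SessionThreadIdRegistryTest {
    
    @Test
    void assertCancelBoundStatement() throws Exception {
        Statement statement = mock(Statement.class);
        SessionThreadIdRegistry.bind(42, statement);
        SessionThreadIdRegistry.cancel(42);
        verify(statement).cancel();
        SessionThreadIdRegistry.unbind(42);
    }
    
    @Test
    void assertCancelUnknownThreadIdDoesNothing() throws Exception {
        // No statement bound to thread ID 99; cancel should be a silent no-op.
        SessionThreadIdRegistry.cancel(99);
    }
}
```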

@MilanTyagi2004
Author

I’ve reduced the PR to only the process ID binding during authentication, removing all unrelated changes to keep the scope minimal and focused.

@MilanTyagi2004
Author

This change fixes the MySQL handshake ID mismatch on the executing node, which was the root cause of KILL / Statement.cancel() failing in cluster mode. Cross-node routing is intentionally not addressed here and can be discussed separately.

@MilanTyagi2004
Author

@terrymanu Please review the PR so I can work on the cross-node routing.

Member

@terrymanu left a comment

Thanks for the effort—good direction, but the current patch doesn’t solve the root cause and introduces new issues. Please refine and resubmit:

  • Design/abstraction issue: Process is a cross-protocol infra class. Adding mysqlThreadId and MySQL-specific lookup into Process/ProcessRegistry bakes dialect details into shared code. Please revert these changes and handle MySQL handshake thread id binding at the MySQL protocol/session layer, or introduce a protocol-agnostic “native connection identifier” abstraction instead.
  • Root cause still unsolved: The handshake thread id remains node-local and not cluster-unique; processId is still not aligned with the handshake id. During execution, ProcessEngine.executeSQL/completeSQLExecution recreates Process with the same processId, dropping the thread id you set at handshake time, so KILL can still fail to find the running process. Please first ensure the handshake id is cluster-unique (or can route to the right node) and keep it associated with the processId across the full lifecycle before wiring cancel/kill.
  • Newly introduced risks: The numeric thread-id scan in ProcessRegistry.get may kill the wrong session if thread ids collide across nodes; the new test sets threadId manually and doesn’t cover the real handshake→execute→kill flow, and it leaves the singleton ProcessRegistry state dirty. Please fix these or roll them back.
  • Unrelated/incorrect changes to revert: Remove the MySQL-specific fields and lookup logic from the infra Process/ProcessRegistry layer; redo the binding and routing in the MySQL protocol layer.

I encourage you to continue: make the handshake connection/thread id cluster-unique, keep it consistently mapped to the processId through the execution lifecycle, and add an end-to-end test covering handshake → execute → KILL/cancel. Looking forward to the updated version.
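
For illustration, one possible shape of the protocol-agnostic "native connection identifier" abstraction suggested above; the names below are hypothetical and do not exist in ShardingSphere today:

```java
// Hypothetical abstraction: each database protocol exposes its own native connection
// identifier (MySQL thread ID, PostgreSQL backend process ID, ...) without leaking
// dialect-specific fields into the shared Process/ProcessRegistry infrastructure.
public interface NativeConnectionIdentifier {
    
    // The database type this identifier belongs to, for example "MySQL" or "PostgreSQL".
    String getDatabaseType();
    
    // The identifier the client protocol actually exposes (32-bit in both MySQL and PostgreSQL).
    int getNativeId();
}

// Example dialect-specific implementation carrying the MySQL handshake thread ID.
final class MySQLThreadIdentifier implements NativeConnectionIdentifier {
    
    private final int threadId;
    
    MySQLThreadIdentifier(final int threadId) {
        this.threadId = threadId;
    }
    
    @Override
    public String getDatabaseType() {
        return "MySQL";
    }
    
    @Override
    public int getNativeId() {
        return threadId;
    }
}
```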

Development

Successfully merging this pull request may close these issues.

Statement.cancel() in MySQL JDBC fails in a cluster environment.
