
Conversation

@MilanTyagi2004

Fixes #37404

Changes proposed in this pull request:

  • In Apache ShardingSphere Proxy using the MySQL protocol, cancelling a running SQL statement does not correctly interrupt
    the underlying JDBC Statement in a cluster environment.

  • Although a Process instance is created for the execution, the MySQL thread ID is not bound to the connection session. As a
    result, KILL QUERY / Statement.cancel() cannot locate or terminate the actual running MySQL query.

  • This pull request binds the MySQL thread ID to the connection session after authentication, ensuring that cancellation signals
    are correctly propagated to the JDBC layer.
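
For illustration only, a minimal sketch of the binding idea; the names below (such as SessionThreadIdRegistry) are hypothetical and are not the actual classes touched by this PR:

```java
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry, not an existing ShardingSphere class: maps the MySQL thread ID
// issued during the handshake to the JDBC Statement currently running for that session.
public final class SessionThreadIdRegistry {
    
    private static final Map<Integer, Statement> RUNNING_STATEMENTS = new ConcurrentHashMap<>();
    
    // Called after authentication succeeds: bind the handshake thread ID to the session's statement.
    public static void bind(final int mysqlThreadId, final Statement statement) {
        RUNNING_STATEMENTS.put(mysqlThreadId, statement);
    }
    
    // Called by KILL QUERY handling: cancel the statement bound to the given thread ID, if any.
    public static void cancel(final int mysqlThreadId) throws SQLException {
        Statement statement = RUNNING_STATEMENTS.get(mysqlThreadId);
        if (null != statement) {
            statement.cancel();
        }
    }
    
    // Called when the statement finishes or the connection closes.
    public static void unbind(final int mysqlThreadId) {
        RUNNING_STATEMENTS.remove(mysqlThreadId);
    }
}
```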

Expected behavior

  • Statement.cancel() or KILL QUERY should terminate the running MySQL query.

Actual behavior

  • The query continues running even after cancellation is requested.

Environment

  • ShardingSphere version: 5.5.3-SNAPSHOT

  • OS: Windows 11

  • Database: MySQL

  • Mode: Cluster mode


Before committing this PR, I'm sure that I have checked the following options:

  • My code follows the code of conduct of this project.
  • I have self-reviewed the commit code.
  • I have added corresponding labels for the pull request (or requested them in a comment).
  • I have verified the changes locally with Maven (module-level and targeted builds).
  • I have made corresponding changes to the documentation.
  • I have added corresponding unit tests for my changes.
  • I have updated the Release Notes of the current development version. For more details, see Update Release Note

@MilanTyagi2004
Author

MilanTyagi2004 commented Dec 23, 2025

@terrymanu Sorry, the previous PR was accidentally closed.
Please approve the workflow and review this when you have time.
Thank you for your understanding.

@MilanTyagi2004
Author

@terrymanu Please review this.
Core changes are in place; a small part of the implementation is still left and will be added next.

Member

@terrymanu left a comment

The current PR has blocking issues and shouldn’t be merged:

  • Root cause not fixed: The handshake exposes a 32-bit connection ID while Process stores a 64-bit one. On MySQL, ProcessRegistry matches the full 64-bit ID and MySQLKillProcessExecutor relies on node bits, so the client’s 32-bit ID never matches and may be treated as
    “remote,” causing cancel/KILL to fail. On PostgreSQL, BackendKeyData only carries a 32-bit processId, but Process stores only a 64-bit connectionId and never sets postgresqlProcessId; CancelRequest looks up by 32-bit, so in a cluster it can’t find the target.
  • Cluster forwarding gap: KillProcessHandler parses IDs with Integer.parseInt; 64-bit IDs with node bits are dropped and never forwarded to the right node.
  • Persistence gap: YamlProcess/swapper only store mysqlThreadId/postgresqlProcessId and omit the 64-bit connection ID and secretKey, so cross-node cancel/kill lacks critical data.
  • Change surface too large: 71 files touched, with many unrelated format/alignment tweaks (e.g., Firebird/OpenGauss frontend), which inflates the diff and review cost.

Suggested direction: (1) Align the handshake identifier with what Process stores—either make the handshake ID reconstruct node bits or map the 32-bit handshake ID back to the 64-bit stored value. (2) PostgreSQL should store both 32-bit postgresqlProcessId and 64-bit
postgresqlConnectionId; Cancel should match on 32-bit, and both the 64-bit ID and secretKey should be persisted. (3) KillProcessHandler must accept and forward 64-bit connection IDs. (4) Extend YamlProcess/swapper to include 64-bit connection IDs and secretKey. (5) Trim
this PR to only the necessary changes, remove unrelated formatting tweaks, and add tests covering handshake → process bind → cancel/KILL for local and cross-node paths (32/64-bit, secretKey mismatch).

In its current form, the PR should not be merged.
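
For illustration, a minimal sketch of suggestion (1) — mapping the 32-bit handshake ID back to the stored 64-bit value — with hypothetical names; ConnectionIdMapper is not existing ShardingSphere code, and the real node-bit layout may allow reconstructing the ID directly instead of keeping a map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mapper: records which 64-bit process connection ID a 32-bit handshake ID
// was derived from, so KILL/cancel can resolve the value that ProcessRegistry stores.
public final class ConnectionIdMapper {
    
    private static final Map<Integer, Long> HANDSHAKE_TO_PROCESS_ID = new ConcurrentHashMap<>();
    
    // Called when the handshake ID is handed to the client.
    public static void register(final int handshakeId, final long processConnectionId) {
        HANDSHAKE_TO_PROCESS_ID.put(handshakeId, processConnectionId);
    }
    
    // Called by KILL/cancel handling: resolve the client-supplied 32-bit ID, or null if unknown.
    public static Long resolve(final int handshakeId) {
        return HANDSHAKE_TO_PROCESS_ID.get(handshakeId);
    }
    
    // Called when the connection closes.
    public static void unregister(final int handshakeId) {
        HANDSHAKE_TO_PROCESS_ID.remove(handshakeId);
    }
}
```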

@MilanTyagi2004
Author

Thank you for the review; I will certainly look into this.

@MilanTyagi2004
Author

Hi @terrymanu, I have a doubt: the connection ID is passed through the Firebird/OpenGauss frontends, and the functions there take an int, while we need the 64-bit ID. Narrowing it with an (int) cast would touch many files.
Any guidance or suggestions?

@ShenFeng312

Currently, with an auto-increment ID scheme, there might be a scenario where some clients use connection pools while others connect directly. Suppose the auto-increment ID overflows but some original connections haven’t been released yet. In that case, duplicate IDs could occur.
I think we can manage this by maintaining a bitmap based on the maximum number of connections. This would allow us to quickly find the next available ID. Do you agree? @terrymanu @MilanTyagi2004

@ShenFeng312

I will open a new issue to discuss this problem. In the meantime, I think we should pause any further work on this PR.

@MilanTyagi2004
Author

Please give me some time; I am working on this issue.

@MilanTyagi2004
Author

MilanTyagi2004 commented Dec 24, 2025

Thanks for pointing this out. I agree this scenario can lead to duplicate IDs when auto-increment overflows and some connections are still active (especially with mixed pooled and direct connections).

To handle this safely, I used a BitSet to track allocated connection IDs up to the maximum allowed connections. On allocation, we pick the next clear bit, and on release, we clear it. This ensures an ID is reused only after the original connection is fully released, even after overflow.

This avoids collisions, works well with connection pools, and keeps the behavior deterministic.
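
A minimal sketch of that approach, assuming a hypothetical ConnectionIdAllocator class and a fixed maximum connection count:

```java
import java.util.BitSet;

// Hypothetical allocator: reuses a connection ID only after it has been explicitly released,
// avoiding duplicates when an auto-increment counter overflows while old connections are alive.
public final class ConnectionIdAllocator {
    
    private final BitSet allocated;
    
    private final int maxConnections;
    
    public ConnectionIdAllocator(final int maxConnections) {
        this.maxConnections = maxConnections;
        allocated = new BitSet(maxConnections);
    }
    
    // Returns the next free ID (starting from 1), or throws if all IDs are in use.
    public synchronized int allocate() {
        int id = allocated.nextClearBit(1);
        if (id > maxConnections) {
            throw new IllegalStateException("No free connection ID available");
        }
        allocated.set(id);
        return id;
    }
    
    // Marks an ID as free so it can be handed out again.
    public synchronized void release(final int id) {
        allocated.clear(id);
    }
}
```

nextClearBit finds the next available ID quickly, and the synchronized allocate/release pair keeps the bitmap consistent when connections open and close concurrently.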

@MilanTyagi2004
Author

@terrymanu What are your thoughts on this?

@terrymanu
Member

terrymanu commented Dec 24, 2025

We’ve overcomplicated this. The real goal is just to make Kill/Cancel hit the running process. The chatter exploded because the scope ballooned: we mixed in 64-bit global IDs, cross-node forwarding, persistence, and even Firebird/OpenGauss while the client protocol still exposes 32-bit IDs, and we never nailed a clear mapping. On top of that, we touched 70+ files before agreeing on the design, so formatting/compat issues drowned out the core fix.

Let’s simplify and get back on target:

  • Pick one rule and make “handshake ID == lookup ID” work end to end. Either stay with 32-bit and align MySQL/PostgreSQL handshake, ProcessRegistry, and Kill/Cancel, or if we keep 64-bit, define an exact mapping so the 32-bit client ID can resolve to the stored 64-bit.
  • Fix the main MySQL/PostgreSQL path only: handshake → process bind → registry lookup → Kill/Cancel. Don’t fan out to other protocols yet.
  • Pause unrelated formatting/refactors and shrink the diff.
  • Add a few focused tests: local/(if needed) cross-node, secret key checks, and the ID mapping.

First make the core path reliable; discuss global IDs/bitset allocation separately afterward.

@MilanTyagi2004
Author

MilanTyagi2004 commented Dec 24, 2025

Thanks for the clarification. I agree the scope grew too large and diluted the core fix.

I’ll reset this PR to focus only on the essential path:
handshake → process bind → registry lookup → Kill/Cancel, limited to MySQL and PostgreSQL.

Plan:

  1. Pick a single rule so that handshake ID == lookup ID end to end. I'll stay with 32-bit IDs for now and align the
     handshake, ProcessRegistry, and Kill/Cancel accordingly.

  2. Remove all unrelated changes (Firebird/OpenGauss, formatting, refactors) and shrink the diff.

  3. Add focused tests covering (a rough sketch follows at the end of this comment):

  • handshake → bind → cancel/KILL (local)

  • PostgreSQL secretKey validation

  4. Defer 64-bit global IDs, cross-node forwarding, persistence, and allocation/bitset discussions to a separate issue.

I’ll update the PR with a minimal, clean diff following this approach.
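
For illustration, a minimal Mockito-style sketch of the cancel check against the hypothetical SessionThreadIdRegistry sketched earlier; a real test should drive the actual handshake → execute → KILL path rather than binding manually:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import java.sql.Statement;
import org.junit.jupiter.api.Test;

// Hypothetical test: checks only that a bound statement is cancelled; it does not replace
// an end-to-end test over the real handshake, execution, and KILL/cancel flow.
class SessionThreadIdRegistryTest {
    
    @Test
    void assertCancelBoundStatement() throws Exception {
        Statement statement = mock(Statement.class);
        SessionThreadIdRegistry.bind(42, statement);
        SessionThreadIdRegistry.cancel(42);
        verify(statement).cancel();
        SessionThreadIdRegistry.unbind(42);
    }
    
    @Test
    void assertCancelUnknownThreadIdDoesNothing() throws Exception {
        // No statement bound to thread ID 99; cancel should be a silent no-op.
        SessionThreadIdRegistry.cancel(99);
    }
}
```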

@MilanTyagi2004
Author

I’ve reduced the PR to only the process ID binding during authentication, removing all unrelated changes to keep the scope minimal and focused.

@MilanTyagi2004
Author

This change fixes the MySQL handshake ID mismatch on the executing node, which was the root cause of KILL / Statement.cancel() failing in cluster mode. Cross-node routing is intentionally not addressed here and can be discussed separately.

@MilanTyagi2004
Author

@terrymanu Please review the PR so I can work on the cross-node routing.

Member

@terrymanu left a comment

Thanks for the effort—good direction, but the current patch doesn’t solve the root cause and introduces new issues. Please refine and resubmit:

  • Design/abstraction issue: Process is a cross-protocol infra class. Adding mysqlThreadId and MySQL-specific lookup into Process/ProcessRegistry bakes dialect details into shared code. Please revert these changes and handle MySQL handshake thread id binding at the MySQL protocol/session layer, or introduce a protocol-agnostic “native connection identifier” abstraction instead.
  • Root cause still unsolved: The handshake thread id remains node-local and not cluster-unique; processId is still not aligned with the handshake id. During execution, ProcessEngine.executeSQL/completeSQLExecution recreates Process with the same processId, dropping the thread id you set at handshake time, so KILL can still fail to find the running process. Please first ensure the handshake id is cluster-unique (or can route to the right node) and keep it associated with the processId across the full lifecycle before wiring cancel/kill.
  • Newly introduced risks: The numeric thread-id scan in ProcessRegistry.get may kill the wrong session if thread ids collide across nodes; the new test sets threadId manually and doesn’t cover the real handshake→execute→kill flow, and it leaves the singleton ProcessRegistry state dirty. Please fix these or roll them back.
  • Unrelated/incorrect changes to revert: Remove the MySQL-specific fields and lookup logic from the infra Process/ProcessRegistry layer; redo the binding and routing in the MySQL protocol layer.

I encourage you to continue: make the handshake connection/thread id cluster-unique, keep it consistently mapped to the processId through the execution lifecycle, and add an end-to-end test covering handshake → execute → KILL/cancel. Looking forward to the updated version.
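
For illustration, one possible shape of the protocol-agnostic "native connection identifier" abstraction suggested above; the names below are hypothetical and do not exist in ShardingSphere today:

```java
// Hypothetical abstraction: each database protocol exposes its own native connection
// identifier (MySQL thread ID, PostgreSQL backend process ID, ...) without leaking
// dialect-specific fields into the shared Process/ProcessRegistry infrastructure.
public interface NativeConnectionIdentifier {
    
    // The database type this identifier belongs to, for example "MySQL" or "PostgreSQL".
    String getDatabaseType();
    
    // The identifier the client protocol actually exposes (32-bit in both MySQL and PostgreSQL).
    int getNativeId();
}

// Example dialect-specific implementation carrying the MySQL handshake thread ID.
final class MySQLThreadIdentifier implements NativeConnectionIdentifier {
    
    private final int threadId;
    
    MySQLThreadIdentifier(final int threadId) {
        this.threadId = threadId;
    }
    
    @Override
    public String getDatabaseType() {
        return "MySQL";
    }
    
    @Override
    public int getNativeId() {
        return threadId;
    }
}
```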

Development

Successfully merging this pull request may close these issues.

Statement.cancel() in MySQL JDBC fails in a cluster environment.
