Skip to content

Next-gen columnar: harden RegionError/LockedError handling in StorageDisaggregatedColumnar #10847

@JaySon-Huang

Description

@JaySon-Huang

Background

Follow-up from code review on the columnar read path introduced for #10844 (Support new columnar storage as data source).

In RNProxyReader::createProxyReader (dbms/src/Storages/StorageDisaggregatedColumnar.cpp), the ColumnarReaderErrorType::RegionError and LockedError branches have several robustness issues that can mis-classify errors or lose diagnostics.

Original review comment: #10842 (comment)

Problems

1. Unchecked ParseFromString

region_error.ParseFromString(error_msg) (and lock_info.ParseFromString(error_msg)) return values are ignored. If the proxy returns a malformed payload, we proceed with a default-constructed protobuf, fall into the else branch of the RegionError handler, and may throw RegionException with NOT_FOUND and no useful diagnostic.

Suggested fix: Check the return value; on failure, log payload size (and optionally a short debug hint) and throw Exception with ErrorCodes::COLUMNAR_SNAPSHOT_ERROR (consistent with other columnar reader error paths that backoff-retry).

2. Lossy region_id_ver on epoch_not_match

In the epoch_not_match branch, region_id_ver is overwritten on every iteration of current_regions(), so RegionException's extra message only reflects the last region even when unavailable_regions contains many.

Additionally, the string currently uses the outer request's region_ver instead of each region's region.region_epoch().version().

Suggested fix: Build a joined diagnostic string (e.g. via FmtBuffer) listing all affected regions as id:version:conf_ver, using each region's epoch from current_regions(). Ensure the string outlives the RegionException constructor (store in a local String before throw).

3. Dead retry_regions variable

retry_regions is populated in both the epoch_not_match and region_not_found sub-branches but never read. This looks like an incomplete port from StorageDisaggregatedRemote.cpp, where retry_regions is passed to dropRegionCache.

Suggested fix (minimal): Remove the dead variable if dropping only region_ver_id (the stale request epoch) is intentional.

Optional follow-up: If split/merge scenarios require invalidating cache for all regions in current_regions(), wire explicit dropRegion calls instead of keeping an unused set (align with pingcap region-cache semantics).

4. (Optional) Tighter classification in the RegionError else branch

When parsing succeeds but the error is neither epoch_not_match nor region_not_found, we still throw RegionException with NOT_FOUND and may leave unavailable_regions empty. Consider mapping known errorpb::Error variants (as in LearnerReadWorker) or using OTHER / COLUMNAR_SNAPSHOT_ERROR for unknown cases, with ShortDebugString() in logs and extra_msg.

Location

  • File: dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • Function: RNProxyReader::createProxyReader (~lines 461–537)

Acceptance criteria

  • ParseFromString failures are handled explicitly for both RegionError and LockedError payloads
  • epoch_not_match exceptions include complete, correct epoch info for all regions in current_regions()
  • No dead retry_regions (either removed or used for cache invalidation)
  • (Optional) Unknown region errors are not mis-reported as NOT_FOUND

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions