[Phase 1][colocation] Check failed: Table not found in Raft group #10100

iswarezwp · 2021-09-24T02:12:25Z

Jira Link: [DB-371](https://yugabyte.atlassian.net/browse/DB-371)
For colocated database, the tablet leader may accidentally send the following update request to the follower which lost some data， the follower currently did not handle add_table request，cause a Check failed error:

ops { 
    id { term: 5 index: 2 } 
    hybrid_time: 6685864924336226306 
    op_type: NO_OP committed_op_id { term: 1 index: 1  } noop_request {}
} 
ops { 
    id { term: 5 index: 3 } 
    hybrid_time: 6685864989234249729 
    op_type: CHANGE_METADATA_OP 
    change_metadata_request { 
        tablet_id: "8d800a40ecc2415ba44274530bafdc13" 
        add_table { 
            table_id: "00005ac4000030008000000000005ac6" 
            table_name: "stack" 
            table_type: PGSQL_TABLE_TYPE 
            schema { 
                columns { id: 0 name: "stack_id" type { main: INT64 } is_key: true is_nullable: false is_static: false is_counter: false sorting_type: 3 order: 1 } 
                columns { id: 1 name: "stack_name" type { main: STRING } is_key: false is_nullable: true is_static: false is_counter: false sorting_type: 0 order: 2 } 
                columns { id: 2 name: "stack_version" type { main: STRING } is_key: false is_nullable: true is_static: false is_counter: false sorting_type: 0 order: 3 } 
                table_properties { contain_counters: false is_transactional: true consistency_level: STRONG use_mangled_column_name: false num_tablets: 1 is_ysql_catalog_table: false retain_delete_markers: false } 
                colocated_table_id {}
            } 
            schema_version: 0 
            partition_schema {}
        }
    } 
    committed_op_id { term: 5 index: 2} 
    monotonic_counter: 0
} 
ops { 
    id { term: 5 index: 4 } 
    hybrid_time: 6685864990254014465 
    op_type: CHANGE_METADATA_OP 
    change_metadata_request { 
        tablet_id: "8d800a40ecc2415ba44274530bafdc13" 
        schema { 
            columns { id: 0 name: "stack_id" type { main: INT64 } is_key: true is_nullable: false is_static: false is_counter: false sorting_type: 3 order: 1 } 
            columns { id: 1 name: "stack_name" type { main: STRING } is_key: false is_nullable: true is_static: false is_counter: false sorting_type: 0 order: 2 } 
            columns { id: 2 name: "stack_version" type { main: STRING } is_key: false is_nullable: true is_static: false is_counter: false sorting_type: 0 order: 3 } 
            table_properties { contain_counters: false is_transactional: true consistency_level: STRONG use_mangled_column_name: false num_tablets: 1 is_ysql_catalog_table: false retain_delete_markers: false} 
            colocated_table_id {}
        } 
        schema_version: 1 
        new_table_name: "stack" 
        dest_uuid: "aa532760616c497e8340744e21aa44fe" 
        propagated_hybrid_time: 6685864990254014464 
        indexes { table_id: "00005ac4000030008000000000005acb" is_unique: true indexed_table_id: "00005ac4000030008000000000005ac6"} alter_table_id: "00005ac4000030008000000000005ac6"
    } 
    committed_op_id { term: 5 index: 3 } 
    monotonic_counter: 0
}

this is the stack:

@     0x7fab40f7111e  (unknown)
@     0x7fab418ef592  __GI_pread64
@     0x7fab46227097  google::OpenObjectFileContainingPcAndGetStartAddress()
@     0x7fab462279d1  google::SymbolizeAndDemangle()
@     0x7fab462285a2  google::DumpStackTrace()
@     0x7fab46ff114b  yb::GetStackTrace()
@     0x7fab4704408a  yb::LogFatalHandlerSink::send()
@     0x7fab4621dcac  google::LogMessage::SendToLog()
@     0x7fab4621ae38  google::LogMessage::Flush()
@     0x7fab4621e527  google::LogMessageFatal::~LogMessageFatal()
@     0x7fab4fc7c2bd  yb::tablet::Tablet::CreatePreparedChangeMetadata()
@     0x7fab4fd0b68e  yb::tablet::ChangeMetadataOperation::Prepare()
@     0x7fab4fd0ffcd  yb::tablet::OperationDriver::PrepareAndStart()
@     0x7fab4fd10c8d  yb::tablet::OperationDriver::PrepareAndStartTask()
@     0x7fab4fd42af4  yb::tablet::PreparerImpl::Submit()
@     0x7fab4fd42cc2  yb::tablet::Preparer::Submit()

The text was updated successfully, but these errors were encountered:

ddorian · 2021-09-29T15:59:14Z

Hi @iswarezwp

What version are you using ?

iswarezwp · 2021-09-30T03:26:13Z

I think maybe the problem is that, RaftConsensus::UpdateReplica batch prepare then batch write the ops in one Update request:

Result<RaftConsensus::UpdateReplicaResult> RaftConsensus::UpdateReplica(
    ConsensusRequestPB* request, ConsensusResponsePB* response) {

   ......
   // 2 - Enqueue the prepares
  if (!VERIFY_RESULT(EnqueuePreparesUnlocked(*request, &deduped_req, response))) {
    return UpdateReplicaResult();
  }

  if (deduped_req.committed_op_id.index < prev_committed_op_id.index) {
    deduped_req.committed_op_id = prev_committed_op_id;
  }

  // 3 - Enqueue the writes.
  auto last_from_leader = EnqueueWritesUnlocked(
      deduped_req, WriteEmpty(prev_committed_op_id != deduped_req.committed_op_id));

  ......

In the prepare phase, the add_table ops is not yet persistent, so the next ops with a alter_table_id can not be found.

@ddorian This problem is found in a Yugabyte 2.4.5 environment, but maybe all version is affected.

iswarezwp · 2021-09-30T03:56:56Z

logs may help:

I0923 17:42:28.195777 2728 raft_consensus.cc:1578] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Refusing update from remote peer aa532760616c497e8340744e21aa44fe: Log matching property violated. Preceding OpId in replica: 1.1. Preceding OpId from leader: 9.463. (index mismatch)
I0923 17:42:28.195837 2728 raft_consensus.cc:1808] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Returning from UpdateConsensus because of error: error { code: PRECEDING_ENTRY_DIDNT_MATCH status { code: ILLEGAL_STATE message: "Log matching property violated. Preceding OpId in replica: 1.1. Preceding OpId from leader: 9.463. (index mismatch)" source_file: "../../../../../src/yb/consensus/raft_consensus.cc" source_line: 1576 errors: "\000" } }
I0923 17:42:28.215059 1947 raft_consensus.cc:3076] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Calling mark dirty synchronously for reason code NEW_LEADER_ELECTEDI0923 17:42:28.215078 1947 replica_state.cc:1170] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Reset our lease: 12265.164s
I0923 17:42:28.215092 1947 raft_consensus.cc:1901] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Early marking committed up to { term: 0 index: 0 }
I0923 17:42:28.215099 1947 replica_state.cc:860] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Already marked ops through { term: 1 index: 1 } as committed. Now trying to mark { term: 0 index: 0 } which would be a no-op.I0923 17:42:28.215116 1947 raft_consensus.cc:2541] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Starting consensus round: term: 5 index: 2
I0923 17:42:28.215127 1947 raft_consensus.cc:1447] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Starting operation: term: 5 index: 3
I0923 17:42:28.215242 1947 raft_consensus.cc:1447] T 8d800a40ecc2415ba44274530bafdc13 P 8e25de60adb84cd092c1bc309d75e08a [term 9 FOLLOWER]: Starting operation: term: 5 index: 4
F0923 17:42:28.215376 1947 tablet.h:496] Check failed: _s.ok() Bad status: Not found: Table <unknown_table_name> (00005ac4000030008000000000005ac6) not found in Raft group 8d800a40ecc2415ba44274530bafdc13

bmatican · 2021-10-04T21:43:48Z

@iswarezwp Is this related to #6096 ? the fix for that landed later in 2.6/2.7 versions.

If not, can you repro this in your version? Do you have a set of repro steps that you are using?

iswarezwp · 2021-10-08T01:57:16Z

@bmatican This problem happened while handle raft update request, not on local WAL replay.
The the table is not dropped, but seems only two nodes had written this table's add table metadata WAL log.
Sorry, I could not find how to repro this problem, maybe a network packet loss while replicating add table operation between tservers.

jaki · 2021-11-09T23:38:44Z

Got something similar when running TestPgRegressGin CREATE INDEX on TABLEGROUP: https://gist.github.com/jaki/d5d2bf15965329a20ac92084afe16c45/raw/f7f496a1c8dd850dcf87f0b1d67803ddb15b8c92/issue10100.TestPgRegressGin.log.

ts3|pid32001|:21228|http://127.240.233.250:19467 F1109 14:31:46.481175 32099 tablet.h:543] Check failed: _s.ok() Bad status: Not found (yb/tablet/tablet_metadata.cc:338): Table <unknown_table_name> (000030af000030008000000000004080) not found in Raft group 8a0c367f8afe45bfb572d248cf282129

This is centos-release-gcc with unrelated changes off recent master commit b472191.

arpang · 2022-09-01T09:02:24Z

@jaki I ran TestPgRegressGin 100 times (only yb_ybgin_misc because only it contains tablegroup-related tests), but could not reproduce this issue. Is it that we have updated the tests? I understand it has been quite some time and you may not recall. Just asking in case you do.

Huqicheng · 2022-12-14T05:15:13Z

Fixed by a5455f3

ddorian assigned bmatican Sep 29, 2021

ddorian added the area/docdb YugabyteDB core features label Sep 29, 2021

bmatican assigned m-iancu Nov 5, 2021

polarweasel mentioned this issue Nov 10, 2021

[Docs] Add warning on colocated tables #10551

Closed

m-iancu added this to YQL-beta Dec 7, 2021

tverona1 changed the title ~~Check failed: Table not found in Raft group~~ [Phase 1] [colocation] Check failed: Table not found in Raft group May 17, 2022

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 17, 2022

frozenspider changed the title ~~[Phase 1] [colocation] Check failed: Table not found in Raft group~~ [Phase 1][colocation] Check failed: Table not found in Raft group May 18, 2022

frozenspider mentioned this issue May 18, 2022

[YSQL] Database-level multi-tenancy with tablegroups #11665

Open

yugabyte-ci unassigned m-iancu Jun 9, 2022

tverona1 mentioned this issue Aug 23, 2022

Colocation GA: Validation - Stress testing #13735

Open

yugabyte-ci assigned arpang and unassigned bmatican Aug 25, 2022

yugabyte-ci unassigned arpang Sep 19, 2022

yifanguan mentioned this issue Dec 9, 2022

[YSQL] Tablegroup child table not found in parent table tablet #14873

Closed

Huqicheng closed this as completed Dec 14, 2022

Huqicheng moved this to Done in YQL-beta Dec 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Phase 1][colocation] Check failed: Table not found in Raft group #10100

[Phase 1][colocation] Check failed: Table not found in Raft group #10100

iswarezwp commented Sep 24, 2021 •

edited by jira bot

Loading

ddorian commented Sep 29, 2021

iswarezwp commented Sep 30, 2021 •

edited

Loading

iswarezwp commented Sep 30, 2021

bmatican commented Oct 4, 2021

iswarezwp commented Oct 8, 2021

jaki commented Nov 9, 2021

arpang commented Sep 1, 2022

Huqicheng commented Dec 14, 2022

[Phase 1][colocation] Check failed: Table not found in Raft group #10100

[Phase 1][colocation] Check failed: Table not found in Raft group #10100

Comments

iswarezwp commented Sep 24, 2021 • edited by jira bot Loading

ddorian commented Sep 29, 2021

iswarezwp commented Sep 30, 2021 • edited Loading

iswarezwp commented Sep 30, 2021

bmatican commented Oct 4, 2021

iswarezwp commented Oct 8, 2021

jaki commented Nov 9, 2021

arpang commented Sep 1, 2022

Huqicheng commented Dec 14, 2022

iswarezwp commented Sep 24, 2021 •

edited by jira bot

Loading

iswarezwp commented Sep 30, 2021 •

edited

Loading