@StekPerepolnen StekPerepolnen commented Apr 10, 2024

#3541

HC now handles 3 cases:

  • All static groups have at least degraded or full status (everything is fine): same behavior as before.
  • Some static groups have UNKNOWN or DISINTEGRATED status according to BSC:
    HC starts sending whiteboard requests to gather information on these specific groups.
  • There is no response from BSC within half of the timeout period:
    HC also begins sending whiteboard requests to gather information on static groups. A new STORAGE RED issue reports that BSC information was missing.
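
The three cases above can be sketched as a small decision function. This is an illustrative Python model only, not the C++ code from this PR: `plan_fallback`, `FallbackPlan`, and the status strings `FULL` and `DEGRADED` are hypothetical names chosen for the sketch, and the rule "no BSC answer by half of the timeout" is taken from the description as written.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional

# Statuses that trigger the whiteboard fallback (named in the PR description).
BAD_STATUSES = {"UNKNOWN", "DISINTEGRATED"}
# Message text copied from the sample reports in this PR.
BSC_MISSING_MESSAGE = "System tablet BSC didn't provide information"


@dataclass
class FallbackPlan:
    """Hypothetical result type: what HC should do next."""
    whiteboard_groups: List[int] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)
    wait_for_bsc: bool = False


def plan_fallback(
    static_groups: Iterable[int],
    bsc_statuses: Optional[Dict[int, str]],
    elapsed: float,
    timeout: float,
) -> FallbackPlan:
    """Pick one of the three cases; bsc_statuses is None while BSC is silent."""
    groups = sorted(static_groups)
    if bsc_statuses is None:
        if elapsed < timeout / 2:
            # Still inside the first half of the timeout: keep waiting.
            return FallbackPlan(wait_for_bsc=True)
        # Case 3: no BSC data, ask whiteboard about every static group
        # and raise the STORAGE RED issue.
        return FallbackPlan(whiteboard_groups=groups, issues=[BSC_MISSING_MESSAGE])
    # Case 2: ask whiteboard only about groups BSC could not vouch for.
    # A group missing from the BSC answer is treated as UNKNOWN (assumption).
    suspicious = [g for g in groups if bsc_statuses.get(g, "UNKNOWN") in BAD_STATUSES]
    # Case 1 falls out naturally: an empty list means nothing extra to do.
    return FallbackPlan(whiteboard_groups=suspicious)
```

For example, `plan_fallback([0, 1], {0: "FULL", 1: "UNKNOWN"}, 1.0, 10.0)` asks whiteboard only about group 1, while `plan_fallback([0], None, 5.0, 10.0)` asks about every static group and adds the BSC issue.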

The static group configuration is taken from NodeWarden; BSConfig from AppData is not used, since the configuration can change on the fly.

Testing

HC report for a specific database when there is no BSC

{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {
      "id": "RED-1c0c-be81",
      "status": "RED",
      "message": "Database has storage issues",
      "location": {
        "database": {
          "name": "/slice/db"
        }
      },
      "reason": [
        "RED-1c0c-53b5"
      ],
      "type": "DATABASE",
      "level": 1
    },
    {
      "id": "RED-1c0c-53b5",
      "status": "RED",
      "message": "System tablet BSC didn't provide information",
      "location": {
        "database": {
          "name": "/slice/db"
        }
      },
      "type": "STORAGE",
      "level": 2
    }
  ],
  "location": {
    "id": 5,
    "host": "man0-0028.ydb-dev.nemax.nebiuscloud.net",
    "port": 19001
  }
}

  • there is a RED issue about the lack of BSC information
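
To read a report like the one above, it can help to follow the `reason` links from the top-level issue down. The following is a minimal Python sketch, assuming only the `issue_log` fields visible in the sample (`id`, `status`, `type`, `message`, `reason`); the helper names `top_level_ids` and `render_tree` are made up for illustration and the `location` fields are omitted for brevity.

```python
from typing import Dict, List

# Condensed copy of the database report above (locations dropped).
DATABASE_REPORT: Dict = {
    "self_check_result": "EMERGENCY",
    "issue_log": [
        {"id": "RED-1c0c-be81", "status": "RED", "type": "DATABASE", "level": 1,
         "message": "Database has storage issues",
         "reason": ["RED-1c0c-53b5"]},
        {"id": "RED-1c0c-53b5", "status": "RED", "type": "STORAGE", "level": 2,
         "message": "System tablet BSC didn't provide information"},
    ],
}


def top_level_ids(report: Dict) -> List[str]:
    """Ids of issues that no other issue lists as a reason."""
    referenced = {r for issue in report["issue_log"] for r in issue.get("reason", [])}
    return [issue["id"] for issue in report["issue_log"] if issue["id"] not in referenced]


def render_tree(report: Dict) -> List[str]:
    """Indented lines, one per issue, following the reason links depth-first."""
    by_id = {issue["id"]: issue for issue in report["issue_log"]}
    lines: List[str] = []

    def visit(issue_id: str, depth: int) -> None:
        issue = by_id.get(issue_id)
        indent = "  " * depth
        if issue is None:
            # A reason id that is absent from issue_log is reported, not fatal.
            lines.append(f"{indent}{issue_id} (not in issue_log)")
            return
        lines.append(f"{indent}[{issue['status']}] {issue['type']}: {issue['message']}")
        for reason_id in issue.get("reason", []):
            visit(reason_id, depth + 1)

    for root_id in top_level_ids(report):
        visit(root_id, 0)
    return lines


print("\n".join(render_tree(DATABASE_REPORT)))
```

For this report the tree has two lines: the DATABASE issue, and under it the STORAGE issue about missing BSC information.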
HC root report when there is no BSC

{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {
      "id": "RED-27c3-70fb",
      "status": "RED",
      "message": "Database has multiple issues",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-27c3-4e47",
        "RED-27c3-53b5",
        "YELLOW-27c3-5321"
      ],
      "type": "DATABASE",
      "level": 1
    },
    {
      "id": "RED-27c3-4e47",
      "status": "RED",
      "message": "Compute has issues with system tablets",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-27c3-c138-BSController"
      ],
      "type": "COMPUTE",
      "level": 2
    },
    {
      "id": "RED-27c3-c138-BSController",
      "status": "RED",
      "message": "System tablet is unresponsive",
      "location": {
        "compute": {
          "tablet": {
            "type": "BSController",
            "id": [
              "72057594037989391"
            ]
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "type": "SYSTEM_TABLET",
      "level": 3
    },
    {
      "id": "RED-27c3-53b5",
      "status": "RED",
      "message": "System tablet BSC didn't provide information",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "type": "STORAGE",
      "level": 2
    },
    {
      "id": "YELLOW-27c3-5321",
      "status": "YELLOW",
      "message": "Storage degraded",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "YELLOW-27c3-595f-8d1d"
      ],
      "type": "STORAGE",
      "level": 2
    },
    {
      "id": "YELLOW-27c3-595f-8d1d",
      "status": "YELLOW",
      "message": "Pool degraded",
      "location": {
        "storage": {
          "pool": {
            "name": "static"
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "YELLOW-27c3-ef3e-0"
      ],
      "type": "STORAGE_POOL",
      "level": 3
    },
    {
      "id": "RED-84d8-3-3-1",
      "status": "RED",
      "message": "PDisk is not available",
      "location": {
        "storage": {
          "node": {
            "id": 3,
            "host": "man0-0026.ydb-dev.nemax.nebiuscloud.net",
            "port": 19001
          },
          "pool": {
            "group": {
              "vdisk": {
                "pdisk": [
                  {
                    "id": "3-1",
                    "path": "/dev/disk/by-partlabel/NVMEKIKIMR01"
                  }
                ]
              }
            }
          }
        }
      },
      "type": "PDISK",
      "level": 6
    },
    {
      "id": "RED-27c3-4847-3-0-1-0-2-0",
      "status": "RED",
      "message": "VDisk is not available",
      "location": {
        "storage": {
          "node": {
            "id": 3,
            "host": "man0-0026.ydb-dev.nemax.nebiuscloud.net",
            "port": 19001
          },
          "pool": {
            "name": "static",
            "group": {
              "vdisk": {
                "id": [
                  "0-1-0-2-0"
                ]
              }
            }
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-84d8-3-3-1"
      ],
      "type": "VDISK",
      "level": 5
    },
    {
      "id": "YELLOW-27c3-ef3e-0",
      "status": "YELLOW",
      "message": "Group degraded",
      "location": {
        "storage": {
          "pool": {
            "name": "static",
            "group": {
              "id": [
                "0"
              ]
            }
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-27c3-4847-3-0-1-0-2-0"
      ],
      "type": "STORAGE_GROUP",
      "level": 4
    }
  ],
  "location": {
    "id": 5,
    "host": "man0-0028.ydb-dev.nemax.nebiuscloud.net",
    "port": 19001
  }
}

  • there is a report on the bad BSC tablet
  • there is a RED issue about the lack of BSC information
  • there are proper issues for the static group disks
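
The three observations above correspond to the issues in the root report that have no `reason` of their own. A short Python sketch of how such a check could be scripted follows; `root_causes` is a hypothetical helper, and the list is a condensed copy of the report's `issue_log` with `location` and `level` dropped.

```python
from typing import Dict, List

# Condensed copy of the root report's issue_log, in the original order.
ROOT_ISSUE_LOG: List[Dict] = [
    {"id": "RED-27c3-70fb", "status": "RED", "type": "DATABASE",
     "message": "Database has multiple issues",
     "reason": ["RED-27c3-4e47", "RED-27c3-53b5", "YELLOW-27c3-5321"]},
    {"id": "RED-27c3-4e47", "status": "RED", "type": "COMPUTE",
     "message": "Compute has issues with system tablets",
     "reason": ["RED-27c3-c138-BSController"]},
    {"id": "RED-27c3-c138-BSController", "status": "RED", "type": "SYSTEM_TABLET",
     "message": "System tablet is unresponsive"},
    {"id": "RED-27c3-53b5", "status": "RED", "type": "STORAGE",
     "message": "System tablet BSC didn't provide information"},
    {"id": "YELLOW-27c3-5321", "status": "YELLOW", "type": "STORAGE",
     "message": "Storage degraded", "reason": ["YELLOW-27c3-595f-8d1d"]},
    {"id": "YELLOW-27c3-595f-8d1d", "status": "YELLOW", "type": "STORAGE_POOL",
     "message": "Pool degraded", "reason": ["YELLOW-27c3-ef3e-0"]},
    {"id": "RED-84d8-3-3-1", "status": "RED", "type": "PDISK",
     "message": "PDisk is not available"},
    {"id": "RED-27c3-4847-3-0-1-0-2-0", "status": "RED", "type": "VDISK",
     "message": "VDisk is not available", "reason": ["RED-84d8-3-3-1"]},
    {"id": "YELLOW-27c3-ef3e-0", "status": "YELLOW", "type": "STORAGE_GROUP",
     "message": "Group degraded", "reason": ["RED-27c3-4847-3-0-1-0-2-0"]},
]


def root_causes(issue_log: List[Dict]) -> List[Dict]:
    """Issues without a "reason" list: the leaves of the reason graph."""
    return [issue for issue in issue_log if not issue.get("reason")]


def dangling_reasons(issue_log: List[Dict]) -> List[str]:
    """Reason ids that do not resolve to any issue in the log."""
    known = {issue["id"] for issue in issue_log}
    return [r for issue in issue_log for r in issue.get("reason", []) if r not in known]


for issue in root_causes(ROOT_ISSUE_LOG):
    print(f"[{issue['status']}] {issue['type']}: {issue['message']}")
```

On this data the leaves are the unresponsive BSController tablet, the missing BSC information, and the unavailable PDisk behind the static group, and every `reason` id resolves to an issue in the log.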

github-actions bot commented Apr 10, 2024

2024-04-10 07:07:32 UTC Pre-commit check for 1c69ed5 has started.
2024-04-10 07:07:34 UTC Build linux-x86_64-relwithdebinfo is running...
🟢 2024-04-10 07:10:36 UTC Build successful.
2024-04-10 07:12:17 UTC Tests are running...
🔴 2024-04-10 08:15:07 UTC Some tests failed, follow the links below.

Test history

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
10069 9991 0 5 61 12

github-actions bot commented Apr 10, 2024

2024-04-10 07:08:26 UTC Pre-commit check for 1c69ed5 has started.
2024-04-10 07:08:29 UTC Build linux-x86_64-release-clang14 is running...
🟢 2024-04-10 07:18:38 UTC Build successful.

github-actions bot commented Apr 10, 2024

2024-04-10 07:09:30 UTC Pre-commit check for 1c69ed5 has started.
2024-04-10 07:09:32 UTC Build linux-x86_64-release-asan is running...
🟢 2024-04-10 07:12:16 UTC Build successful.
2024-04-10 07:14:01 UTC Tests are running...
🔴 2024-04-10 08:42:47 UTC Some tests failed, follow the links below.

Test history

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
8915 8711 0 38 148 18

@StekPerepolnen StekPerepolnen changed the title from "hc fallback whiteboard" to "hc fallback when static group has unknown status" Apr 10, 2024
@StekPerepolnen StekPerepolnen requested a review from adameat April 12, 2024 11:47
@StekPerepolnen StekPerepolnen force-pushed the hc-fallback-whiteboard branch 2 times, most recently from d3a57de to a371c9f April 23, 2024 11:01

github-actions bot commented Apr 23, 2024

2024-04-23 11:01:24 UTC Pre-commit check for 9f9e48d has started.
2024-04-23 11:01:25 UTC Check cancelled

github-actions bot commented Apr 23, 2024

2024-04-23 11:02:37 UTC Pre-commit check for 0c75606 has started.
2024-04-23 11:02:39 UTC Build linux-x86_64-release-clang14 is running...
🟢 2024-04-23 11:04:41 UTC Build successful.

github-actions bot commented Apr 23, 2024

2024-04-23 11:02:40 UTC Pre-commit check for 0c75606 has started.
2024-04-23 11:02:43 UTC Build linux-x86_64-release-asan is running...
🟢 2024-04-23 11:05:33 UTC Build successful.
2024-04-23 11:09:09 UTC Tests are running...
🔴 2024-04-23 11:28:05 UTC Test run completed, no test results found for commit a371c9f. Please check build logs.
2024-04-23 11:28:09 UTC Check cancelled

github-actions bot commented Apr 23, 2024

2024-04-23 11:03:18 UTC Pre-commit check for 0c75606 has started.
2024-04-23 11:03:20 UTC Build linux-x86_64-relwithdebinfo is running...
🟢 2024-04-23 11:05:30 UTC Build successful.
2024-04-23 11:09:02 UTC Tests are running...
🔴 2024-04-23 11:28:03 UTC Test run completed, no test results found for commit a371c9f. Please check build logs.
2024-04-23 11:28:07 UTC Check cancelled

@StekPerepolnen StekPerepolnen force-pushed the hc-fallback-whiteboard branch from a371c9f to 3b7dc8b April 23, 2024 11:27

github-actions bot commented Apr 23, 2024

2024-04-23 11:31:33 UTC Pre-commit check for c74a287 has started.
2024-04-23 11:31:34 UTC Build linux-x86_64-release-asan is running...
🟢 2024-04-23 11:34:07 UTC Build successful.
2024-04-23 11:35:47 UTC Tests are running...
🔴 2024-04-23 13:14:33 UTC Some tests failed, follow the links below.

Test history

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
8921 8773 0 43 87 18

github-actions bot commented Apr 23, 2024

2024-04-23 11:31:57 UTC Pre-commit check for c74a287 has started.
2024-04-23 11:32:01 UTC Build linux-x86_64-release-clang14 is running...
🟢 2024-04-23 11:34:30 UTC Build successful.

github-actions bot commented Apr 23, 2024

2024-04-23 11:31:59 UTC Pre-commit check for c74a287 has started.
2024-04-23 11:32:03 UTC Build linux-x86_64-relwithdebinfo is running...
🟢 2024-04-23 11:34:38 UTC Build successful.
2024-04-23 11:36:25 UTC Tests are running...
🔴 2024-04-23 12:50:40 UTC Some tests failed, follow the links below.

Test history

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
12870 10978 0 15 1860 17
