How to recover in writing timeout #5705

Open
@lv-stupidboy

Please check the FAQ documentation before raising an issue

Describe the bug (required)

Your Environments (required)

  • OS: uname -a
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id (e.g. a3ffc7d8)

How To Reproduce(required)

Steps to reproduce the behavior:
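Version 3.2.1; the space has 3 replicas. Each machine has 104 CPU cores and 300+ GB of memory (no CPU or memory bottleneck was observed) and four 800 GB SSDs. The data volume is about 20 GB.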

  1. Data is written through flink-connector with batch size = 100. After writing for a while, the graph log shows RPC timeouts:
    StorageClientBase-inl.h.ext: Request to ip:9779 time out : TTransportException: Timed out
    There some RPC errors: RPC failure in storageClient with without :: TTransportException: time out
    InsertVerticesExecutor failed, error E_PRC_FAILURE, part 1
    InsertVerticesExecutor failed, error E_PRC_FAILURE, part 2
    InsertVerticesExecutor failed, error E_PRC_FAILURE, part 3
  2. The corresponding storage log shows:
    RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001168
    RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10000230
    RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001245
    RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001037
    RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001223
    .........
  3. The storage log above kept printing for 7 hours without returning to normal, and the node stayed offline the whole time.
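To make the storage log easier to read, here is a minimal sketch that extracts the `replicateLogLatencyUs` values quoted above and converts them from microseconds to seconds. The helper name `latencies_seconds` and the regular expression are illustrative only and are not part of NebulaGraph. Every value comes out just over 10 s, which looks like a fixed timeout being hit rather than variable latency. That reading is an assumption and has not been confirmed against the source.

```python
import re

# Log lines copied from the storage log in step 2.
LOG_LINES = [
    "RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001168",
    "RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10000230",
    "RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001245",
    "RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001037",
    "RaftPart.cpp:1033 Replicating log timed out : replicateLogLatencyUs 10001223",
]

# Hypothetical pattern: matches the latency field as it appears above.
PATTERN = re.compile(r"replicateLogLatencyUs\s+(\d+)")


def latencies_seconds(lines):
    """Return the replication latency of each matching line, in seconds."""
    result = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # The "Us" suffix indicates microseconds.
            result.append(int(match.group(1)) / 1_000_000)
    return result


if __name__ == "__main__":
    for seconds in latencies_seconds(LOG_LINES):
        print(f"{seconds:.6f} s")
```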

Expected behavior
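1. What are the possible causes of the situation described above?
2. How should the node be recovered?
3. With only a single node offline and the other 2 replicas healthy, why do newly submitted jobs still fail to write?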
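As background for question 3, here is a small sketch of the majority rule, assuming the storage layer follows standard Raft quorum semantics (the `RaftPart.cpp` log lines suggest Raft is in use). The function names `majority` and `can_commit` are hypothetical and are not NebulaGraph code. Under that assumption, a 3-replica part needs only 2 acknowledgments to commit a write, which is why write failures with two healthy replicas are unexpected.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def can_commit(replicas: int, online: int) -> bool:
    """True if enough replicas are online to acknowledge a write.

    Assumes standard Raft semantics: a log entry commits once a
    majority of the replicas (leader included) have accepted it.
    """
    return online >= majority(replicas)


if __name__ == "__main__":
    # 3 replicas with 1 node offline leaves 2 online.
    print(majority(3))
    print(can_commit(replicas=3, online=2))
```

If this assumption holds, one offline node alone should not block writes, so the failures may point to something else, such as the leader of the affected parts still residing on the stuck node.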

Additional context
