Envoy crash when validating bootstrap config file #709

Closed
jakewang-stripe opened this issue Jun 2, 2023 · 3 comments

@jakewang-stripe

Description:
Envoy occasionally crashes when validating bootstrap config

Repro steps:
It happens on about 30 of Stripe's roughly 50k hosts. I can log in to those hosts, run the validation manually, and get a segfault roughly one out of every two or three runs:

$ sudo /pay/jenkins-artifacts/envoy/1.24.4-stripe1/envoy-stripe --config-path $bootstrap_path  --mode validate --service-cluster certhorse --service-zone us-west-2b --service-node qa-certhorse--01b66f3177b5f6cdc
Segmentation fault

Sometimes the validation hangs indefinitely instead of crashing.
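
To put a number on the "one out of every two or three runs" estimate, and to catch the hang case as well, the validate command can be run repeatedly under a timeout. The following is a minimal sketch, not something from the original report: the helper names `run_once` and `tally`, the 60-second timeout, and the run count are all assumptions chosen for illustration.

```python
import signal
import subprocess
from collections import Counter


def run_once(cmd, timeout_s=60.0):
    """Run cmd once and classify the outcome as ok, segfault, hang, or error."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout.
        return "hang"
    if proc.returncode == 0:
        return "ok"
    # On POSIX a negative return code means the child died from that signal.
    # Shell-like wrappers may instead report 128 + signal number, so both
    # forms are accepted here (an assumption about how the command is wrapped).
    if proc.returncode in (-signal.SIGSEGV, 128 + signal.SIGSEGV):
        return "segfault"
    return "error"


def tally(cmd, runs=20, timeout_s=60.0):
    """Run cmd `runs` times and count how often each outcome occurs."""
    return Counter(run_once(cmd, timeout_s) for _ in range(runs))
```

Calling `tally` with the validate command line above split into an argument list (for example with `shlex.split`) returns a counter of `ok`, `segfault`, `hang`, and `error` outcomes for that host.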

Call stack:

[external/envoy/source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x0
[external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[external/envoy/source/server/backtrace.h:92] Envoy version: 2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL
[external/envoy/source/server/backtrace.h:96] #0: __restore_rt [0x7f26c19d1420]
[external/envoy/source/server/backtrace.h:96] #1: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Refill() [0x55cf3f28ce2a]
[external/envoy/source/server/backtrace.h:96] #2: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Allocate<>()::Helper::Underflow() [0x55cf3f28df77]
[external/envoy/source/server/backtrace.h:96] #3: Envoy::Api::ValidationImpl::allocateDispatcher() [0x55cf3dd5b3de]
[external/envoy/source/server/backtrace.h:96] #4: Envoy::Server::ValidationInstance::ValidationInstance() [0x55cf3dd4b52f]
[external/envoy/source/server/backtrace.h:96] #5: Envoy::Server::validateConfig() [0x55cf3dd4aa65]
[external/envoy/source/server/backtrace.h:96] #6: Envoy::MainCommonBase::run() [0x55cf3dd11370]
[external/envoy/source/server/backtrace.h:96] #7: Envoy::MainCommon::main() [0x55cf3dd11a7d]
[external/envoy/source/server/backtrace.h:96] #8: main [0x55cf3dd0da4a]
[external/envoy/source/server/backtrace.h:96] #9: __libc_start_main [0x7f26c17ef083]

uname -a:
Linux 5.15.0-1036-aws #40~20.04.1-Ubuntu SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Envoy version:
2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL

2f44165e55dd47475c44d2d03018eac3cb8a6264 is an internal commit in Stripe's Envoy repo; it is based on OSS Envoy 1.24.4.

@jakewang-stripe
Author

We found a potential root cause: the tcmalloc version used in Envoy 1.24.4 appears to have a bug where it cannot handle non-sequential online CPU ids. The segfaults happen on EC2 instances with Nitro Enclaves enabled, where some CPUs have been hot-unplugged, e.g.:

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0,2-8,10-15
Off-line CPU(s) list:            1,9

The theory is that tcmalloc uses the CPU id to index into the arrays that hold its per-CPU data structures. If tcmalloc allocates 14 entries because the online CPU count is 14, but the 14th online CPU has id 15, then the array access for that CPU is out of bounds.

Can you confirm whether that is the actual root cause?
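
The indexing theory can be checked against the `lscpu` output with a few lines of Python. This is only a model of the hypothesis as stated in this comment, not tcmalloc's actual code; the helper names `parse_cpu_list` and `out_of_bounds_ids` are made up for illustration. On Linux, the same list format can be read from `/sys/devices/system/cpu/online`.

```python
def parse_cpu_list(text):
    """Parse a kernel-style CPU list such as "0,2-8,10-15" into sorted ids."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare id like "0" has no upper bound, so it is a one-element range.
        ids.update(range(int(lo), int(hi or lo) + 1))
    return sorted(ids)


def out_of_bounds_ids(online):
    """Return CPU ids that would overrun an array sized by the online count."""
    return [cpu for cpu in online if cpu >= len(online)]


online = parse_cpu_list("0,2-8,10-15")
print(len(online), out_of_bounds_ids(online))  # prints: 14 [14, 15]
```

Under this model, a thread running on CPU 14 or 15 would index past a 14-entry array, while a host with a contiguous list such as `0-15` yields no out-of-bounds ids. That would be consistent with the crash appearing only on hosts with offline CPUs and only on some runs.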

@jakewang-stripe
Author

Moved to envoyproxy/envoy#27775.

@jpeach
Contributor

jpeach commented Jun 3, 2023

@jakewang-stripe is there an upstream tcmalloc issue tracking this?
