Description
We opened an issue with Envoy (envoyproxy/envoy#27775) about it crashing while validating the bootstrap config.
Call stack:
[external/envoy/source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x0
[external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[external/envoy/source/server/backtrace.h:92] Envoy version: 2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL
[external/envoy/source/server/backtrace.h:96] #0: __restore_rt [0x7f26c19d1420]
[external/envoy/source/server/backtrace.h:96] #1: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Refill() [0x55cf3f28ce2a]
[external/envoy/source/server/backtrace.h:96] #2: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Allocate<>()::Helper::Underflow() [0x55cf3f28df77]
[external/envoy/source/server/backtrace.h:96] #3: Envoy::Api::ValidationImpl::allocateDispatcher() [0x55cf3dd5b3de]
[external/envoy/source/server/backtrace.h:96] #4: Envoy::Server::ValidationInstance::ValidationInstance() [0x55cf3dd4b52f]
[external/envoy/source/server/backtrace.h:96] #5: Envoy::Server::validateConfig() [0x55cf3dd4aa65]
[external/envoy/source/server/backtrace.h:96] #6: Envoy::MainCommonBase::run() [0x55cf3dd11370]
[external/envoy/source/server/backtrace.h:96] #7: Envoy::MainCommon::main() [0x55cf3dd11a7d]
[external/envoy/source/server/backtrace.h:96] #8: main [0x55cf3dd0da4a]
[external/envoy/source/server/backtrace.h:96] #9: __libc_start_main [0x7f26c17ef083]
We found a potential root cause: a tcmalloc bug where it is unable to handle non-sequential online CPU IDs. The segfaults happen on EC2 instances with Nitro Enclaves enabled, where some CPUs have been hot-unplugged and are offline, e.g.:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0,2-8,10-15
Off-line CPU(s) list: 1,9
The theory is that tcmalloc uses the CPU ID to index into the per-CPU arrays that hold its per-CPU data structures. If tcmalloc allocates 14 entries because the number of online CPUs is 14, but the 14th online CPU has ID 15, then the array access is out of bounds.
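To make the arithmetic concrete, here is a small sketch that models the theory using the On-line CPU(s) list from the lscpu output above. It is not tcmalloc code: the helper names `parse_cpu_list` and `out_of_range_ids` are made up for illustration, and the assumption that the array is sized by the online CPU count is exactly the hypothesis we are asking about.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as "0,2-8,10-15" into CPU IDs."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        # A single ID like "0" has no upper bound, so reuse the lower one.
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def out_of_range_ids(online_spec):
    """Return online CPU IDs that would overflow an array sized by the online count."""
    online = parse_cpu_list(online_spec)
    slots = len(online)  # assumed sizing: one slot per online CPU (our hypothesis)
    return [cpu for cpu in online if cpu >= slots]


online = "0,2-8,10-15"  # On-line CPU(s) list from lscpu
print(len(parse_cpu_list(online)))  # 14
print(out_of_range_ids(online))     # [14, 15]
```

Under that assumption, any thread scheduled on CPU 14 or 15 would index past the end of a 14-entry array, whereas a contiguous list such as 0-15 yields no out-of-range IDs.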
Can you confirm whether this is the actual root cause, and whether it has already been fixed by a commit?
Thanks