
[ML] macOS: Controller process sometimes terminated with SIGKILL #2429

Open
@davidkyle

Description


Some developers working in the Elasticsearch repository have reported intermittent crashes of the machine learning controller process when running a locally built Elasticsearch. The usual symptom is that Elasticsearch fails to start and the log contains this message:

[ERROR][o.e.b.Elasticsearch      ] [runTask-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: Failure running machine learning native code. This could be due to running on an unsupported OS or distribution, missing OS libraries, or a problem with the temp directory. To bypass this problem by running Elasticsearch without machine learning functionality set [xpack.ml.enabled: false].

The corresponding crash report can be found in the macOS Console app:

Path:                /Users/USER/*/controller.app/Contents/MacOS/controller
Identifier:          co.elastic.ml-cpp.controller
Version:             8.7.0
Code Type:           ARM-64 (Native)

Exception Type:  EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
Exception Subtype: UNKNOWN_0x32 at 0x000000010249c000
Exception Codes: 0x0000000000000032, 0x000000010249c000
VM Region Info: 0x10249c000 is in 0x10249c000-0x1024b0000;  bytes after start: 0  bytes before end: 81919
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  mapped file                 10249c000-1024b0000    [   80K] r-x/r-x SM=COW  ...t_id=787f689f
      mapped file                 1024b0000-1024b4000    [   16K] rw-/rw- SM=COW  ...t_id=787f689f
Exception Note:  EXC_CORPSE_NOTIFY
Termination Reason: CODESIGNING 2 

The error has been observed on Apple silicon only (so far).

Reproducing

The problem cannot be reproduced reliably, but once it has occurred, a crash report can be generated by running controller --help from the local build in the Elasticsearch repository:

<ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller --help
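The build path above is long and changes with every version. Below is a minimal bash sketch of a helper that assembles it from the repository root and the version string. The function name controller_bin is hypothetical (it is not part of Elasticsearch or ml-cpp), and the layout is copied from the command above, so it assumes a darwin-aarch64 archive build.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Elasticsearch or ml-cpp): print the path of
# the controller binary inside a local darwin-aarch64 archive build.
# Usage: controller_bin <ES_REPO> <VERSION>, e.g. 8.7.0-SNAPSHOT
controller_bin() {
  local es_repo="$1" version="$2"
  local install="${es_repo}/distribution/archives/darwin-aarch64-tar/build/install"
  printf '%s\n' "${install}/elasticsearch-${version}/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller"
}
```

From the repository root, "$(controller_bin "$PWD" 8.7.0-SNAPSHOT)" --help followed by echo $? repeats the reproduction step. In bash, an exit status of 137 (128 + 9) means the process was killed by SIGKILL.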

Running the app from a different location works?!

Copying the app to a folder in the home directory and running the copy does not result in a crash:

cd ~/Desktop
cp -r <ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app .
controller.app/Contents/MacOS/controller --help
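The comparison can be scripted so that both runs use the same bundle. This is a sketch under stated assumptions: compare_locations is a hypothetical name, the copy goes into a fresh directory under $HOME by default (the issue used ~/Desktop), and only exit statuses are compared, not crash reports.

```shell
#!/usr/bin/env bash
# Sketch: run the controller in place and from a copy, then print both exit
# statuses so the two locations can be compared.
# Assumptions: $1 is a controller.app bundle; $2 (optional) is the parent
# directory for the copy and defaults to $HOME, as in the issue.
compare_locations() {
  local app="${1%/}" parent="${2:-$HOME}" dest in_place copied
  dest="$(mktemp -d "${parent}/controller-copy.XXXXXX")" || return 1
  cp -R "$app" "$dest/"
  # "cmd && x=0 || x=$?" records the status without tripping set -e.
  "$app/Contents/MacOS/controller" --help >/dev/null 2>&1 && in_place=0 || in_place=$?
  "$dest/$(basename "$app")/Contents/MacOS/controller" --help >/dev/null 2>&1 && copied=0 || copied=$?
  rm -rf "$dest"   # remove only the temporary copy
  printf 'in place: %s, copy: %s\n' "$in_place" "$copied"
}
```

A result such as in place: 137, copy: 0 would match the behaviour described above, since 137 is the bash status for a SIGKILL.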

Possible Causes

macOS Quarantine

No.

The downloaded controller.app does not have the quarantine attribute set.
find . -xattrname com.apple.quarantine returns nothing.

Security Policy

No.
After disabling security assessments (Gatekeeper) with sudo spctl --global-disable, the controller app still crashes.

cd <ES_REPO>
sudo spctl --global-disable
sudo spctl --assess -vv ./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller

./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller: accepted
override=security disabled

echo $?
0

When security is enabled, spctl --assess returns the same error as codesign --verify:

cd <ES_REPO>
sudo spctl --assess -vv ./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller

./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller: code has no resources but signature indicates they must be present

echo $?
1

Code Signing

Maybe.

The crash report indicates that code signing is involved:

Exception Type:  EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))

and

Termination Reason: CODESIGNING 2 
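When several crash reports need triaging, the two markers above can be checked mechanically. A minimal sketch, assuming the report has been saved as plain text from the Console app; is_codesign_kill is a hypothetical name, and raw report files on disk may use a different layout that these exact strings do not match.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (status 0) if a saved crash report contains both
# code-signing markers quoted in this issue; return non-zero otherwise.
# Assumption: the report is plain text as displayed in the Console app.
is_codesign_kill() {
  local report="$1"
  grep -Fq 'SIGKILL (Code Signature Invalid)' "$report" &&
    grep -Fq 'Termination Reason: CODESIGNING' "$report"
}
```

Usage: is_codesign_kill saved-report.txt && echo "code-signing kill".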

Verifying the signature returns an error message:

cd <ES_REPO>
codesign -d --verify --verbose=4 ./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app


./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller: code has no resources but signature indicates they must be present

However, it is not clear whether that error is fatal.
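To check the bundle before and after trying the workarounds, the verification can be wrapped in a small function. This sketch uses codesign --verify --deep --strict --verbose=2 rather than the -d --verify form quoted above; verify_controller_signature is a hypothetical name, and a failing result does not by itself show that the signature problem is what kills the process.

```shell
#!/usr/bin/env bash
# Sketch: verify the signature of a controller.app bundle and print a one-line
# verdict. codesign writes its diagnostics to stderr, hence the 2>&1 capture.
verify_controller_signature() {
  local app="$1" out status
  out="$(codesign --verify --deep --strict --verbose=2 "$app" 2>&1)" && status=0 || status=$?
  if [ "$status" -eq 0 ]; then
    printf 'valid: %s\n' "$app"
  else
    printf 'invalid (exit %s): %s\n' "$status" "$out"
  fi
  return "$status"
}
```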

Workarounds

In the commands below, replace elasticsearch-8.7.0-SNAPSHOT with the version you are building.

  • Deleting the bundled app from the local build and rebuilding is the most reliable fix:
cd <ES_REPO>
rm -rf distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app
./gradlew run
  • Re-signing the app with an ad-hoc signature works for some developers:
codesign --force --deep --sign - <ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app
  • If all else fails, restart the machine.
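The first workaround can be wrapped in a helper that refuses to run when the bundle is missing or an argument is empty. A sketch only: reset_controller_app is a hypothetical name, and the path layout is copied from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper automating the first workaround: delete controller.app
# from a local darwin-aarch64 build so that ./gradlew run recreates it.
# Usage: reset_controller_app <ES_REPO> <VERSION>
reset_controller_app() {
  local es_repo="$1" version="$2" app
  # Guard against empty arguments before building a path passed to rm -rf.
  [ -n "$es_repo" ] && [ -n "$version" ] || return 2
  app="${es_repo}/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-${version}/modules/x-pack-ml/platform/darwin-aarch64/controller.app"
  if [ ! -d "$app" ]; then
    printf 'not found: %s\n' "$app" >&2
    return 1
  fi
  rm -rf "$app"
  printf 'removed: %s\n' "$app"
}
```

After it prints the removed path, run ./gradlew run from the repository root as in the first bullet.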
