This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

[BUG] Can't initialize detector after disable/re-enable AD plugin #132

Open
ylwu-amzn opened this issue May 20, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@ylwu-amzn
Contributor

Describe the bug

After the AD plugin is disabled, all detectors stop. After re-enabling the AD plugin and starting a detector again, the detector ends up in the initialization failure state.

Exception:

[2020-05-19T17:27:53,722][ERROR][c.a.o.a.t.AnomalyResultTransportAction] [105773eab7d75468e6587eb835f66c75] Received an error from node FoAzrJP0QcC3q_s0z7rYtg while fetching anomaly grade for yNpUXXEBPdSnBWCFN-zN
org.elasticsearch.transport.RemoteTransportException: [8f49a9616d531ad45e3283fdf1561d21][10.212.17.141:9300][cluster:admin/ad/rcf/result]
Caused by: java.lang.IllegalArgumentException: point.length must equal 16
        at com.amazon.randomcutforest.CommonUtils.checkArgument(CommonUtils.java:40) ~[?:?]
        at com.amazon.randomcutforest.RandomCutForest.traverseForest(RandomCutForest.java:314) ~[?:?]
        at com.amazon.randomcutforest.RandomCutForest.getAnomalyScore(RandomCutForest.java:455) ~[?:?]
        at com.amazon.opendistroforelasticsearch.ad.ml.ModelManager.getRcfResult(ModelManager.java:383) ~[?:?]
        at com.amazon.opendistroforelasticsearch.ad.ml.ModelManager.getRcfResult(ModelManager.java:370) ~[?:?]
        at com.amazon.opendistroforelasticsearch.ad.transport.RCFResultTransportAction.doExecute(RCFResultTransportAction.java:61) ~[?:?]
        at com.amazon.opendistroforelasticsearch.ad.transport.RCFResultTransportAction.doExecute(RCFResultTransportAction.java:32) ~[?:?]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:154) ~[elasticsearch-7.4.2.jar:7.4.2]
        at com.amazon.opendistro.elasticsearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:77) ~[?:?]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:152) ~[elasticsearch-7.4.2.jar:7.4.2]
        at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityFilter.apply0(OpenDistroSecurityFilter.java:218) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityFilter.apply(OpenDistroSecurityFilter.java:119) ~[?:?]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:152) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:130) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:64) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:60) ~[elasticsearch-7.4.2.jar:7.4.2]
        at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceivedDecorate(OpenDistroSecuritySSLRequestHandler.java:182) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.transport.OpenDistroSecurityRequestHandler.messageReceivedDecorate(OpenDistroSecurityRequestHandler.java:285) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceived(OpenDistroSecuritySSLRequestHandler.java:142) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin$7$1.messageReceived(OpenDistroSecurityPlugin.java:655) ~[?:?]
        at com.amazonaws.elasticsearch.iam.IamTransportRequestHandler.messageReceived(IamTransportRequestHandler.java:50) ~[?:?]
        at com.amazon.opendistro.elasticsearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:48) ~[?:?]
        at com.amazonaws.elasticsearch.ccs.CrossClusterRequestInterceptor$CrossClusterRequestHandler.messageReceived(CrossClusterRequestInterceptor.java:133) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:236) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:185) [elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]

To Reproduce

Steps to reproduce the behavior:
1. Create a detector with two features and a 1-minute interval. Start the detector and wait until its state becomes running.
2. Disable the AD plugin:

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "opendistro.anomaly_detection.enabled": false
  }
}'

3. Remove one feature from the detector.
4. Enable the AD plugin:

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "opendistro.anomaly_detection.enabled": true
  }
}'

5. Restart the detector. After 10 minutes, the detector state shows initialization failure.

@ylwu-amzn added the bug label May 20, 2020
@ylwu-amzn self-assigned this May 21, 2020
@ylwu-amzn
Contributor Author

ylwu-amzn commented May 21, 2020

Root cause

Once the detector finishes its first initialization, we store model checkpoints in the checkpoint index.
When the AD plugin is disabled, the detector stops automatically without deleting its model checkpoint. After one feature is removed and the detector is restarted, the re-initialization process retrieves the model checkpoint from the index and restores it. The model in the checkpoint requires 2 features (16 dimensions: 2 features * 8 shingle size), but the detector now has only 1 feature (8 dimensions). This mismatch causes the exception above.
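
The arithmetic behind the mismatch can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the check merely imitates the "point.length must equal 16" message from the stack trace rather than reproducing the Random Cut Forest library's code.

```java
// Hypothetical illustration; not the plugin's or the RCF library's actual code.
final class ShingleDimensionSketch {

    /** A model trained on shingled input expects featureCount * shingleSize values per point. */
    static int modelDimensions(int featureCount, int shingleSize) {
        return featureCount * shingleSize;
    }

    /** Imitates the argument check seen in the stack trace: reject points of the wrong length. */
    static void checkPoint(double[] point, int expectedDimensions) {
        if (point.length != expectedDimensions) {
            throw new IllegalArgumentException("point.length must equal " + expectedDimensions);
        }
    }

    public static void main(String[] args) {
        int shingleSize = 8;
        // The checkpointed model was trained while the detector had 2 features.
        int checkpointDimensions = modelDimensions(2, shingleSize);
        // After the update the detector has 1 feature, so its points are shorter.
        double[] currentPoint = new double[modelDimensions(1, shingleSize)];
        try {
            checkPoint(currentPoint, checkpointDimensions);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "point.length must equal 16"
        }
    }
}
```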

Solution

To be safe, we can delete the model checkpoint document from the index when the AD job is stopped automatically. A new model will then be built when the detector is restarted.

@wnbts
Contributor

wnbts commented May 21, 2020

Another solution would be to clear model checkpoints regardless of where the request is received or the model is hosted. I will take a deeper look and see if that is feasible.

@wnbts
Contributor

wnbts commented May 21, 2020

By design, model data should be cleared with an update, correct?

@ylwu-amzn
Contributor Author

Yes, when a user stops a detector, AD deletes the checkpoint automatically. But when we stop a detector because of an exception, we don't delete the checkpoint. I'm working on the job runner to add checkpoint deletion when an AD job is stopped.

Can you help check the model run part? When we restore a model from a checkpoint and find that the model dimension doesn't match the current detector's feature count, we should delete the checkpoint and build a new model.

@wnbts
Contributor

wnbts commented May 21, 2020

@ylwu-amzn Dimension check cannot be relied on. For example, the features start with sum(x), avg(y) and after the update, they become max(s) and min(t). While the dimension is the same, the previous model is not usable for the new features.
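
A small sketch of this gap, using the feature names from the example above. The class and helper names are hypothetical, and features are reduced to plain strings for illustration; the shingle size of 8 is taken from the root-cause comment.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a size check alone misses some mismatches.
final class DimensionCheckGapSketch {
    static final int SHINGLE_SIZE = 8; // value quoted in the root-cause comment

    static int dimensions(List<String> features) {
        return features.size() * SHINGLE_SIZE;
    }

    /** The size check alone: passes whenever the dimensions agree. */
    static boolean sizeCheckPasses(List<String> trainedOn, List<String> current) {
        return dimensions(trainedOn) == dimensions(current);
    }

    /** A stricter check that also compares the feature definitions themselves. */
    static boolean definitionsMatch(List<String> trainedOn, List<String> current) {
        return trainedOn.equals(current);
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("sum(x)", "avg(y)");
        List<String> after = Arrays.asList("max(s)", "min(t)");
        // Both configurations yield 16 dimensions, so the size check cannot tell them apart.
        System.out.println("size check passes: " + sizeCheckPasses(before, after));   // true
        System.out.println("definitions match: " + definitionsMatch(before, after)); // false
    }
}
```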

@ylwu-amzn
Contributor Author

ylwu-amzn commented May 21, 2020

Yes, agreed. We can't trust the model just because its dimension equals what the feature count implies, but we can at least catch the case where they are not equal. Ideally, we would store the detector configuration in the model checkpoint and check whether it equals the current detector configuration. If the detector interval, window delay, indices, time field, filter, or feature definitions have changed, we should not trust the model checkpoint.
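
One way to picture that comparison is a small value object holding the fields listed above. This is a sketch under assumptions: the class name, field names, and field types are all hypothetical and do not reflect the plugin's actual detector or checkpoint schema.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical value object; field names and types are assumptions for illustration.
final class DetectorConfigSnapshot {
    final long intervalMinutes;
    final long windowDelayMinutes;
    final List<String> indices;
    final String timeField;
    final String filterQuery;
    final List<String> featureDefinitions;

    DetectorConfigSnapshot(long intervalMinutes, long windowDelayMinutes, List<String> indices,
                           String timeField, String filterQuery, List<String> featureDefinitions) {
        this.intervalMinutes = intervalMinutes;
        this.windowDelayMinutes = windowDelayMinutes;
        this.indices = indices;
        this.timeField = timeField;
        this.filterQuery = filterQuery;
        this.featureDefinitions = featureDefinitions;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof DetectorConfigSnapshot)) return false;
        DetectorConfigSnapshot that = (DetectorConfigSnapshot) other;
        return intervalMinutes == that.intervalMinutes
            && windowDelayMinutes == that.windowDelayMinutes
            && Objects.equals(indices, that.indices)
            && Objects.equals(timeField, that.timeField)
            && Objects.equals(filterQuery, that.filterQuery)
            && Objects.equals(featureDefinitions, that.featureDefinitions);
    }

    @Override
    public int hashCode() {
        return Objects.hash(intervalMinutes, windowDelayMinutes, indices, timeField,
                            filterQuery, featureDefinitions);
    }

    /** Trust the checkpoint only if it was written under exactly the current configuration. */
    static boolean checkpointIsTrusted(DetectorConfigSnapshot storedWithCheckpoint,
                                       DetectorConfigSnapshot current) {
        return storedWithCheckpoint != null && storedWithCheckpoint.equals(current);
    }

    public static void main(String[] args) {
        DetectorConfigSnapshot stored = new DetectorConfigSnapshot(
            1, 1, Arrays.asList("logs"), "timestamp", "", Arrays.asList("sum(x)", "avg(y)"));
        DetectorConfigSnapshot current = new DetectorConfigSnapshot(
            1, 1, Arrays.asList("logs"), "timestamp", "", Arrays.asList("max(s)", "min(t)"));
        // Same dimension, different feature definitions: the checkpoint is not trusted.
        System.out.println(checkpointIsTrusted(stored, current)); // false
    }
}
```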

@wnbts
Contributor

wnbts commented May 22, 2020

A size check is not the right solution. Model versioning was the initial design, but the current design changed it to rely on updates to prevent model/config mismatch.

@ylwu-amzn
Contributor Author

ylwu-amzn commented May 22, 2020

Because this problem only occurs in some edge cases, the plan is to fix it with a long-term solution rather than a workaround such as a feature size check. Here are some options.

Option 1: delete checkpoints on every update

When a detector is updated, delete its current checkpoint, whether or not the detector is running.

Pros:

  1. The checkpoint is removed only when the detector is updated, so it can still be reused if the detector was merely stopped without any update.

Cons:

  1. Only updates made through the REST API are handled. If a user uses PUT to update the detector directly, the checkpoint can still fail to match the new detector configuration.
  2. It doesn't handle the case where checkpoint deletion fails.
  3. If the detector is stopped and never updated afterwards, the checkpoint is kept and the model is restored from it. If the detector stays stopped for a long time and the data pattern changes, the checkpoint may not suit the current data pattern at first.
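
A minimal sketch of Option 1 and its first drawback, assuming in-memory maps as stand-ins for the detector config index and the checkpoint index. Every name here is hypothetical, and `writeConfigDirectly` represents any write that bypasses the update handler.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-ins for the detector config index and the checkpoint index.
final class DeleteOnUpdateSketch {
    private final Map<String, String> configs = new HashMap<>();
    private final Map<String, String> checkpoints = new HashMap<>();

    void saveCheckpoint(String detectorId, String serializedModel) {
        checkpoints.put(detectorId, serializedModel);
    }

    /** Update path: under Option 1 this is the only place that clears checkpoints. */
    void updateDetector(String detectorId, String newConfig) {
        configs.put(detectorId, newConfig);
        checkpoints.remove(detectorId);
    }

    /** A write that bypasses the update path, so the stale checkpoint survives. */
    void writeConfigDirectly(String detectorId, String newConfig) {
        configs.put(detectorId, newConfig);
    }

    boolean hasCheckpoint(String detectorId) {
        return checkpoints.containsKey(detectorId);
    }

    public static void main(String[] args) {
        DeleteOnUpdateSketch store = new DeleteOnUpdateSketch();
        store.saveCheckpoint("detector-1", "model trained on 2 features");

        store.writeConfigDirectly("detector-1", "1 feature");
        System.out.println("after direct write: " + store.hasCheckpoint("detector-1")); // true

        store.updateDetector("detector-1", "1 feature");
        System.out.println("after update: " + store.hasCheckpoint("detector-1")); // false
    }
}
```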

Option 2: delete checkpoints when the detector stops

When a detector stops, whether through the REST API or because of an EndRunException, delete its current checkpoint.

Pros:

  1. No mismatch problem.

Cons:

  1. The detector needs to rebuild its model on restart.
  2. It doesn't handle the case where checkpoint deletion fails.
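
Option 2 might look roughly like the sketch below, which also shows the cost of rebuilding on every restart. The `StopReason` enum, the in-memory checkpoint map, and the method names are assumptions made for illustration; deletion is assumed to always succeed, which is exactly what the second drawback questions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Option 2; not the plugin's job runner.
final class DeleteOnStopSketch {
    enum StopReason { USER_REQUEST, END_RUN_EXCEPTION }

    private final Map<String, String> checkpoints = new HashMap<>();
    int modelsBuilt = 0;

    /** Returns true if the model was restored from a checkpoint, false if it had to be built. */
    boolean start(String detectorId) {
        if (checkpoints.containsKey(detectorId)) {
            return true;
        }
        modelsBuilt++;
        checkpoints.put(detectorId, "model #" + modelsBuilt);
        return false;
    }

    /** The stop reason is deliberately ignored: every stop deletes the checkpoint. */
    void stop(String detectorId, StopReason reason) {
        checkpoints.remove(detectorId);
    }

    public static void main(String[] args) {
        DeleteOnStopSketch runner = new DeleteOnStopSketch();
        runner.start("detector-1");
        runner.stop("detector-1", StopReason.END_RUN_EXCEPTION);
        boolean restored = runner.start("detector-1");
        System.out.println("restored from checkpoint: " + restored);  // false
        System.out.println("models built: " + runner.modelsBuilt);    // 2
    }
}
```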

Option 3: store the detector configuration in the checkpoint and compare it with the latest detector configuration

We store both the model and the detector configuration in the checkpoint. We compare the latest detector configuration with the one in the checkpoint and, if they don't match, throw an EndRunException.

Pros:

  1. No mismatch problem.
  2. The model from the checkpoint can be reused without building a new one.

Cons:

  1. More effort: we need to change the checkpoint index and the logic that parses checkpoints, and add the detector configuration comparison.
  2. The old model may not suit the current data pattern at first, so there may be some false positives/negatives. The model will be updated by new streaming data points and become stable.

@wnbts
Contributor

wnbts commented May 22, 2020

Thanks for the redesigns. The first two options are finicky and lead to poor customer experiences.

The third option is better in all regards and is the closest to the original design, except for some minor details.

  1. If the configurations don't match, the scorer will reject requests and raise ResourceNotFoundException to kick off training/model generation based on the changed detector configuration. The detector will then resume working once training/model generation completes, without requiring user action or producing false results.

The effort is manageable and can be done in steps.

  1. Pass the current detector to all requests to models.
  2. Change model internals to store, retrieve, and verify the detector.
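
The two steps above could be sketched as follows. Everything in this block is a hypothetical stand-in: the nested `ResourceNotFoundException` is a local class rather than the Elasticsearch one, the detector configuration is reduced to a string, the score is a placeholder, and training runs synchronously here although the real flow would presumably be asynchronous.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the scorer receives the current config with each request and verifies it.
final class ConfigAwareScorerSketch {
    /** Local stand-in for the exception named in the discussion. */
    static final class ResourceNotFoundException extends RuntimeException {
        ResourceNotFoundException(String message) {
            super(message);
        }
    }

    /** A model stored together with the configuration it was trained under. */
    static final class StoredModel {
        final String trainedConfig;

        StoredModel(String trainedConfig) {
            this.trainedConfig = trainedConfig;
        }
    }

    private final Map<String, StoredModel> models = new HashMap<>();
    int trainingRuns = 0;

    /** Rejects the request when no model exists or the model was trained under another config. */
    double score(String detectorId, String currentConfig, double value) {
        StoredModel model = models.get(detectorId);
        if (model == null || !model.trainedConfig.equals(currentConfig)) {
            throw new ResourceNotFoundException("no usable model for " + detectorId);
        }
        return 0.0; // placeholder score; real scoring is out of scope for this sketch
    }

    void train(String detectorId, String currentConfig) {
        models.put(detectorId, new StoredModel(currentConfig));
        trainingRuns++;
    }

    /** Caller side: a rejection kicks off training, after which scoring resumes. */
    double scoreOrRetrain(String detectorId, String currentConfig, double value) {
        try {
            return score(detectorId, currentConfig, value);
        } catch (ResourceNotFoundException e) {
            train(detectorId, currentConfig);
            return score(detectorId, currentConfig, value);
        }
    }

    public static void main(String[] args) {
        ConfigAwareScorerSketch scorer = new ConfigAwareScorerSketch();
        scorer.train("detector-1", "sum(x),avg(y)");
        // The detector was updated to one feature; the stale model is rejected and retrained.
        scorer.scoreOrRetrain("detector-1", "sum(x)", 1.0);
        System.out.println("training runs: " + scorer.trainingRuns); // 2
    }
}
```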


2 participants