Update README.md
ditac authored Dec 18, 2019
1 parent 355aa5c commit 7cce53a
Showing 1 changed file with 13 additions and 68 deletions.
81 changes: 13 additions & 68 deletions README.md
@@ -23,92 +23,37 @@ __RCA__: An RCA is a function which operates on multiple datastreams. These data
__Symptom__: A symptom is an intermediate function operating on metric or symptom datastreams. For example, a high CPU utilization symptom can calculate a moving average over k samples and categorize CPU utilization (high/low) based on a threshold. Typically, the output of a Symptom node is binary: `presence` (true) or `absence` (false).
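
The moving-average example can be sketched in Java as follows. This is a minimal illustration, not the framework's actual Symptom class: `HighCpuSymptom`, its constructor parameters, and `movingAverage` are hypothetical names chosen for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a symptom node; NOT the framework's real Symptom API.
class HighCpuSymptom {
    private final int windowSize;      // k: number of samples in the moving average
    private final double threshold;    // utilization above this average counts as "high"
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    HighCpuSymptom(int windowSize, double threshold) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Feeds one sample; returns true (presence) when the moving average
    // over the last k samples exceeds the threshold, false (absence) otherwise.
    boolean operate(double cpuUtilization) {
        window.addLast(cpuUtilization);
        sum += cpuUtilization;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();   // drop the oldest sample
        }
        return movingAverage() > threshold;
    }

    double movingAverage() {
        return window.isEmpty() ? 0.0 : sum / window.size();
    }
}
```

Keeping a running sum avoids re-scanning the window on every sample, which matters when the node is evaluated on every scheduler tick.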


__Metrics__: Metric nodes query Performance Analyzer(PA) metrics and expose them as a continuous data stream to downstream nodes. They can also be extended to pull custom metrics from other data sources.

__Metrics__: Metric nodes query Performance Analyzer(PA) metrics and expose them as a continuous data stream to downstream nodes. They can be extended to pull custom metrics from other data sources.

### Components

__Framework__: The RCA runtime operates on an `AnalysisGraph`. Extend this class and override the `construct` method if you wish to deploy new RCAs. Specify the path to the class in the `analysis-graph-implementor` section of `pa_config/rca.conf`. The `addLeaf` and `addAllUpstreams` helper methods make it convenient to specify dependencies between nodes of the graph.
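
As a rough sketch of how a graph might be wired up, the following uses simplified stand-in classes. `GraphNode`, `SketchAnalysisGraph`, and `HighCpuGraph` are hypothetical; the real `AnalysisGraph`, `addLeaf`, and `addAllUpstreams` signatures may differ, so consult the framework source before copying this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a graph node; NOT the framework's real node class.
class GraphNode {
    final String name;
    final List<GraphNode> upstreams = new ArrayList<>();

    GraphNode(String name) {
        this.name = name;
    }

    // Declares that this node consumes the output of the given nodes.
    void addAllUpstreams(List<GraphNode> nodes) {
        upstreams.addAll(nodes);
    }
}

// Simplified stand-in for AnalysisGraph (assumed shape, for illustration only).
abstract class SketchAnalysisGraph {
    final List<GraphNode> leaves = new ArrayList<>();

    // Registers a node with no upstream dependencies (typically a metric).
    void addLeaf(GraphNode leaf) {
        leaves.add(leaf);
    }

    abstract void construct();
}

class HighCpuGraph extends SketchAnalysisGraph {
    GraphNode rca;

    @Override
    void construct() {
        GraphNode cpuMetric = new GraphNode("CPU_Utilization");
        addLeaf(cpuMetric);

        GraphNode highCpuSymptom = new GraphNode("HighCpuSymptom");
        highCpuSymptom.addAllUpstreams(Arrays.asList(cpuMetric));

        rca = new GraphNode("HighCpuRca");
        rca.addAllUpstreams(Arrays.asList(highCpuSymptom));
    }
}
```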


__Scheduler__: The scheduler executes the `operate` method on each graph node in topological order as defined in the analysis graph. Nodes with no dependency can be executed in parallel. Use flow units to share data between RCA nodes and avoid shared objects as these can cause data races and performance bottlenecks.
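
One way to derive a topological execution order is Kahn's algorithm, sketched below. This illustrates the ordering idea only and is not the framework's scheduler; `SketchScheduler` and `executionOrder` are hypothetical names, and nodes are represented as plain strings for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; NOT the framework's scheduler.
class SketchScheduler {
    // upstreams maps each node to the nodes it depends on.
    // Returns an order in which every node appears after all of its upstreams.
    static List<String> executionOrder(Map<String, List<String>> upstreams) {
        Map<String, Integer> pending = new HashMap<>();            // unmet dependency counts
        Map<String, List<String>> downstreams = new HashMap<>();   // reverse edges
        for (Map.Entry<String, List<String>> entry : upstreams.entrySet()) {
            pending.put(entry.getKey(), entry.getValue().size());
            for (String upstream : entry.getValue()) {
                pending.putIfAbsent(upstream, 0);
                downstreams.computeIfAbsent(upstream, k -> new ArrayList<>()).add(entry.getKey());
            }
        }

        // Nodes with no dependencies are ready immediately (and could run in parallel).
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String downstream : downstreams.getOrDefault(node, new ArrayList<>())) {
                if (pending.merge(downstream, -1, Integer::sum) == 0) {
                    ready.add(downstream);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("analysis graph contains a cycle");
        }
        return order;
    }
}
```

All nodes that sit in the ready queue at the same time have no dependency on one another, which is what allows them to be executed in parallel.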

__WireHopper__: This is the networking orchestrator for the framework. It abstracts away the fact
that not all graph nodes are executed on the same physical machine. During the bootstrap of the
scheduler, the WireHopper sends an intent to the remote nodes, telling them that their data will be
needed by the bootstrapping machine. While the RCA is running, for each evaluated node that has
remote subscribers, the WireHopper transports the data to them. It uses gRPC internally for that.
The code can be found in the `WireHopper` class.

__Tasklets__: A tasklet is essentially a Task wrapper on top of a graph node that makes it operate
seamlessly with the Java async APIs. The tasklet's `execute` method does the main share of the work.
For each node, it waits for all the upstream nodes to complete execution and then executes the node
itself. If the output is needed by one or more remote machines, it uses the WireHopper to send it to
them.
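
The wait-then-execute behavior maps naturally onto `CompletableFuture`. The sketch below is an assumption about the general shape, not the framework's tasklet implementation: `SketchTasklet` is a hypothetical class, its output is a plain string, and the `remoteSender` callback stands in for the WireHopper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical tasklet; NOT the framework's real Tasklet class.
class SketchTasklet {
    private final String name;
    private final List<SketchTasklet> upstreams;
    private final Consumer<String> remoteSender;   // stand-in for WireHopper; null if no remote subscribers
    private CompletableFuture<String> result;      // memoized so shared upstreams run once

    SketchTasklet(String name, List<SketchTasklet> upstreams, Consumer<String> remoteSender) {
        this.name = name;
        this.upstreams = upstreams;
        this.remoteSender = remoteSender;
    }

    // Builds the future for this node. Call from a single thread while wiring the graph;
    // the memoization here is not thread-safe.
    CompletableFuture<String> execute() {
        if (result == null) {
            List<CompletableFuture<String>> inputs = new ArrayList<>();
            for (SketchTasklet upstream : upstreams) {
                inputs.add(upstream.execute());
            }
            result = CompletableFuture
                .allOf(inputs.toArray(new CompletableFuture[0]))   // wait for every upstream
                .thenApplyAsync(ignored -> {
                    List<String> values = new ArrayList<>();
                    for (CompletableFuture<String> input : inputs) {
                        values.add(input.join());                  // already complete, does not block
                    }
                    String output = name + "[" + String.join(",", values) + "]";
                    if (remoteSender != null) {
                        remoteSender.accept(output);               // ship the result to remote subscribers
                    }
                    return output;
                });
        }
        return result;
    }
}
```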

__Resources__: As stated, the RCA framework gives us the state of each resource. Now this
definition is incomplete unless we define which resources we account for. The resources are broken
into these layers:

- Network
  - TCP
- Hardware
  - CPU
  - Memory
  - Disks
  - NICs
- OS
  - Scheduler
  - Memory Manager
- JVM
  - Heap
  - Garbage Collector
  - JIT
- ElasticSearch
  - Indices
  - Shards
  - Locks
  - Queues
  - Threads

__Context__: The context explains why the resource was deemed unhealthy. It might contain data
such as the current value and the threshold it was compared against. A context can be anything that
later helps us understand why the resource was flagged as unhealthy.

__Thresholds__: Thresholds are static values against which we compare the current values to evaluate
the symptoms and the RCAs. Because the framework gives you the full flexibility of the Java
language, you can set thresholds as constant members of the RCA/Symptom classes, but putting them in
a threshold JSON file has advantages:

- Changing the class constants requires re-deploying the jar, but changes to the threshold file can
be picked up dynamically.

- The threshold file lets you pick different thresholds based on instance type, disk type, or
arbitrary tags.

__Tags__: The RCA system does not understand ElasticSearch concepts such as data nodes and
master nodes. Tags are simple key-value pairs that are specified in rca.conf. When the RCA
scheduler starts, it reads them and takes them as its own tags. When it evaluates the graph, it
evaluates a graph node locally if the tags of that node match its own tags. If they do not match,
it assumes the graph node is to be executed on some remote machine (a remote graph node). If such a
remote node is upstream, the scheduler sends an intent to consume its data. If such a node is
downstream, the scheduler sends the evaluated result of the local graph node to all its subscribers.

__Networking__: The networking layer handles RPCs for remote RCA execution. If a host depends on a remote datastream for RCA computation, it subscribes to the datastream on startup. Subsequently, the output of every RCA execution on the upstream host is streamed to the downstream subscriber.

__WireHopper__: Interface between the scheduler and the networking layer to help in remote RCA execution.

__Context__: The context contains a brief summary of the RCA. For example, the context of a high CPU utilization symptom contains the average CPU utilization at the time the symptom was triggered.

__Thresholds__: Thresholds are static values that must be exceeded to trigger symptoms and RCAs. Thresholds can be updated dynamically and don't require a process restart. They often depend on hardware configuration and Elasticsearch version. The threshold store supports tags that attach metadata to a threshold.
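
A tag-aware threshold lookup could look roughly like the sketch below. The class `SketchThresholdStore`, its methods, and the "first matching override wins" rule are all assumptions made for illustration; the actual threshold store and its file format are not described here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical threshold store; NOT the framework's real implementation.
class SketchThresholdStore {
    private static final class Override {
        final Map<String, String> tags;
        final double value;

        Override(Map<String, String> tags, double value) {
            this.tags = new HashMap<>(tags);
            this.value = value;
        }
    }

    private final double defaultValue;
    private final List<Override> overrides = new ArrayList<>();

    SketchThresholdStore(double defaultValue) {
        this.defaultValue = defaultValue;
    }

    // Registers a threshold that applies only to hosts carrying all the given tags.
    void addOverride(Map<String, String> tags, double value) {
        overrides.add(new Override(tags, value));
    }

    // Returns the first override whose tags are all present on the host,
    // falling back to the default threshold (assumed resolution rule).
    double resolve(Map<String, String> hostTags) {
        for (Override override : overrides) {
            if (hostTags.entrySet().containsAll(override.tags.entrySet())) {
                return override.value;
            }
        }
        return defaultValue;
    }
}
```

Because callers go through `resolve` on every evaluation instead of caching a constant, a reload of the underlying data would take effect without a process restart.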

__Tags__: Tags are key-value pairs that are specified in the configuration file (rca.conf). Tags can be associated with both hosts and RCA nodes.
* RCA nodes are only executed on hosts with the exact same tags as the RCA node. A common use case for tags is to restrict certain RCA nodes to execute only on the master node.
* Tags are also used by hosts to find and subscribe to remote datastreams. For example, a cluster-wide RCA running on the master can subscribe to datastreams from all data hosts in the cluster.
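
The exact-match rule for deciding where a node runs can be sketched as follows. `SketchTagPlacement` and the tag keys used in the usage note are hypothetical; they only illustrate comparing a host's tags with each node's tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; NOT part of the framework.
class SketchTagPlacement {
    // Returns the names of the nodes whose tags are exactly equal to the host's tags.
    // All other nodes are treated as remote from this host's point of view.
    static List<String> localNodes(Map<String, String> hostTags,
                                   Map<String, Map<String, String>> tagsByNode) {
        List<String> local = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : tagsByNode.entrySet()) {
            if (hostTags.equals(entry.getValue())) {
                local.add(entry.getKey());
            }
        }
        return local;
    }
}
```

For instance, on a host tagged `locus=master` (an assumed tag key), only nodes tagged `locus=master` would be returned, while nodes tagged `locus=data` would be left for remote hosts to evaluate.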

## Walkthrough of an RCA

[Link to the blog]

[Link to the design RFC]

## Building, Deploying, and Running the RCA Framework
Please refer to the [Install Guide](./INSTALL.md) for detailed information on building, installing and running the RCA framework.

## Current Limitations
* This is alpha code and is in development.
* We don't have 100% unit test coverage yet and will continue to add new unit tests. We invite developers from the larger Open Distro community to contribute and help improve test coverage and give us feedback on where improvements can be made in design, code and documentation.
* Currently we have tested and verified RCA artifacts for Docker images. We will be testing and verifying these artifacts for RPM and Debian builds as well.
* We have tested and verified RCA artifacts only for Docker images. Other distributions are pending and will be done as part of the release.

## Code of Conduct
