Add Kmeans and AD command documentation (#493) (#503)
Signed-off-by: jackieyanghan <jkhanjob@gmail.com>
(cherry picked from commit ee4bce0)

Co-authored-by: Jackie Han <41348518+jackiehanyang@users.noreply.github.com>
1 parent 38903b6 commit 1f0d7e1
Showing 9 changed files with 2,082 additions and 2 deletions.
1 change: 1 addition & 0 deletions docs/category.json
@@ -7,6 +7,7 @@
"user/admin/settings.rst"
],
"ppl_cli": [
"user/ppl/cmd/ad.rst",
"user/ppl/cmd/dedup.rst",
"user/ppl/cmd/eval.rst",
"user/ppl/cmd/fields.rst",
61 changes: 61 additions & 0 deletions docs/user/ppl/cmd/ad.rst
@@ -0,0 +1,61 @@
=============
ad
=============

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Description
============
| The ``ad`` command applies the Random Cut Forest (RCF) algorithm from the ml-commons plugin to the search result returned by a PPL command. Depending on the input, one of two RCF variants is used: fixed-in-time RCF for time-series data, or batch RCF for non-time-series data.

Fixed In Time RCF For Time-series Data Command Syntax
=====================================================
ad <shingle_size> <time_decay> <time_field>

* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
* time_decay: optional. It specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001.
* time_field: mandatory. It specifies the time field for RCF to use as time-series data.
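
To make the ``shingle_size`` description concrete, here is a minimal Python sketch of shingling: it slides a fixed-size window over consecutive records. The function name ``shingles`` and the sample readings are hypothetical, and this is an illustration of the concept only, not the ml-commons implementation.

```python
from typing import List, Sequence


def shingles(values: Sequence[float], shingle_size: int = 8) -> List[List[float]]:
    """Return every window of `shingle_size` consecutive records, oldest first."""
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    # Fewer records than shingle_size yields an empty list.
    return [
        list(values[i:i + shingle_size])
        for i in range(len(values) - shingle_size + 1)
    ]


# Hypothetical readings, shingled with a window of 3 for brevity.
readings = [10.0, 11.0, 12.0, 13.0, 14.0]
windows = shingles(readings, shingle_size=3)
print(windows)  # [[10.0, 11.0, 12.0], [11.0, 12.0, 13.0], [12.0, 13.0, 14.0]]
```

Each window groups the current record with the records immediately before it, so a larger ``shingle_size`` gives each scored point more preceding context.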


Batch RCF for Non-time-series Data Command Syntax
=================================================
ad <shingle_size> <time_decay>

* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
* time_decay: optional. It specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001.
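
One way to build intuition for ``time_decay`` is to assume that a record's influence falls off exponentially with its age. That exponential form is an assumption made for this sketch; the source only says the parameter controls how much of the recent past is considered. The function ``decay_weight`` is hypothetical and is not part of ml-commons.

```python
import math


def decay_weight(age: int, time_decay: float = 0.001) -> float:
    """Relative weight of a record that is `age` records old.

    Assumption: influence decays as exp(-time_decay * age).
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    return math.exp(-time_decay * age)


# Under this assumption, a larger time_decay forgets old records faster.
for age in (0, 100, 1000, 5000):
    print(age, round(decay_weight(age), 4))
```

Under this reading, the default of 0.001 would keep roughly the last thousand records meaningfully weighted, while a larger value would focus on a shorter recent window.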


Example 1: Detecting events in New York City from taxi ridership data with time-series data
===========================================================================================

This example trains an RCF model and uses it to detect anomalies in the time-series ridership data.

PPL query::

    os> source=nyc_taxi | fields value, timestamp | AD time_field='timestamp' | where value=10844.0
    +----------+---------------+-------+---------------+
    | value    | timestamp     | score | anomaly_grade |
    |----------+---------------+-------+---------------|
    | 10844.0  | 1404172800000 | 0.0   | 0.0           |
    +----------+---------------+-------+---------------+
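
The ``timestamp`` value in the result looks like milliseconds since the Unix epoch. That is an assumption based on the magnitude of the number, since the source does not state the unit. Under that assumption, a small Python helper (hypothetical name ``epoch_millis_to_utc``) converts it to a readable UTC time:

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Assumption: the timestamp column holds epoch milliseconds.
print(epoch_millis_to_utc(1404172800000).isoformat())  # 2014-07-01T00:00:00+00:00
```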


Example 2: Detecting events in New York City from taxi ridership data with non-time-series data
===============================================================================================

This example trains an RCF model and uses it to detect anomalies in the non-time-series ridership data.

PPL query::

    os> source=nyc_taxi | fields value | AD | where value=10844.0
    +----------+--------+-----------+
    | value    | score  | anomalous |
    |----------+--------+-----------|
    | 10844.0  | 0.0    | false     |
    +----------+--------+-----------+
38 changes: 38 additions & 0 deletions docs/user/ppl/cmd/kmeans.rst
@@ -0,0 +1,38 @@
=============
kmeans
=============

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Description
============
| The ``kmeans`` command applies the k-means algorithm from the ml-commons plugin to the search result returned by a PPL command.

Syntax
======
kmeans <cluster-number>

* cluster-number: mandatory. The number of clusters you want to group your data points into.
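
For readers unfamiliar with what the cluster number controls, the following is a minimal Python sketch of k-means using Lloyd's algorithm. It is illustrative only and is not the ml-commons implementation. The function names, the sample points, and the choice to seed centroids with the first ``k`` points (made to keep the sketch deterministic) are all assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iterations: int = 100) -> List[int]:
    """Assign each point a cluster ID in 0..k-1 using Lloyd's algorithm."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Hypothetical 2-D points forming two well-separated groups.
samples = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1)]
labels = kmeans(samples, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

The cluster IDs are arbitrary labels: they identify which points were grouped together, not any ordering or meaning of the groups.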


Example: Clustering of Iris Dataset
===================================

This example shows how to group samples from three Iris species (Iris setosa, Iris virginica, and Iris versicolor) into clusters based on four features measured from each sample: the length and width of the sepals and petals.

PPL query::

    os> source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3
    +--------------------+-------------------+--------------------+-------------------+-----------+
    | sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID |
    |--------------------+-------------------+--------------------+-------------------+-----------|
    | 5.1                | 3.5               | 1.4                | 0.2               | 1         |
    | 5.6                | 3.0               | 4.1                | 1.3               | 0         |
    | 6.7                | 2.5               | 5.8                | 1.8               | 2         |
    +--------------------+-------------------+--------------------+-------------------+-----------+
4 changes: 4 additions & 0 deletions docs/user/ppl/index.rst
@@ -36,12 +36,16 @@ The query start with search command and then flowing a set of command delimited

- `Syntax <cmd/syntax.rst>`_

- `ad command <cmd/ad.rst>`_

- `dedup command <cmd/dedup.rst>`_

- `eval command <cmd/eval.rst>`_

- `fields command <cmd/fields.rst>`_

- `kmeans command <cmd/kmeans.rst>`_

- `parse command <cmd/parse.rst>`_

- `rename command <cmd/rename.rst>`_
14 changes: 13 additions & 1 deletion doctest/build.gradle
@@ -3,6 +3,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

import java.util.concurrent.Callable
import org.opensearch.gradle.testclusters.RunTask

plugins {
@@ -49,7 +50,18 @@ clean.dependsOn(cleanBootstrap)

testClusters {
    docTestCluster {
        plugin ':plugin'
        plugin(provider(new Callable<RegularFile>(){
            @Override
            RegularFile call() throws Exception {
                return new RegularFile() {
                    @Override
                    File getAsFile() {
                        return fileTree("resources/ml-commons").getSingleFile()
                    }
                }
            }
        }))

        testDistribution = 'integ_test'
    }
}
Binary file not shown.