
Commit b100436

committed
attempt at sigterms example
1 parent a1b355e commit b100436

File tree

2 files changed: +370, -4 lines changed

300_Aggregations/75_sigterms.asciidoc

Lines changed: 369 additions & 3 deletions
=== Significant Terms Syntax

Because the Significant Terms (SigTerms) aggregation works by analyzing
statistics, you need a certain threshold of data before it becomes effective.
That means we won't be able to index a small amount of toy data for the demo.

Instead, we have a pre-prepared dataset of around 1.2 million documents. This is
saved as a Snapshot (for more information about Snapshot and Restore, see
<<backing-up-your-cluster>>) in our public demo repository. You can "restore"
this dataset into your cluster using these commands:

TODO -- add information about the public repo
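Until the repo details are filled in above, here is a sketch of what those
commands will look like; the repository name, URL, and snapshot name below are
all placeholders:

[source,js]
----
PUT /_snapshot/demo_repo <1>
{
    "type": "url",
    "settings": {
        "url": "http://example.com/demo/snapshots/" <2>
    }
}

POST /_snapshot/demo_repo/xyzbank/_restore <3>
----
<1> Register a snapshot repository named `demo_repo` (placeholder name)
<2> A read-only `url` repository pointing at the public demo snapshots (placeholder URL)
<3> Restore the `xyzbank` snapshot (placeholder name) into your cluster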
NOTE: The dataset is around 120MB and may take some time to download.

Let's take a look at some sample data, to get a feel for what we are working with:

[source,js]
----
GET /xyzbank/transaction/_search <1>

{
   "took": 23,
   "timed_out": false,
   "_shards": {...},
   "hits": {
      "total": 1285378,
      "max_score": 1,
      "hits": [
         {
            "_index": "xyzbank",
            "_type": "transaction",
            "_id": "6OZ186ogTe6kz7ZQkQ5P7g",
            "_score": 1,
            "_source": {
               "offset": 35832,
               "bytes": 53,
               "payee": 1654865987, <2>
               "payer": 1946731689, <3>
               "randInt": 689
            }
         },
...
----
<1> Execute a search without a query, so that we can see a random sampling of docs
<2> The party getting paid (e.g. the merchant)
<3> The party that is paying (e.g. the customer)

These documents represent transactions between customers and merchants. Each
document has a `payer` and a `payee` field, which hold the IDs of the various
customers and merchants. You can ignore `offset`, `bytes`, and `randInt`; they
are artifacts of the process used to generate the data.
In this demo, you are playing the role of a credit card auditor. You've been
informed that a number of customers have recently been defrauded via stolen
credit card numbers. You need to find the merchant (or merchants!) responsible.

Let's take a simple approach first:
[source,js]
----
GET /xyzbank/transaction/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "terms": {
               "payer": [
                  720968604,1579227144,512961472,75150786,450556979, <1>
                  1257085164,1147825721,91398831,981907205,869569457,
                  1389973594,277221401,996175755,1625580913,143444417,
                  2013476680,1856942381,237287347,1536133807
               ]
            }
         }
      }
   },
   "aggs": {
      "popular_payees": {
         "terms": {
            "field": "payee", <2>
            "size": 5
         }
      }
   }
}
----
<1> First we filter our dataset to just those customers that experienced problems
<2> Then we do a simple `terms` aggregation to find the most "popular" merchants
There are 19 customers here that have reported fraudulent activity on their cards.
We first apply a `terms` filter so that we can isolate their transactions from
the rest of the transactions in the dataset.

It is possible that there are 19 different, independent fraudulent merchants. But
it is more likely that there was one merchant that the majority of these 19
customers shopped at. One approach to finding this merchant is to identify the
most "popular" merchant shared by everyone in the group. We do this with a
`terms` aggregation, and get these results:
[source,js]
----
{
...
   "aggregations": {
      "popular_payees": {
         "buckets": [
            {
               "key": 1134790655, <1>
               "key_as_string": "1134790655",
               "doc_count": 12
            },
            {
               "key": 1954360162,
               "key_as_string": "1954360162",
               "doc_count": 12
            },
            {
               "key": 2569625,
               "key_as_string": "2569625",
               "doc_count": 9
            },
            {
               "key": 150677504,
               "key_as_string": "150677504",
               "doc_count": 5
            },
            {
               "key": 556339714,
               "key_as_string": "556339714",
               "doc_count": 5
            }
         ]
      }
   }
}
----
<1> Merchant `1134790655` was used by 12 of the 19 customers, etc.
OK, that's a good starting point. We can see that two merchants, `1134790655`
and `1954360162`, were both used by 12 of the 19 customers. That's an encouraging
start.

But as a sanity check, let's see how "popular" these merchants are in general:
[source,js]
----
GET /xyzbank/transaction/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "terms": {
               "payee": [
                  1134790655,1954360162,2569625,150677504,556339714 <1>
               ]
            }
         }
      }
   },
   "aggs": {
      "payees": { <2>
         "terms": {
            "field": "payee"
         },
         "aggs": {
            "distinct_customers": { <3>
               "cardinality": {
                  "field": "payer"
               }
            }
         }
      }
   }
}
----
<1> The filter limits our aggregations to the "popular" merchants
<2> The `terms` aggregation will show us how many transactions each merchant has
performed
<3> And the `cardinality` will tell us how many unique customers have shopped at
each of those merchants
This query looks at all the "popular" merchants and calculates two important
criteria: how many transactions have occurred with each merchant, and how many
distinct customers made those transactions. When we run the query, we see some
disappointing results:
[source,js]
----
...
"aggregations": {
   "payees": {
      "buckets": [
         {
            "key": 1954360162,
            "key_as_string": "1954360162",
            "doc_count": 4930,
            "distinct_customers": {
               "value": 3644
            }
         },
         {
            "key": 1134790655,
            "key_as_string": "1134790655",
            "doc_count": 4900,
            "distinct_customers": {
               "value": 4524
            }
         },
         {
            "key": 2569625,
            "key_as_string": "2569625",
            "doc_count": 19,
            "distinct_customers": {
               "value": 8
            }
         },
         {
            "key": 556339714,
            "key_as_string": "556339714",
            "doc_count": 9,
            "distinct_customers": {
               "value": 3
            }
         },
         {
            "key": 150677504,
            "key_as_string": "150677504",
            "doc_count": 5,
            "distinct_customers": {
               "value": 1
            }
         }
      ]
   }
}
...
----
Our two best candidates -- the most "popular" `1134790655` and `1954360162` --
appear to be the most popular merchants for _everyone_. They are probably
large stores like Amazon that everyone shops at. We can rule these merchants
out, since it is unlikely they decided to scam 19 of their roughly 4,000
customers. They are merely showing up in our analysis because their _background_
popularity is high.
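To put numbers on that: the 12 fraudulent customers who shopped at `1134790655`
represent only ~0.27% of its 4,524 distinct customers (12/4524 = 0.0027), far too
small an overlap to single it out.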
On the flip side, we can probably rule out `150677504` and `556339714`, since
they interacted with only one and three customers respectively. These are probably
small shops (think your corner store) that you and few other people shop at. If
you go to your corner store every morning, there will be many transactions for
_you_, but few for everyone else in the world. A shop like that can artificially
show up on a most "popular" list for this reason: its background popularity may
be low, but its overlap with the fraudulent group is also low.

Let's set up a SigTerms query and see what it has to say:
[source,js]
----
GET /xyzbank/transaction/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "terms": {
               "payer": [
                  720968604,1579227144,512961472,75150786,450556979,
                  1257085164,1147825721,91398831,981907205,869569457,
                  1389973594,277221401,996175755,1625580913,143444417,
                  2013476680,1856942381,237287347,1536133807
               ]
            }
         }
      }
   },
   "aggregations": {
      "payees": {
         "significant_terms": { <1>
            "field": "payee",
            "size": 5
         }
      }
   }
}
----
<1> Instead of a `terms` bucket, we use `significant_terms`
The query setup is almost identical to our "popular" query, but instead of `terms`
we use a `significant_terms` bucket. This will perform a process very similar to
what we did manually above (sketched in code after this list):

1. It will find all the unique terms in your _foreground_ -- the documents that
are found by the query. In our case, this will be all the documents that match
the filter on `payer`.
2. The _background_ rate for each of these terms is then calculated: how often
are these terms used outside the fraudulent population?
3. Finally, the background rate is compared against the foreground rate, and the
terms are ranked in descending order. This finds terms that are more popular in
the fraudulent group than in the background of normal transaction data.
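Conceptually, the comparison in step 3 boils down to a few lines. The sketch below
is illustrative only (the function and variable names are ours, not Elasticsearch
internals), but it mirrors the default significance heuristic, known as JLH:

[source,js]
----
// Illustrative sketch of the default (JLH) significance score.
function jlhScore(docCount, fgTotal, bgCount, bgTotal) {
    var fgRate = docCount / fgTotal; // popularity inside the query results
    var bgRate = bgCount / bgTotal;  // popularity across the whole index
    // Rewards both the absolute difference and the relative ratio of the rates
    return (fgRate - bgRate) * (fgRate / bgRate);
}

jlhScore(9, 2497, 15, 1285378); // ~1.1096, matching the score reported below
----

A term that is equally common everywhere has a ratio near 1 and scores near zero;
a term that is rare in general but common in the foreground scores high.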
The results look like this:

[source,js]
----
...
"aggregations": {
   "payees": {
      "doc_count": 2497, <1>
      "buckets": [
         {
            "key": 2569625,
            "key_as_string": "2569625",
            "doc_count": 9, <2>
            "score": 1.1096324319660162,
            "bg_count": 15 <3>
         },
         {
            "key": 131309720,
            "key_as_string": "131309720",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         },
         {
            "key": 1784307987,
            "key_as_string": "1784307987",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         },
         {
            "key": 150677504,
            "key_as_string": "150677504",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         },
         {
            "key": 1053742706,
            "key_as_string": "1053742706",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         }
      ]
   }
}
...
----
<1> This represents the number of transactions in our fraudulent group
<2> The `doc_count` for each merchant represents the popularity of this merchant
within the fraudulent group
<3> While the `bg_count` represents that merchant's total popularity across the
entire dataset

The SigTerms output can be a little confusing at first, so we will walk through
the various fields and how to interpret them.

SigTerms first tells us that 2497 transactions were performed by the customers
who reported fraud. Next comes a list of buckets representing merchants, ordered
in descending order by how statistically anomalous they are.

The most anomalous merchant is `2569625`. Nine of the fraudulent transactions
were performed at this merchant, giving it a foreground rate of ~0.36%
(9/2497 = 0.0036). This may seem tiny, but consider the "background" rate for
this merchant: ~0.001% (15/1285378 = 0.0000117).
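In other words, this merchant appears roughly 300 times more often inside the
fraudulent group than in the dataset at large ((9/2497) / (15/1285378) ≈ 308).
Plugging those two rates into the JLH sketch above reproduces the reported score:
(0.0036 - 0.0000117) * 308 ≈ 1.11. That disproportionate overlap, not raw
popularity, is exactly what SigTerms is designed to surface.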

520_Post_Deployment/50_backup.asciidoc

Lines changed: 1 addition & 1 deletion

[[backing-up-your-cluster]]
=== Backing up your Cluster

Like any software that stores data, it is important to routinely backup your
