
Commit b100436

committed
attempt at sigterms example
1 parent a1b355e commit b100436

File tree

2 files changed: +370, -4 lines changed

300_Aggregations/75_sigterms.asciidoc

Lines changed: 369 additions & 3 deletions
=== Significant Terms Syntax

Because the Significant Terms (SigTerms) aggregation works by analyzing
statistics, you need a certain threshold of data before it becomes effective.
That means we won't be able to index a small amount of toy data for the demo.

Instead, we have a pre-prepared dataset of around 1.2 million documents. This is
saved as a Snapshot (for more information about Snapshot and Restore, see
<<backing-up-your-cluster>>) in our public demo repository. You can "restore"
this dataset into your cluster using these commands:

TODO -- add information about the public repo
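Until the repo details are filled in above, here is a sketch of what those
commands will look like; the repository name, URL, and snapshot name below are
all placeholders:

[source,js]
----
PUT /_snapshot/demo_repo <1>
{
    "type": "url",
    "settings": {
        "url": "http://example.com/demo/snapshots/" <2>
    }
}

POST /_snapshot/demo_repo/xyzbank/_restore <3>
----
<1> Register a snapshot repository named `demo_repo` (placeholder name)
<2> A read-only `url` repository pointing at the public demo snapshots (placeholder URL)
<3> Restore the `xyzbank` snapshot (placeholder name) into your cluster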
NOTE: The dataset is around 120MB and may take some time to download.

Let's take a look at some sample data, to get a feel for what we are working with:

[source,js]
----
GET /xyzbank/transaction/_search <1>

{
   "took": 23,
   "timed_out": false,
   "_shards": {...},
   "hits": {
      "total": 1285378,
      "max_score": 1,
      "hits": [
         {
            "_index": "xyzbank",
            "_type": "transaction",
            "_id": "6OZ186ogTe6kz7ZQkQ5P7g",
            "_score": 1,
            "_source": {
               "offset": 35832,
               "bytes": 53,
               "payee": 1654865987, <2>
               "payer": 1946731689, <3>
               "randInt": 689
            }
         },
...
----
<1> Execute a search without a query, so that we can see a random sampling of docs
<2> The party getting paid (e.g. the merchant)
<3> The party that is paying (e.g. the customer)

These documents represent transactions between customers and merchants. Each
document has a `payer` and a `payee` field, which hold the IDs of the various
customers and merchants. You can ignore `offset`, `bytes`, and `randInt`; they
are artifacts of the process used to generate the data.
In this demo, you are playing the role of a credit card auditor. You've been
informed that a number of customers have recently been defrauded via stolen
credit card numbers. You need to find the merchant (or merchants!) responsible.

Let's take a simple approach first:
[source,js]
----
GET /xyzbank/transaction/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "terms": {
               "payer": [
                  720968604,1579227144,512961472,75150786,450556979, <1>
                  1257085164,1147825721,91398831,981907205,869569457,
                  1389973594,277221401,996175755,1625580913,143444417,
                  2013476680,1856942381,237287347,1536133807
               ]
            }
         }
      }
   },
   "aggs": {
      "popular_payees": {
         "terms": {
            "field": "payee", <2>
            "size": 5
         }
      }
   }
}
----
<1> First we filter our dataset to just those customers that experienced problems
<2> Then we do a simple `terms` aggregation to find the most "popular" merchants
There are 19 customers here that have reported fraudulent activity on their cards.
We first apply a `terms` filter so that we can isolate their transactions from
the rest of the transactions in the dataset.

It is possible that there are 19 different, independent fraudulent merchants. But
it is more likely that there was one merchant that the majority of these 19
customers shopped at. One approach to finding this merchant is to identify the
most "popular" merchant shared by everyone in the group. We do this with a
`terms` aggregation, and get these results:
[source,js]
----
{
...
   "aggregations": {
      "popular_payees": {
         "buckets": [
            {
               "key": 1134790655, <1>
               "key_as_string": "1134790655",
               "doc_count": 12
            },
            {
               "key": 1954360162,
               "key_as_string": "1954360162",
               "doc_count": 12
            },
            {
               "key": 2569625,
               "key_as_string": "2569625",
               "doc_count": 9
            },
            {
               "key": 150677504,
               "key_as_string": "150677504",
               "doc_count": 5
            },
            {
               "key": 556339714,
               "key_as_string": "556339714",
               "doc_count": 5
            }
         ]
      }
   }
}
----
<1> Merchant `1134790655` was used by 12 of the 19 customers, etc.
OK, that's a good starting point. We can see that two merchants, `1134790655`
and `1954360162`, were both used by 12 of the 19 customers. That's an encouraging
start.

But as a sanity check, let's see how "popular" these merchants are in general:
[source,js]
----
GET /xyzbank/transaction/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "terms": {
               "payee": [
                  1134790655,1954360162,2569625,150677504,556339714 <1>
               ]
            }
         }
      }
   },
   "aggs": {
      "payees": { <2>
         "terms": {
            "field": "payee"
         },
         "aggs": {
            "distinct_customers": { <3>
               "cardinality": {
                  "field": "payer"
               }
            }
         }
      }
   }
}
----
<1> The filter limits our aggregations to the "popular" merchants
<2> The `terms` aggregation will show us how many transactions each merchant has
performed
<3> And the `cardinality` will tell us how many unique customers have shopped at
each of those merchants
This query looks at all the "popular" merchants and calculates two important
criteria: how many transactions have occurred with each merchant, and how many
distinct customers made those transactions. When we run the query, we see some
disappointing results:
[source,js]
----
...
"aggregations": {
   "payees": {
      "buckets": [
         {
            "key": 1954360162,
            "key_as_string": "1954360162",
            "doc_count": 4930,
            "distinct_customers": {
               "value": 3644
            }
         },
         {
            "key": 1134790655,
            "key_as_string": "1134790655",
            "doc_count": 4900,
            "distinct_customers": {
               "value": 4524
            }
         },
         {
            "key": 2569625,
            "key_as_string": "2569625",
            "doc_count": 19,
            "distinct_customers": {
               "value": 8
            }
         },
         {
            "key": 556339714,
            "key_as_string": "556339714",
            "doc_count": 9,
            "distinct_customers": {
               "value": 3
            }
         },
         {
            "key": 150677504,
            "key_as_string": "150677504",
            "doc_count": 5,
            "distinct_customers": {
               "value": 1
            }
         }
      ]
   }
}
...
----
Our two best candidates -- the most "popular" `1134790655` and `1954360162` --
appear to be the most popular merchants for _everyone_. They are probably
large stores like Amazon that everyone shops at. We can rule these merchants
out, since it is unlikely they decided to scam 19 of their roughly 4,000
customers. They are merely showing up in our analysis because their _background_
popularity is high.
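To put numbers on that: the 12 fraudulent customers who shopped at `1134790655`
represent only ~0.27% of its 4,524 distinct customers (12/4524 = 0.0027), far too
small an overlap to single it out.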
On the flip side, we can probably rule out `150677504` and `556339714`, since
they interacted with only one and three customers respectively. These are probably
small shops (think your corner store) that you and few other people shop at. If
you go to your corner store every morning, there will be many transactions for
_you_, but few for everyone else in the world. A shop like that can artificially
show up on a most "popular" list for this reason: its background popularity may
be low, but its overlap with the fraudulent group is also low.

Let's set up a SigTerms query and see what it has to say:
[source,js]
----
GET /xyzbank/transaction/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "terms": {
               "payer": [
                  720968604,1579227144,512961472,75150786,450556979,
                  1257085164,1147825721,91398831,981907205,869569457,
                  1389973594,277221401,996175755,1625580913,143444417,
                  2013476680,1856942381,237287347,1536133807
               ]
            }
         }
      }
   },
   "aggregations": {
      "payees": {
         "significant_terms": { <1>
            "field": "payee",
            "size": 5
         }
      }
   }
}
----
<1> Instead of a `terms` bucket, we use `significant_terms`
The query setup is almost identical to our "popular" query, but instead of `terms`
we use a `significant_terms` bucket. This will perform a process very similar to
what we did manually above (sketched in code after this list):

1. It will find all the unique terms in your _foreground_ -- the documents that
are found by the query. In our case, this will be all the documents that match
the filter on `payer`.
2. The _background_ rate for each of these terms is then calculated: how often
are these terms used outside the fraudulent population?
3. Finally, the background rate is compared against the foreground rate, and the
terms are ranked in descending order. This finds terms that are more popular in
the fraudulent group than in the background of normal transaction data.
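Conceptually, the comparison in step 3 boils down to a few lines. The sketch below
is illustrative only (the function and variable names are ours, not Elasticsearch
internals), but it mirrors the default significance heuristic, known as JLH:

[source,js]
----
// Illustrative sketch of the default (JLH) significance score.
function jlhScore(docCount, fgTotal, bgCount, bgTotal) {
    var fgRate = docCount / fgTotal; // popularity inside the query results
    var bgRate = bgCount / bgTotal;  // popularity across the whole index
    // Rewards both the absolute difference and the relative ratio of the rates
    return (fgRate - bgRate) * (fgRate / bgRate);
}

jlhScore(9, 2497, 15, 1285378); // ~1.1096, matching the score reported below
----

A term that is equally common everywhere has a ratio near 1 and scores near zero;
a term that is rare in general but common in the foreground scores high.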
The results look like this:

[source,js]
----
...
"aggregations": {
   "payees": {
      "doc_count": 2497, <1>
      "buckets": [
         {
            "key": 2569625,
            "key_as_string": "2569625",
            "doc_count": 9, <2>
            "score": 1.1096324319660162,
            "bg_count": 15 <3>
         },
         {
            "key": 131309720,
            "key_as_string": "131309720",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         },
         {
            "key": 1784307987,
            "key_as_string": "1784307987",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         },
         {
            "key": 150677504,
            "key_as_string": "150677504",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         },
         {
            "key": 1053742706,
            "key_as_string": "1053742706",
            "doc_count": 5,
            "score": 1.0287723722612108,
            "bg_count": 5
         }
      ]
   }
}
...
----
<1> This represents the number of transactions in our fraudulent group
<2> The `doc_count` for each merchant represents the popularity of this merchant
within the fraudulent group
<3> While the `bg_count` represents that merchant's total popularity across the
entire dataset

The SigTerms output can be a little confusing at first, so we will walk through
the various fields and how to interpret them.

SigTerms first tells us that 2497 transactions were performed by the customers
who reported fraud. Next comes a list of buckets representing merchants, ordered
in descending order by how statistically anomalous they are.

The most anomalous merchant is `2569625`. Nine of the fraudulent transactions
were performed at this merchant, giving it a foreground rate of ~0.36%
(9/2497 = 0.0036). This may seem tiny, but consider the "background" rate for
this merchant: ~0.001% (15/1285378 = 0.0000117).
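In other words, this merchant appears roughly 300 times more often inside the
fraudulent group than in the dataset at large ((9/2497) / (15/1285378) ≈ 308).
Plugging those two rates into the JLH sketch above reproduces the reported score:
(0.0036 - 0.0000117) * 308 ≈ 1.11. That disproportionate overlap, not raw
popularity, is exactly what SigTerms is designed to surface.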

520_Post_Deployment/50_backup.asciidoc

Lines changed: 1 addition & 1 deletion

[[backing-up-your-cluster]]
=== Backing up your Cluster

Like any software that stores data, it is important to routinely backup your
