|
1 | 1 |
|
2 | 2 | === Significant Terms Syntax
|
3 | 3 |
|
4 |
| -Let's index some financial data which represents transactions between |
5 |
| -merchants and customers: |
| 4 | +Because the Significant Terms (SigTerms) aggregation works by analyzing |
| 5 | +statistics, you need to have a certain threshold of data for it to become effective. |
| 6 | +That means we won't be able to index a small amount of toy data for the demo. |
| 7 | + |
| 8 | +Instead, we have a prepared dataset of roughly 1.3 million documents. This is
| 9 | +saved as a Snapshot (for more information about Snapshot and Restore, see |
| 10 | +<<backing-up-your-cluster>>) in our public demo repository. You can "restore" |
| 11 | +this dataset into your cluster using these commands: |
| 12 | + |
| 13 | +TODO -- add information about the public repo |
| 14 | + |
| 15 | +Note: The dataset is around 120 MB and may take some time to download.
| 16 | + |
| 17 | +Let's take a look at some sample data, to get a feel for what we are working with: |
| 18 | + |
| 19 | +[source,js] |
| 20 | +---- |
| 21 | +GET /xyzbank/transaction/_search <1> |
| 22 | +
|
| 23 | +{ |
| 24 | + "took": 23, |
| 25 | + "timed_out": false, |
| 26 | + "_shards": {...}, |
| 27 | + "hits": { |
| 28 | + "total": 1285378, |
| 29 | + "max_score": 1, |
| 30 | + "hits": [ |
| 31 | + { |
| 32 | + "_index": "xyzbank", |
| 33 | + "_type": "transaction", |
| 34 | + "_id": "6OZ186ogTe6kz7ZQkQ5P7g", |
| 35 | + "_score": 1, |
| 36 | + "_source": { |
| 37 | + "offset": 35832, |
| 38 | + "bytes": 53, |
| 39 | + "payee": 1654865987, <2> |
| 40 | + "payer": 1946731689, <3> |
| 41 | + "randInt": 689 |
| 42 | + } |
| 43 | + }, |
| 44 | +---- |
| 45 | +<1> Execute a search without a query, so that we can see an arbitrary sample of docs
| 46 | +<2> The party getting paid (e.g. the merchant) |
| 47 | +<3> The party that is paying (e.g. the customer) |
| 48 | + |
| 49 | + |
| 50 | +These documents represent transactions between customers and merchants. Each document |
| 51 | +has a payer and payee field, which represent the IDs of various customers/merchants. |
| 52 | +You can ignore `offset`, `bytes`, and `randInt`; they are artifacts of the process
| 53 | +used to generate the data. |
| 54 | + |
| 55 | +In this demo, you are playing the role of a credit card auditor. You've been |
| 56 | +informed that a number of customers have recently been defrauded via stolen |
| 57 | +credit card numbers. You need to find the merchant (or merchants!) who are |
| 58 | +responsible. |
| 59 | + |
| 60 | +Let's take a simple approach first. |
| 61 | + |
| 62 | +[source,js] |
| 63 | +---- |
| 64 | +GET /xyzbank/transaction/_search?search_type=count |
| 65 | +{ |
| 66 | + "query": { |
| 67 | + "filtered": { |
| 68 | + "filter": { |
| 69 | + "terms": { |
| 70 | + "payer": [ |
| 71 | + 720968604,1579227144,512961472,75150786,450556979, <1> |
| 72 | + 1257085164,1147825721,91398831,981907205,869569457, |
| 73 | + 1389973594,277221401,996175755,1625580913,143444417, |
| 74 | + 2013476680,1856942381,237287347,1536133807 |
| 75 | + ] |
| 76 | + } |
| 77 | + } |
| 78 | + } |
| 79 | + }, |
| 80 | + "aggs": { |
| 81 | + "popular_payees": { |
| 82 | + "terms": { |
| 83 | + "field": "payee", <2> |
| 84 | + "size": 5 |
| 85 | + } |
| 86 | + } |
| 87 | + } |
| 88 | +} |
| 89 | +---- |
| 90 | +<1> First we filter our dataset to just those customers that experienced problems |
| 91 | +<2> Then we do a simple `terms` aggregation to find the most "popular" merchants |
| 92 | + |
| 93 | +There are 19 customers here that have reported fraudulent activity on their card. |
| 94 | +We first apply a `terms` filter so that we can isolate their transactions from
| 95 | +the rest of the transactions in the dataset. |
| 96 | + |
| 97 | +It is possible there are 19 different, independent fraudulent merchants. But it |
| 98 | +is more likely there was one merchant which the majority of these 19 customers |
| 99 | +shopped at. One approach to finding this merchant may be identifying the most |
| 100 | +"popular" merchant shared by everyone in the group. We do this with a `terms` |
| 101 | +aggregation, and get these results: |
| 102 | + |
| 103 | +[source,js] |
| 104 | +---- |
| 105 | +{ |
| 106 | + ... |
| 107 | + "aggregations": { |
| 108 | + "popular_payees": { |
| 109 | + "buckets": [ |
| 110 | + { |
| 111 | + "key": 1134790655, <1> |
| 112 | + "key_as_string": "1134790655", |
| 113 | + "doc_count": 12 |
| 114 | + }, |
| 115 | + { |
| 116 | + "key": 1954360162, |
| 117 | + "key_as_string": "1954360162", |
| 118 | + "doc_count": 12 |
| 119 | + }, |
| 120 | + { |
| 121 | + "key": 2569625, |
| 122 | + "key_as_string": "2569625", |
| 123 | + "doc_count": 9 |
| 124 | + }, |
| 125 | + { |
| 126 | + "key": 150677504, |
| 127 | + "key_as_string": "150677504", |
| 128 | + "doc_count": 5 |
| 129 | + }, |
| 130 | + { |
| 131 | + "key": 556339714, |
| 132 | + "key_as_string": "556339714", |
| 133 | + "doc_count": 5 |
| 134 | + } |
| 135 | + ] |
| 136 | + } |
| 137 | + } |
| 138 | +} |
| 139 | +---- |
| 140 | +<1> Merchant `1134790655` appears in 12 transactions made by the 19 customers, and so on
| 141 | + |
| 142 | +OK, that's a good starting point. Two merchants, `1134790655` and `1954360162`,
| 143 | +each account for 12 of the transactions made by these 19 customers, which makes
| 144 | +them our first suspects.
| 145 | + |
| 146 | +But as a sanity check, let's see how "popular" these merchants are in general:
| 147 | + |
| 148 | +[source,js] |
| 149 | +---- |
| 150 | +GET /xyzbank/transaction/_search?search_type=count |
| 151 | +{ |
| 152 | + "query": { |
| 153 | + "filtered": { |
| 154 | + "filter": { |
| 155 | + "terms": { |
| 156 | + "payee": [ |
| 157 | + 1134790655,1954360162,2569625,150677504,556339714 <1> |
| 158 | + ] |
| 159 | + } |
| 160 | + } |
| 161 | + } |
| 162 | + }, |
| 163 | + "aggs": { |
| 164 | + "payees": { <2> |
| 165 | + "terms": { |
| 166 | + "field": "payee" |
| 167 | + }, |
| 168 | + "aggs": { |
| 169 | + "distinct_customers": { <3> |
| 170 | + "cardinality": { |
| 171 | + "field": "payer" |
| 172 | + } |
| 173 | + } |
| 174 | + } |
| 175 | + } |
| 176 | + } |
| 177 | +} |
| 178 | +---- |
| 179 | +<1> The filter limits our aggregations to "popular" merchants |
| 180 | +<2> The `terms` aggregation will show us how many transactions each merchant has
| 181 | +performed
| 182 | +<3> And the `cardinality` will tell us how many unique customers have shopped at |
| 183 | +each of those merchants |
| 184 | + |
| 185 | +This query will look at all the "popular" merchants, and calculate two important |
| 186 | +criteria: how many transactions have occurred with that merchant, and how many |
| 187 | +customers made those transactions. When we run the query, we see some disappointing |
| 188 | +results: |
| 189 | + |
| 190 | +[source,js] |
| 191 | +---- |
| 192 | +... |
| 193 | + "aggregations": { |
| 194 | + "payees": { |
| 195 | + "buckets": [ |
| 196 | + { |
| 197 | + "key": 1954360162, |
| 198 | + "key_as_string": "1954360162", |
| 199 | + "doc_count": 4930, |
| 200 | + "distinct_customers": { |
| 201 | + "value": 3644 |
| 202 | + } |
| 203 | + }, |
| 204 | + { |
| 205 | + "key": 1134790655, |
| 206 | + "key_as_string": "1134790655", |
| 207 | + "doc_count": 4900, |
| 208 | + "distinct_customers": { |
| 209 | + "value": 4524 |
| 210 | + } |
| 211 | + }, |
| 212 | + { |
| 213 | + "key": 2569625, |
| 214 | + "key_as_string": "2569625", |
| 215 | + "doc_count": 19, |
| 216 | + "distinct_customers": { |
| 217 | + "value": 8 |
| 218 | + } |
| 219 | + }, |
| 220 | + { |
| 221 | + "key": 556339714, |
| 222 | + "key_as_string": "556339714", |
| 223 | + "doc_count": 9, |
| 224 | + "distinct_customers": { |
| 225 | + "value": 3 |
| 226 | + } |
| 227 | + }, |
| 228 | + { |
| 229 | + "key": 150677504, |
| 230 | + "key_as_string": "150677504", |
| 231 | + "doc_count": 5, |
| 232 | + "distinct_customers": { |
| 233 | + "value": 1 |
| 234 | + } |
| 235 | + } |
| 236 | + ] |
| 237 | + } |
| 238 | + } |
| 239 | +... |
| 240 | +---- |
| 241 | + |
| 242 | +Our two best candidates -- the most "popular" `1134790655` and `1954360162` --
| 243 | +appear to be the most popular merchants for _everyone_. They are probably |
| 244 | +large stores like Amazon that everyone shops at. We can rule these merchants |
| 245 | +out, since it is unlikely they decided to scam 19 out of their 4000 customers. |
| 246 | +They are merely showing up in our analysis because their _background_ popularity |
| 247 | +is high. |
| 248 | + |
| 249 | +On the flip side, we can probably rule out `150677504` and `556339714`, since
| 250 | +they only interacted with 1 and 3 customers respectively. These are probably
| 251 | +small shops (think your corner store) that you and a few other people shop at. If
| 252 | +you go to your corner store every morning, there will be many transactions for _you_, |
| 253 | +but few for everyone else in the world. It might artificially show up on a |
| 254 | +most "popular" list for this reason. The background popularity may be low, |
| 255 | +but the overlap with the fraudulent group is also low. |
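
To make the manual comparison concrete, here is a small Python sketch that puts the numbers from the two result listings above side by side. The counts are copied from those listings; the `flagged_share` helper and the variable names are our own and are only for illustration.

```python
# Numbers copied from the two result listings above.
# payee id: (transactions by the 19 flagged customers,
#            all transactions at that merchant,
#            distinct customers at that merchant)
merchants = {
    1134790655: (12, 4900, 4524),
    1954360162: (12, 4930, 3644),
    2569625: (9, 19, 8),
    150677504: (5, 5, 1),
    556339714: (5, 9, 3),
}


def flagged_share(flagged_count, total_count):
    """Fraction of a merchant's transactions that came from the flagged group."""
    return flagged_count / total_count


shares = {
    payee: flagged_share(flagged, total)
    for payee, (flagged, total, _) in merchants.items()
}

for payee, (flagged, total, customers) in merchants.items():
    print(f"{payee}: {flagged}/{total} = {shares[payee]:.2%} of transactions, "
          f"{customers} distinct customers")
```

For the two large merchants, the flagged group accounts for well under 1% of their transactions, which is what we would expect from background popularity alone.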
| 256 | + |
| 257 | +Let's set up a SigTerms query and see what it has to say: |
| 258 | + |
| 259 | +[source,js] |
| 260 | +---- |
| 261 | +GET /xyzbank/transaction/_search?search_type=count |
| 262 | +{ |
| 263 | + "query":{ |
| 264 | + "filtered": { |
| 265 | + "filter": { |
| 266 | + "terms": { |
| 267 | + "payer": [ |
| 268 | + 720968604,1579227144,512961472,75150786,450556979, |
| 269 | + 1257085164,1147825721,91398831,981907205,869569457, |
| 270 | + 1389973594,277221401,996175755,1625580913,143444417, |
| 271 | + 2013476680,1856942381,237287347,1536133807 |
| 272 | + ] |
| 273 | + } |
| 274 | + } |
| 275 | + } |
| 276 | + }, |
| 277 | + "aggregations":{ |
| 278 | + "payees":{ |
| 279 | + "significant_terms":{ <1> |
| 280 | + "field":"payee", |
| 281 | + "size": 5 |
| 282 | + } |
| 283 | + } |
| 284 | + } |
| 285 | +} |
| 286 | +---- |
| 287 | +<1> Instead of a `terms` bucket, we use a `significant_terms` bucket
| 288 | + |
| 289 | +The query setup is almost identical to our "popular" query, but instead of `terms` |
| 290 | +we use a `significant_terms` bucket. This will perform a process very similar to
| 291 | +what we did manually above: |
| 292 | + |
| 293 | +1. It will find all unique terms in your _foreground_ -- the documents which are
| 294 | +found by the query. In our case, this will be all the documents that match
| 295 | +the filter for `payer`.
| 296 | +2. The _background_ rate for each of these terms is then calculated. How often
| 297 | +is each term used across the dataset as a whole?
| 298 | +3. Finally, each term's foreground rate is compared against its background rate, and
| 299 | +the terms are ranked in descending order. This finds terms that are more popular in
| 300 | +the fraudulent group than in the background of normal transaction data.
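
The three steps can be sketched in a few lines of Python over an in-memory list of transactions. This is only an illustration of the idea, not how Elasticsearch implements it: the toy data, the `significant_payees` function, and the plain foreground/background ratio used for ranking are all assumptions made for this example.

```python
from collections import Counter


def significant_payees(transactions, flagged_payers, size=5):
    """Rank payees by foreground rate divided by background rate.

    The plain ratio is an assumption for this sketch; the real
    aggregation may score terms differently.
    """
    # Step 1: the foreground is every transaction made by a flagged payer.
    foreground = [t for t in transactions if t["payer"] in flagged_payers]
    fg_counts = Counter(t["payee"] for t in foreground)

    # Step 2: background counts are taken over the whole dataset.
    bg_counts = Counter(t["payee"] for t in transactions)

    # Step 3: compare the two rates and sort in descending order.
    ranked = []
    for payee, fg_count in fg_counts.items():
        fg_rate = fg_count / len(foreground)
        bg_rate = bg_counts[payee] / len(transactions)
        ranked.append((payee, fg_count, bg_counts[payee], fg_rate / bg_rate))
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked[:size]


# Hypothetical toy data: everyone shops at "big_store",
# but almost only the flagged payers shop at "shady_shop".
transactions = (
    [{"payer": p, "payee": "big_store"} for p in range(100)]
    + [{"payer": p, "payee": "shady_shop"} for p in (1, 2, 3)]
    + [{"payer": 50, "payee": "shady_shop"}]
)
flagged = {1, 2, 3}

result = significant_payees(transactions, flagged)
for payee, fg_count, bg_count, ratio in result:
    print(payee, fg_count, bg_count, round(ratio, 2))
```

Both merchants appear three times in the foreground, so a plain `terms` count cannot separate them. Once the background rate is taken into account, `shady_shop` ranks far above `big_store`.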
| 301 | + |
| 302 | +The results look like this: |
| 303 | + |
| 304 | +[source,js] |
| 305 | +---- |
| 306 | +... |
| 307 | + "aggregations": { |
| 308 | + "payees": { |
| 309 | + "doc_count": 2497, <1> |
| 310 | + "buckets": [ |
| 311 | + { |
| 312 | + "key": 2569625, |
| 313 | + "key_as_string": "2569625", |
| 314 | + "doc_count": 9, <2> |
| 315 | + "score": 1.1096324319660162, |
| 316 | + "bg_count": 15 <3> |
| 317 | + }, |
| 318 | + { |
| 319 | + "key": 131309720, |
| 320 | + "key_as_string": "131309720", |
| 321 | + "doc_count": 5, |
| 322 | + "score": 1.0287723722612108, |
| 323 | + "bg_count": 5 |
| 324 | + }, |
| 325 | + { |
| 326 | + "key": 1784307987, |
| 327 | + "key_as_string": "1784307987", |
| 328 | + "doc_count": 5, |
| 329 | + "score": 1.0287723722612108, |
| 330 | + "bg_count": 5 |
| 331 | + }, |
| 332 | + { |
| 333 | + "key": 150677504, |
| 334 | + "key_as_string": "150677504", |
| 335 | + "doc_count": 5, |
| 336 | + "score": 1.0287723722612108, |
| 337 | + "bg_count": 5 |
| 338 | + }, |
| 339 | + { |
| 340 | + "key": 1053742706, |
| 341 | + "key_as_string": "1053742706", |
| 342 | + "doc_count": 5, |
| 343 | + "score": 1.0287723722612108, |
| 344 | + "bg_count": 5 |
| 345 | + } |
| 346 | + ] |
| 347 | + } |
| 348 | + } |
| 349 | + ... |
| 350 | +---- |
| 351 | +<1> This represents the number of transactions in our fraudulent group |
| 352 | +<2> The `doc_count` for each merchant represents the popularity of this merchant |
| 353 | +within the fraudulent group |
| 354 | +<3> The `bg_count` represents the merchant's total popularity across the entire
| 355 | +dataset
| 356 | + |
| 357 | +The SigTerms output can be a little confusing at first, so we will walk through |
| 358 | +the various fields and how to interpret them.
| 359 | + |
| 360 | +SigTerms first tells us that 2497 transactions were performed by the customers who |
| 361 | +reported fraud. Next comes a list of buckets representing merchants, ordered in |
| 362 | +descending order based on how statistically anomalous they are. |
| 363 | + |
| 364 | +The most anomalous merchant is `2569625`. Nine of the group's transactions
| 365 | +were performed at this merchant, giving it a foreground rate of ~0.36%
| 366 | +(9/2497 = 0.0036). This may seem tiny, but consider the "background" rate for
| 367 | +this merchant: ~0.001% (15/1285378 = 0.000012).
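
As a quick check of the arithmetic, the two rates can be computed directly from the counts in the response. The last step also tries one candidate scoring formula, (foreground - background) * (foreground / background). That formula is an assumption on our part, not something stated in the text above, although for these counts it comes out close to the `score` listed for merchant `2569625`.

```python
# Counts taken from the SigTerms response for merchant 2569625.
fg_count, fg_total = 9, 2497          # doc_count of the bucket / of the foreground
bg_count, bg_total = 15, 1285378      # bg_count of the bucket / total hits in the index

fg_rate = fg_count / fg_total
bg_rate = bg_count / bg_total

# How many times more common the merchant is in the foreground.
lift = fg_rate / bg_rate

# Assumed scoring formula (not confirmed by the text): absolute change
# in rate multiplied by relative change in rate.
score = (fg_rate - bg_rate) * lift

print(f"foreground rate: {fg_rate:.4%}")
print(f"background rate: {bg_rate:.4%}")
print(f"lift: {lift:.1f}x")
print(f"score: {score:.4f}")
```

The merchant is roughly 300 times more common among the flagged customers' transactions than in the dataset as a whole, which is why it rises to the top of the list despite its small absolute counts.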
| 368 | + |
| 369 | + |
| 370 | + |
| 372 | + |
6 | 373 |
|
7 |
| -TODO setup repo with financial data, restore, etc |
|