Improve documentation and fix some version URI error (#21)
mrproliu authored Sep 4, 2024
1 parent 3c533d9 commit dd89e88
Showing 5 changed files with 67 additions and 16 deletions.
56 changes: 54 additions & 2 deletions README.md
@@ -25,8 +25,6 @@ Currently, R3 offers a simple gRPC service that could be deployed easily at loca

The simple server is the best way to get started, and it can steadily serve 500+ SkyWalking services * 3000 URIs per minute.

- TODO: Fault tolerence and persistence is not implemented yet.

To run the R3 service on localhost:

```bash
@@ -39,6 +37,60 @@ To deploy as a container:
docker run -d --name r3 -p 17128:17128 r3:latest
```

### Demo

#### Restful Pattern Recognition

The following URLs are recognized as the pattern `/api/users/{var}`, since the last part of the URL differs for each instance.

* /api/users/cbf11b02ea464447b507e8852c32190a
* /api/users/5e363a4a18b7464b8cbff1a7ee4c91ca
* /api/users/44cf77fc351f4c6c9c4f1448f2f12800
* /api/users/38d3be5f9bd44f7f98906ea049694511
* /api/users/5ad14302e7924f4aa1d60e58d65b3dd2
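
As a rough illustration of the idea (not the URIDrain implementation, which clusters URIs with a Drain-style tree), the naive sketch below merges same-length URIs position by position. The function name `merge_into_template` and the default `{var}` placeholder argument are assumptions made for this example only.

```python
def merge_into_template(uris, placeholder="{var}"):
    """Naively merge same-length URIs into one template, position by position."""
    token_rows = [uri.strip("/").split("/") for uri in uris]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("all URIs must have the same number of segments")
    template = []
    for column in zip(*token_rows):
        # Keep a segment that is identical everywhere, mask one that varies.
        template.append(column[0] if len(set(column)) == 1 else placeholder)
    return "/" + "/".join(template)


sample_uris = [
    "/api/users/cbf11b02ea464447b507e8852c32190a",
    "/api/users/5e363a4a18b7464b8cbff1a7ee4c91ca",
    "/api/users/44cf77fc351f4c6c9c4f1448f2f12800",
]
print(merge_into_template(sample_uris))
```

Only the last segment differs across the samples, so it is the only position replaced by the placeholder.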

#### Word Detection

The following URLs keep their original form and are not parametrized, since every part of each URL is a word.

* /api/sale
* /api/product_sale
* /api/ProductSale
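
A minimal sketch of how such a word check could look, assuming a tiny hypothetical vocabulary (`KNOWN_WORDS`) in place of the English corpus the project actually consults. The helpers `split_words` and `is_word_token` are illustrative names, not part of the project.

```python
import re

# Hypothetical stand-in vocabulary; a real setup would query an English corpus.
KNOWN_WORDS = {"api", "sale", "product"}


def split_words(token):
    """Split a token on underscores, hyphens, and CamelCase boundaries."""
    words = []
    for part in re.split(r"[_\-]+", token):
        words.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [word.lower() for word in words]


def is_word_token(token, vocabulary=KNOWN_WORDS):
    """Return True when every piece of the token is a known word."""
    words = split_words(token)
    return bool(words) and all(word in vocabulary for word in words)


for token in ["sale", "product_sale", "ProductSale", "cbf11b02ea464447b507e8852c32190a"]:
    print(token, is_word_token(token))
```

Tokens made only of known words are treated as fixed path segments, while an opaque identifier fails the check and remains a candidate for parametrization.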

#### Lower Sample Count

URLs keep their original form and are not parametrized while the sample count is lower than the threshold (`combine_min_url_count`).
Once the sample count is equal to or greater than the threshold, the URLs are parametrized.

For example, with a threshold of `3`, the following URLs keep their original form and are not parametrized.

* /api/fetch1
* /api/fetch2

But the following URLs are parametrized to `/api/{var}`, since the sample count reaches the threshold.

* /api/fetch1
* /api/fetch2
* /api/fetch3
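
The threshold rule can be sketched as below. This is a simplified illustration under the assumption that the candidates are counted as unique URLs; `should_parametrize` is a hypothetical helper, not a function from the project.

```python
def should_parametrize(candidate_urls, combine_min_url_count=3):
    """Return True once the number of unique candidate URLs reaches the threshold."""
    unique_count = len(set(candidate_urls))
    return unique_count >= combine_min_url_count


print(should_parametrize(["/api/fetch1", "/api/fetch2"]))
print(should_parametrize(["/api/fetch1", "/api/fetch2", "/api/fetch3"]))
```

Duplicates do not count toward the threshold in this sketch, so repeating `/api/fetch1` many times would not trigger parametrization on its own.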

#### Versioned API

If a part of the URI is a version number, such as `v1`, `v2`, or `v3`, that part is not parametrized.

For example, the following URLs are not parametrized:

* /test/v1
* /test/v2
* /test/v3

The other parts of the URI can still be parametrized. For example, the following URIs are parametrized to `/test/v1/{var}` and `/test/v999/{var}`.

* /test/v1/cbf11b02ea464447b507e8852c32190a
* /test/v1/5e363a4a18b7464b8cbff1a7ee4c91ca
* /test/v1/38d3be5f9bd44f7f98906ea049694511
* /test/v999/1
* /test/v999/2
* /test/v999/3
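
A small sketch of the version check, assuming a version token is exactly `v` followed by one or more digits (the `v\d+` pattern). The names `is_version_token` and `version_positions` are hypothetical helpers for illustration.

```python
import re

VERSION_TOKEN = re.compile(r"v\d+")


def is_version_token(token):
    """Return True when the whole token looks like v1, v2, v999, ..."""
    return VERSION_TOKEN.fullmatch(token) is not None


def version_positions(uri):
    """Return the indexes of the segments that must not be parametrized."""
    tokens = uri.strip("/").split("/")
    return [index for index, token in enumerate(tokens) if is_version_token(token)]


print(version_positions("/test/v1/cbf11b02ea464447b507e8852c32190a"))
print(is_version_token("version"))
```

Only the version segment is protected, so a segment that follows it (an identifier after `v1`, for example) remains eligible for parametrization.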

### Algorithm: URIDrain
If you are curious about how the algorithm actually works or want to improve upon it, please first read the [URIDrain Overview](models/README.md) and check out the algorithm live demo by running the commands below:
2 changes: 1 addition & 1 deletion demo/uri_drain.ini
@@ -33,7 +33,7 @@ depth = 4
max_children = 100
max_clusters = 1024
extra_delimiters = ["/"]
- combine_min_url_count = ${DRAIN_COMBINE_MIN_URL_COUNT:8}
+ combine_min_url_count = ${DRAIN_COMBINE_MIN_URL_COUNT:3}

[PROFILING]
enabled = True
2 changes: 1 addition & 1 deletion models/Configuration.md
@@ -36,7 +36,7 @@ Drain is the core algorithm of URI Drain.
| max_clusters | int | DRAIN_MAX_CLUSTERS | 1024 | Max number of tracked clusters (unlimited by default). When this number is reached, the model starts replacing old clusters with new ones according to the LRU policy. |
| extra_delimiters | string | DRAIN_EXTRA_DELIMITERS | \["/"\] | The extra delimiters to split the sequence. |
| analysis_min_url_count | int | DRAIN_ANALYSIS_MIN_URL_COUNT | 20 | The minimum number of unique URLs(each service) to trigger the analysis. |
- | combine_min_url_count | int | DRAIN_COMBINE_MIN_URL_COUNT | 8 | The minimum number of unique URLs (candidates of each service) required to mask them as a variable URL (in case some similar URLs are not RESTful, such as `/test/one` and `/test/two`). |
+ | combine_min_url_count | int | DRAIN_COMBINE_MIN_URL_COUNT | 3 | The minimum number of unique URLs (candidates of each service) required to mask them as a variable URL (in case some similar URLs are not RESTful, such as `/test/one` and `/test/two`). |
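
As a hedged sketch of how an environment override with a default could be read, the snippet below falls back to the documented default of `3` when `DRAIN_COMBINE_MIN_URL_COUNT` is unset. The helper `read_int_setting` is hypothetical and is not the project's actual configuration loader.

```python
import os


def read_int_setting(env_name, default, env=None):
    """Read an integer setting from the environment, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get(env_name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Passing an explicit mapping keeps the example independent of the real environment.
print(read_int_setting("DRAIN_COMBINE_MIN_URL_COUNT", 3, env={}))
print(read_int_setting("DRAIN_COMBINE_MIN_URL_COUNT", 3, env={"DRAIN_COMBINE_MIN_URL_COUNT": "8"}))
```

Omitting the `env` argument makes the helper read from `os.environ`, which is how an operator would restore the previous threshold of `8` without editing the ini file.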

### Profiling

7 changes: 2 additions & 5 deletions models/README.md
@@ -28,7 +28,8 @@ the URI domain. Which includes:
3. The URIDrain algorithm doesn't involve pre-masking of the URI sequences to prevent false assumptions.
4. The URIDrain algorithm takes preceding and subsequent URI tokens into account when deciding if a matched cluster
should be updated.
- 5. **TODO**: The URIDrain algorithm optionally use English Corpus to help identify likely non-parameter tokens.
+ 5. The URIDrain algorithm uses an [English Corpus](https://github.com/sloria/TextBlob) to help identify likely non-parameter tokens.
+ 6. The URIDrain algorithm supports versioned API (`v\d+`) detection to prevent version tokens from being parametrized.

**Known Caveats**:
The algorithm may provide false clustering in some edge cases (although it doesn't hurt at all in APM scenarios).
@@ -64,7 +65,3 @@ This project rely on gRPC to communicate with the Apache SkyWalking AI pipeline.
in the `server/proto/` folder.

Compile the proto by running `make gen` or simply `make env` if you are getting started from a bare environment.

- ### TODO
- Try catch statements to handle uncovered algorithm errors

16 changes: 9 additions & 7 deletions models/uri_drain/uri_drain.py
@@ -620,17 +620,19 @@ def create_template(self, seq1, seq2):
# self.logger.debug(f'tokens of sequence2 = {seq2}')
return "rejected"
# ASSUMPTION: A subsequent token to version number cannot be a param
- if pre_token is not None and pre_token.startswith(
- 'v') and pre_token[1:].isdigit():
- # self.logger.debug('pre_token is a version number, so current token cannot be a param (assumption)')
- # self.logger.debug(f'tokens of sequence2 = {seq2}')
- return "rejected"
+ # This check should be deleted because a param path must be permitted after a version number path,
+ # e.g. /test/v1/abcdef and /test/v1/bcdefg should be merged into /test/v1/{var}
+ # if pre_token is not None and pre_token.startswith(
+ # 'v') and pre_token[1:].isdigit():
+ # # self.logger.debug('pre_token is a version number, so current token cannot be a param (assumption)')
+ # # self.logger.debug(f'tokens of sequence2 = {seq2}')
+ # return "rejected"
if token1.startswith('v') and token1[1:].isdigit():
# self.logger.debug('token1 is a version number, so current token cannot be a param (assumption)')
# self.logger.debug(f'tokens of sequence2 = {seq2}')
return "rejected"
- if pre_token and self.has_numbers(pre_token):
- # Based on assumption that no two consecutive tokens can be params
+ if pre_token and (not pre_token.startswith('v')) and self.has_numbers(pre_token):
+ # Based on assumption that no two consecutive tokens can be params(unless the pre token is versioned)
# So attempt to change this position must ensure that the previous token is not a param
# self.logger.debug('pre_token has numbers, so current token cannot be a param (assumption)')
# self.logger.debug(f'tokens of sequence2 = {seq2}')