URIDrain fixes (#2)
Superskyyy authored Jun 17, 2023
1 parent fd3d4f4 commit 0e19c8f
Showing 19 changed files with 3,672 additions and 613 deletions.
1 change: 1 addition & 0 deletions .licenserc.yaml
@@ -35,5 +35,6 @@ header:
- '.git*'
- '.idea*'
- 'models' # TODO: should be changed to drain3-related only, match not working on windows (\\) vs (/)?
- '**/*.proto'

comment: on-failure
7 changes: 6 additions & 1 deletion Makefile
@@ -23,8 +23,12 @@ else
OS := $(shell sh -c 'uname 2>/dev/null || echo Unknown')
endif

.PHONY: gen
gen:
poetry run python -m tools.grpc_gen

.PHONY: env
env: poetry
env: poetry gen
poetry install --all-extras
poetry run pip install --upgrade pip

@@ -43,6 +47,7 @@ ifeq ($(OS),Windows)
poetry self update
else
-curl -sSL https://install.python-poetry.org | python3 -
export PATH="$HOME/.local/bin:$PATH"
poetry self update || $(MAKE) poetry-fallback
endif

45 changes: 44 additions & 1 deletion README.md
@@ -2,6 +2,10 @@

RESTful Pattern Recognition(R3) for Apache SkyWalking AI pipeline.

**IMPORTANT**: A baseline dataset is still needed to verify changes to the algorithm. Until that dataset
is ready, the development process cannot guarantee the algorithm's correctness.


### Demo

To run a demo of the algorithm (integration is pending to SkyWalking):
@@ -39,4 +43,43 @@ the URI domain. Which includes:
3. The URIDrain algorithm doesn't involve pre-masking of the URI sequences to prevent false assumptions.
4. The URIDrain algorithm takes preceding and subsequent URI tokens into account when deciding if a matched cluster
should be updated.
5. **TODO**: The URIDrain algorithm optionally uses an English corpus to help identify likely non-parameter tokens.

**Known Caveats**:
The algorithm may produce false clusterings in some edge cases (this is harmless in APM scenarios).
The cause is that, in extremely rare cases, different endpoints happen to share a common parameter value
(when the incoming sequence arrives in an unlucky order). Example:
```text
cur_sim = 0.25 for cluster 1, cluster.log_template_tokens = ('api', 'v2', 'customers', 'xyz789') with param count = 0
cur_sim = 0.5 for cluster 3, cluster.log_template_tokens = ('api', 'v1', 'projects', '<:VAR:>') with param count = 1
cur_sim = 0.75 for cluster 7, cluster.log_template_tokens = ('api', 'v1', 'wallets', 'abcdef') with param count = 0
cur_sim = 0.5 for cluster 8, cluster.log_template_tokens = ('api', 'v1', 'bills', 'abcxyz') with param count = 0
cur_sim = 0.5 for cluster 9, cluster.log_template_tokens = ('api', 'v1', 'services', 'abc456') with param count = 0
cur_sim = 0.75 for cluster 12, cluster.log_template_tokens = ('api', 'v1', 'companies', 'xyz123') with param count = 0
seq1 (incoming uri) = api/v1/companies/abcdef
seq2 (matched template) = api/v1/wallets/abcdef
```
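
The scores in the log above are consistent with a position-by-position token comparison. The following is a minimal sketch under that assumption, not the project's actual implementation: `seq_similarity` and `best_match` are hypothetical names, and the tie-breaking rule (more wildcards first, then the earlier cluster) is assumed from the note about param-count preference.

```python
from typing import Iterable, Optional, Sequence, Tuple

PARAM = "<:VAR:>"  # wildcard marker as it appears in the templates above


def seq_similarity(template: Sequence[str], tokens: Sequence[str]) -> Tuple[float, int]:
    """Return (fraction of positions with identical tokens, wildcard count)."""
    if len(template) != len(tokens):
        raise ValueError("sequences must have the same number of tokens")
    if not template:
        return 1.0, 0
    same = 0
    params = 0
    for t, u in zip(template, tokens):
        if t == PARAM:
            params += 1  # wildcards are counted separately, not as matches
        elif t == u:
            same += 1
    return same / len(template), params


def best_match(
    clusters: Iterable[Tuple[int, Sequence[str]]], tokens: Sequence[str]
) -> Tuple[Optional[int], float]:
    """Pick the highest-scoring cluster (assumed tie-break: more wildcards, then earlier)."""
    best_id, best_sim, best_params = None, -1.0, -1
    for cluster_id, template in clusters:
        sim, params = seq_similarity(template, tokens)
        if sim > best_sim or (sim == best_sim and params > best_params):
            best_id, best_sim, best_params = cluster_id, sim, params
    return best_id, best_sim


clusters = [
    (1, ("api", "v2", "customers", "xyz789")),
    (3, ("api", "v1", "projects", PARAM)),
    (7, ("api", "v1", "wallets", "abcdef")),
    (8, ("api", "v1", "bills", "abcxyz")),
    (9, ("api", "v1", "services", "abc456")),
    (12, ("api", "v1", "companies", "xyz123")),
]
incoming = tuple("api/v1/companies/abcdef".split("/"))
print(best_match(clusters, incoming))  # (7, 0.75)
```

Clusters 7 and 12 both score 0.75, and cluster 7 wins only because it was seen first, which is how the `wallets` template ends up matching a `companies` URI.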
Another example:
```text
seq1 (matched 1) = api/v1/haha/456/anotherpath2
seq1 (incoming uri) = api/v1/haha/456/actualpath1
seq2 (matched 2) = api/v1/haha/123/actualpath1
```
This could be mitigated by using NLTK to detect which tokens are likely to be parameters; this is not implemented yet.

As a general rule, do not trust a template that has size 1 and no identified params; it is likely a false classification.

TODO: Add postprocessing for such single templates (a template stays single because the algorithm prefers the correct template with a param count).
(If and only if a template has size 1, has no params, and is almost identical to another template, merge them.)
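
One possible shape for this postprocessing step is sketched below. It is an illustration only: the `Template` class, `absorbs`, and `merge_singletons` are hypothetical names, "size" is assumed to mean the number of URIs matched by a template, and "almost identical" is assumed to mean "equal at every position where the other template has no wildcard".

```python
from dataclasses import dataclass
from typing import List, Tuple

PARAM = "<:VAR:>"  # wildcard marker used in the templates


@dataclass
class Template:
    tokens: Tuple[str, ...]
    size: int  # assumed: number of URIs that matched this template


def absorbs(general: Template, single: Template) -> bool:
    """True if `general` has a wildcard and equals `single` at every non-wildcard position."""
    if len(general.tokens) != len(single.tokens) or PARAM not in general.tokens:
        return False
    return all(g == PARAM or g == s for g, s in zip(general.tokens, single.tokens))


def merge_singletons(templates: List[Template]) -> List[Template]:
    """Fold size-1, wildcard-free templates into the first template that absorbs them."""
    kept = [Template(t.tokens, t.size) for t in templates
            if t.size != 1 or PARAM in t.tokens]
    singles = [t for t in templates if t.size == 1 and PARAM not in t.tokens]
    for single in singles:
        target = next((k for k in kept if absorbs(k, single)), None)
        if target is not None:
            target.size += single.size
        else:
            kept.append(Template(single.tokens, single.size))  # nothing close enough; keep it
    return kept


merged = merge_singletons([
    Template(("api", "v1", "wallets", PARAM), 5),
    Template(("api", "v1", "wallets", "abcdef"), 1),
    Template(("health",), 1),
])
print(merged)
```

Singletons that no wildcard template covers are kept as-is, so a genuinely unique endpoint is never merged away.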

### Integration
This project relies on gRPC to communicate with the Apache SkyWalking AI pipeline. The gRPC service definition can be found
in the `server/proto/` folder.

Compile the proto files by running `make gen`, or simply `make env` if you are starting from a bare environment.
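
For readers unfamiliar with proto compilation in Python, here is a rough sketch of what a generation script could look like. It is not the contents of `tools.grpc_gen`: the function names and the `server/proto/generated` output directory are assumptions, and it relies on `grpc_tools.protoc.main` from the `grpcio-tools` package.

```python
from pathlib import Path
from typing import List


def build_protoc_args(proto_dir: str, out_dir: str) -> List[str]:
    """Build the argument list for grpc_tools.protoc for every .proto file in proto_dir."""
    protos = sorted(str(p) for p in Path(proto_dir).glob("*.proto"))
    return [
        "grpc_tools.protoc",  # argv[0] placeholder expected by protoc.main
        f"-I{proto_dir}",
        f"--python_out={out_dir}",
        f"--grpc_python_out={out_dir}",
        *protos,
    ]


def generate(proto_dir: str = "server/proto",
             out_dir: str = "server/proto/generated") -> int:  # out_dir is hypothetical
    """Run the compiler; returns its exit code (0 on success)."""
    from grpc_tools import protoc  # provided by the grpcio-tools package

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return protoc.main(build_protoc_args(proto_dir, out_dir))
```

Keeping the argument construction separate from the compiler call makes the script easy to check without having `grpcio-tools` installed.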

### TODO
Add try/except statements to handle uncovered algorithm errors.
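
A minimal sketch of what this could look like, assuming a per-URI clustering callable: `cluster_all`, `cluster_one`, and the `r3` logger name are hypothetical and not taken from the codebase.

```python
import logging
from typing import Callable, Iterable, List, Optional, Tuple

logger = logging.getLogger("r3")  # hypothetical logger name


def cluster_all(
    uris: Iterable[str], cluster_one: Callable[[str], str]
) -> List[Tuple[str, Optional[str]]]:
    """Feed URIs to cluster_one; record failures as None instead of aborting the batch."""
    results: List[Tuple[str, Optional[str]]] = []
    for uri in uris:
        try:
            results.append((uri, cluster_one(uri)))
        except Exception:
            # Log the traceback and continue with the remaining URIs.
            logger.exception("clustering failed for %r", uri)
            results.append((uri, None))
    return results
```

Catching per URI means one malformed endpoint cannot take down a whole batch, while the logged traceback still points to the uncovered case.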

2 changes: 1 addition & 1 deletion demo/Endpoint100_trivial.txt
@@ -41,7 +41,7 @@
/api/v999/orders/def456def456def456def456def456def456def456def456def456def456def456def456
/api/v999/orders/abcxyzdef456def456def456def456def456
/api/v999/orders/123abc456
/api/v777/orders/123abc456
/api/v777/orders/123abc456-thing-only-appear-once-must-be-handled-by-nltk-in-v2
/api/v1/accounts/123
/api/v1/invoices/abc
/api/v1/accounts/xyz
17 changes: 11 additions & 6 deletions demo/Endpoint200_hard.txt
@@ -3,7 +3,7 @@
/api/v1/companies/789
/api/v1/companies/abc
/api/v1/companies/def
/api/v1/companies/ghi
/api/v1/companies/ghig
/api/v1/companies/123abc
/api/v1/companies/xyz123
/api/v1/companies/abc456
@@ -17,7 +17,7 @@
/api/v1/projects/789
/api/v1/projects/abc
/api/v1/projects/def
/api/v1/projects/ghi
/api/v1/projects/ghibb
/api/v1/projects/123abc
/api/v1/projects/xyz123
/api/v1/projects/abc456
@@ -86,17 +86,21 @@
/customer/hahaha0917/profile/121033/compare/hahaha0297/profile/12105
/customer/hahaha0927/profile/121
/api/v1/users/123/posts/456/comments/789
/api/v1/users/abc/posts/def/comments/ghi
/api/v1/users/abc/posts/def/comments/ghihaha
/api/v1/companies/123/employees/456/reviews/789
/api/v1/companies/abc/employees/def/reviews/ghi
/api/v1/companies/abc/employees/def/reviews/9971029019
/api/v1/companies/123/tasks/456/assignees/789
/api/v1/companies/abc/tasks/def/assignees/ghi
/api/v1/companies/abc/tasks/def/assignees/ghibbb
/this-is-illegalstyleendpoint/1112/1231313/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/this-is-illegalstyleendpoint/1112/1231313/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/this-is-illegalstyleendpoint/1112/1231313/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/this-is-illegalstyleendpoint/2222/2222222/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/api/v1/users/123/posts/456/comments
/api/v1/users/789/posts/321/comments
/api/v1/users/abc/posts/def/comments
/api/v1/users/xyz/posts/mno/comments
/api/v1/users/101/posts/102/comments
/api/v1/users/ghi/posts/jkl/comments
/api/v1/users/g213hi/posts/jkl/comments
/api/v1/users/pqr/posts/stu/comments
/api/v1/users/111/posts/222/comments
/api/v1/users/333/posts/444/comments
@@ -159,4 +163,5 @@ google.com/api/v1/users/123
tmall.cn/api/v1/users/904
special.helloword.com/api/v1/users/123ada
top1.abc.example.com.net.cn/api/v1/users/badwdw
top1.abc.example.com.net.cn/api/v1/users/this-url-is-special-since-domain-hasdigits
GET:/api/v2/users/1222222223/similar-to-these-single-occurrence-is-not-handled-correctly-will-be-handled-by-nltk!!!!!