Implement grpc service for URIDrain #3

Merged
merged 5 commits into from
Jun 20, 2023
URIDrain fixes
Superskyyy committed Jun 17, 2023
commit ccf0f3c970bc212b066e42ff01ac7bf7c2649021
7 changes: 6 additions & 1 deletion Makefile
@@ -23,8 +23,12 @@ else
OS := $(shell sh -c 'uname 2>/dev/null || echo Unknown')
endif

.PHONY: all
gen:
poetry run python -m tools.grpc_gen

.PHONY: env
env: poetry
env: poetry gen
poetry install --all-extras
poetry run pip install --upgrade pip

@@ -43,6 +47,7 @@ ifeq ($(OS),Windows)
poetry self update
else
-curl -sSL https://install.python-poetry.org | python3 -
export PATH="$HOME/.local/bin:$PATH"
poetry self update || $(MAKE) poetry-fallback
endif

45 changes: 44 additions & 1 deletion README.md
@@ -2,6 +2,10 @@

RESTful Pattern Recognition(R3) for Apache SkyWalking AI pipeline.

**IMPORTANT**: A baseline dataset is needed to verify changes to the algorithm. Until that dataset is ready,
the development process cannot guarantee the correctness of the algorithm.


### Demo

To run a demo of the algorithm (integration is pending to SkyWalking):
@@ -39,4 +43,43 @@ the URI domain. Which includes:
3. The URIDrain algorithm doesn't involve pre-masking of the URI sequences to prevent false assumptions.
4. The URIDrain algorithm takes preceding and subsequent URI tokens into account when deciding if a matched cluster
should be updated.
5. **TODO**: The URIDrain algorithm optionally use English Corpus to help identify likely non-parameter tokens.
5. **TODO**: The URIDrain algorithm optionally use English Corpus to help identify likely non-parameter tokens.

**Known Caveats**:
The algorithm may produce false clusterings in some edge cases (although this does no harm in APM scenarios).
This happens when different endpoints accidentally share a common param value, which is extremely rare
and depends on an unlucky order of incoming sequences. Example:
```text
cur_sim = 0.25 for cluster 1, cluster.log_template_tokens = ('api', 'v2', 'customers', 'xyz789') with param count = 0
cur_sim = 0.5 for cluster 3, cluster.log_template_tokens = ('api', 'v1', 'projects', '<:VAR:>') with param count = 1
cur_sim = 0.75 for cluster 7, cluster.log_template_tokens = ('api', 'v1', 'wallets', 'abcdef') with param count = 0
cur_sim = 0.5 for cluster 8, cluster.log_template_tokens = ('api', 'v1', 'bills', 'abcxyz') with param count = 0
cur_sim = 0.5 for cluster 9, cluster.log_template_tokens = ('api', 'v1', 'services', 'abc456') with param count = 0
cur_sim = 0.75 for cluster 12, cluster.log_template_tokens = ('api', 'v1', 'companies', 'xyz123') with param count = 0
seq1 (incoming uri) = api/v1/companies/abcdef
seq2 (matched template) = api/v1/wallets/abcdef
```
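
The `cur_sim` values in the listing above are consistent with a position-wise token comparison in which wildcard positions do not count as matches. Below is a minimal sketch under that assumption; `positional_similarity` is a hypothetical helper name, not the project's actual function, and it ignores the neighbouring-token check the algorithm applies before updating a cluster.

```python
from typing import Sequence, Tuple

PARAM = "<:VAR:>"  # wildcard marker as it appears in the listing above


def positional_similarity(
    uri_tokens: Sequence[str], template_tokens: Sequence[str]
) -> Tuple[float, int]:
    """Return (share of positions with identical tokens, wildcard count).

    Assumption: wildcard positions never count as exact matches, and
    sequences of different length are treated as not comparable.
    """
    if not uri_tokens or len(uri_tokens) != len(template_tokens):
        return 0.0, 0
    param_count = sum(1 for t in template_tokens if t == PARAM)
    exact = sum(
        1 for u, t in zip(uri_tokens, template_tokens) if t != PARAM and u == t
    )
    return exact / len(template_tokens), param_count


incoming = "api/v1/companies/abcdef".split("/")
clusters = {
    1: ("api", "v2", "customers", "xyz789"),
    3: ("api", "v1", "projects", PARAM),
    7: ("api", "v1", "wallets", "abcdef"),
    12: ("api", "v1", "companies", "xyz123"),
}
for cluster_id, template in clusters.items():
    sim, params = positional_similarity(incoming, template)
    print(f"cluster {cluster_id}: cur_sim = {sim}, param count = {params}")
```

Clusters 7 and 12 tie at 0.75 under this measure, so the first one encountered can win even though cluster 12 shares the `companies` path segment; that tie is the false match shown in the listing.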
Another example:
```text
seq1 (matched 1) = api/v1/haha/456/anotherpath2

seq1 (incoming uri) = api/v1/haha/456/actualpath1

seq2 (matched 2) = api/v1/haha/123/actualpath1
```
This can be mitigated by using NLTK to detect which tokens are likely to be parameters. This is not implemented yet.

The general rule is: do not trust a template that has size 1 and no params identified, since it is likely to be a false classification.

TODO: Add postprocessing for such single templates (they stay single because the algorithm prefers the correct template that has a param count):
if and only if a template has size 1, has no params, and is almost identical to another template, merge them.
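
The postprocessing rule in the TODO above could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Template` class, the `merge_singletons` function, and in particular the reading of "almost identical" as "another template with at least one wildcard matches the single template at every fixed position". The project may define these differently.

```python
from dataclasses import dataclass
from typing import List, Tuple

PARAM = "<:VAR:>"  # wildcard marker used in the templates above


@dataclass
class Template:  # hypothetical stand-in for a cluster record
    tokens: Tuple[str, ...]
    size: int  # number of URIs matched to this template


def covers(general: Tuple[str, ...], specific: Tuple[str, ...]) -> bool:
    """True if `general` matches `specific`, treating PARAM as a wildcard."""
    if len(general) != len(specific):
        return False
    return all(g == PARAM or g == s for g, s in zip(general, specific))


def merge_singletons(templates: List[Template]) -> List[Template]:
    """Fold size-1, param-free templates into a covering template.

    Note: the `size` of the covering template is updated in place.
    """
    def is_suspect(t: Template) -> bool:
        return t.size == 1 and PARAM not in t.tokens

    kept = [t for t in templates if not is_suspect(t)]
    for suspect in (t for t in templates if is_suspect(t)):
        target = next(
            (k for k in kept
             if PARAM in k.tokens and covers(k.tokens, suspect.tokens)),
            None,
        )
        if target is None:
            kept.append(suspect)  # nothing similar enough, keep it as is
        else:
            target.size += suspect.size
    return kept


templates = [
    Template(("api", "v1", "companies", PARAM), 9),
    Template(("api", "v1", "companies", "xyz123"), 1),
    Template(("api", "v2", "health"), 1),
]
for t in merge_singletons(templates):
    print("/".join(t.tokens), t.size)
```

A single template with no covering neighbour (such as `api/v2/health` here) is left untouched, which keeps genuine parameter-free endpoints from being merged away.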

### Integration
This project relies on gRPC to communicate with the Apache SkyWalking AI pipeline. The gRPC service definition can be found
in the `server/proto/` folder.

Compile the proto files by running `make gen`, or simply `make env` if you are starting from a bare environment.
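
For orientation, `make gen` invokes `python -m tools.grpc_gen`. The contents of that module are not shown here, so the following is only a sketch of what a generator of this kind typically does with the `grpcio-tools` package. The function names and the output directory are hypothetical.

```python
from pathlib import Path
from typing import List


def build_protoc_args(proto_dir: str, out_dir: str) -> List[str]:
    """Assemble the argument list for grpc_tools.protoc.main.

    The first element plays the role of argv[0] (the program name).
    """
    proto_files = sorted(str(p) for p in Path(proto_dir).glob("*.proto"))
    return [
        "grpc_tools.protoc",
        f"-I{proto_dir}",
        f"--python_out={out_dir}",
        f"--grpc_python_out={out_dir}",
        *proto_files,
    ]


def generate(proto_dir: str = "server/proto",
             out_dir: str = "server/proto/generated") -> int:
    """Compile every .proto file in proto_dir; returns protoc's exit code.

    Assumption: out_dir is a made-up location chosen for this sketch.
    """
    from grpc_tools import protoc  # provided by the grpcio-tools package

    args = build_protoc_args(proto_dir, out_dir)
    if len(args) == 4:  # only the fixed arguments, no .proto files found
        raise FileNotFoundError(f"no .proto files in {proto_dir}")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return protoc.main(args)


# Usage (requires grpcio-tools and a populated proto folder):
#   exit_code = generate()
```

Keeping the argument assembly separate from the `protoc` call makes the generator easy to check without having `grpcio-tools` installed.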

### TODO
Add try/except statements to handle uncovered algorithm errors

2 changes: 1 addition & 1 deletion demo/Endpoint100_trivial.txt
@@ -41,7 +41,7 @@
/api/v999/orders/def456def456def456def456def456def456def456def456def456def456def456def456
/api/v999/orders/abcxyzdef456def456def456def456def456
/api/v999/orders/123abc456
/api/v777/orders/123abc456
/api/v777/orders/123abc456-thing-only-appear-once-must-be-handled-by-nltk-in-v2
/api/v1/accounts/123
/api/v1/invoices/abc
/api/v1/accounts/xyz
17 changes: 11 additions & 6 deletions demo/Endpoint200_hard.txt
@@ -3,7 +3,7 @@
/api/v1/companies/789
/api/v1/companies/abc
/api/v1/companies/def
/api/v1/companies/ghi
/api/v1/companies/ghig
/api/v1/companies/123abc
/api/v1/companies/xyz123
/api/v1/companies/abc456
@@ -17,7 +17,7 @@
/api/v1/projects/789
/api/v1/projects/abc
/api/v1/projects/def
/api/v1/projects/ghi
/api/v1/projects/ghibb
/api/v1/projects/123abc
/api/v1/projects/xyz123
/api/v1/projects/abc456
@@ -86,17 +86,21 @@
/customer/hahaha0917/profile/121033/compare/hahaha0297/profile/12105
/customer/hahaha0927/profile/121
/api/v1/users/123/posts/456/comments/789
/api/v1/users/abc/posts/def/comments/ghi
/api/v1/users/abc/posts/def/comments/ghihaha
/api/v1/companies/123/employees/456/reviews/789
/api/v1/companies/abc/employees/def/reviews/ghi
/api/v1/companies/abc/employees/def/reviews/9971029019
/api/v1/companies/123/tasks/456/assignees/789
/api/v1/companies/abc/tasks/def/assignees/ghi
/api/v1/companies/abc/tasks/def/assignees/ghibbb
/this-is-illegalstyleendpoint/1112/1231313/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/this-is-illegalstyleendpoint/1112/1231313/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/this-is-illegalstyleendpoint/1112/1231313/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/this-is-illegalstyleendpoint/2222/2222222/a-should-have-one-cluster-each-row-need-special-handling-else-explodes
/api/v1/users/123/posts/456/comments
/api/v1/users/789/posts/321/comments
/api/v1/users/abc/posts/def/comments
/api/v1/users/xyz/posts/mno/comments
/api/v1/users/101/posts/102/comments
/api/v1/users/ghi/posts/jkl/comments
/api/v1/users/g213hi/posts/jkl/comments
/api/v1/users/pqr/posts/stu/comments
/api/v1/users/111/posts/222/comments
/api/v1/users/333/posts/444/comments
@@ -159,4 +163,5 @@ google.com/api/v1/users/123
tmall.cn/api/v1/users/904
special.helloword.com/api/v1/users/123ada
top1.abc.example.com.net.cn/api/v1/users/badwdw
top1.abc.example.com.net.cn/api/v1/users/this-url-is-special-since-domain-hasdigits
GET:/api/v2/users/1222222223/similar-to-these-single-occurrence-is-not-handled-correctly-will-be-handled-by-nltk!!!!!