-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement sleep failpoint status #47
base: master
Are you sure you want to change the base?
Conversation
Incorporated the following changes:
There was one comment about using subtests. I will investigate it. |
runtime/http.go
Outdated
fp := key[:len(key)-len("/count")] | ||
_, count, err := Status(fp) | ||
if err != nil { | ||
http.Error(w, "failed to GET: "+err.Error(), http.StatusNotFound) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErrNoExist
is not the only error that Status can return. Please properly handle errors to avoid separate 400 from 500.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand comment especially the part of separate 400 and 500. For error handling I have changed code to output a specific error and fallback to a "catch-all" 500 generic error (link)
runtime/termscounter_test.go
Outdated
fp: "ExampleString", | ||
desc: `10*sleep(10)->1*return("abc")`, | ||
runafter: 12, | ||
want: 11, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 11? Maybe it was answered on previous PR, but please leave at least a comment explaining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated with latest code with a comment. Please review.
runtime/termscounter_test.go
Outdated
|
||
for i := 0; i < tc.runbefore; i++ { | ||
exampleFunc() | ||
time.Sleep(10 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why sleep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why this was included by the original author. I have removed it.
runtime/termscounter_test.go
Outdated
fp string | ||
desc string | ||
newdesc string | ||
runbefore int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runbefore int | |
runOldTerm int |
runtime/termscounter_test.go
Outdated
testcases := []struct { | ||
name string | ||
fp string | ||
desc string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
desc string | |
failpointTerm string |
runtime/termscounter_test.go
Outdated
{ | ||
name: "Inbetween Enabling Failpoint", | ||
fp: "ExampleString", | ||
desc: `10*sleep(10)->1*return("abc")`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One interesting followup would be to add a panic
term, so we know the behavior of counter if user of gofail recovers the panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a panic, gofail does not recover from it. It fails the test. I haven't included the change in the update.
--- FAIL: TestTermsCounter/Inbetween_Enabling_Failpoint2 (0.02s)
panic: failpoint panic: ExampleString [recovered]
panic: failpoint panic: ExampleStringgoroutine 24 [running]:
testing.tRunner.func1.2({0x12a7000, 0xc0000a1440})
/usr/local/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1399 +0x39f
panic({0x12a7000, 0xc0000a1440})
/usr/local/go/src/runtime/panic.go:884 +0x212
go.etcd.io/gofail/runtime.actPanic(0xc0000a2d40)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:326 +0xe7
go.etcd.io/gofail/runtime.(*term).do(...)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:294
go.etcd.io/gofail/runtime.(*terms).eval(0xc0000a6640)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:108 +0x109
go.etcd.io/gofail/runtime.(*Failpoint).Acquire(0x0?)
/Users/prasadc/dev/repo/gofail/p1/runtime/failpoint.go:38 +0xa5
go.etcd.io/gofail/runtime_test.exampleFunc()
/Users/prasadc/dev/repo/gofail/p1/runtime/termscounter_test.go:145 +0x25
go.etcd.io/gofail/runtime_test.TestTermsCounter.func1(0xc000132340)
/Users/prasadc/dev/repo/gofail/p1/runtime/termscounter_test.go:78 +0x106
testing.tRunner(0xc000132340, 0xc0000a1350)
/usr/local/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL go.etcd.io/gofail/runtime 0.497s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gofail doesn't recover panics, but the client using the library might wrap the gofail code in recovery.
@serathius Thanks for your input. I have incorporated the review comments. Please review. |
Please squash & signoff the commits, I will take a look later. |
a5fc0ef
to
b54c139
Compare
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please also update README.md. There is also design.md, but I didn't find appropriate section there, maybe Step 3.
|
||
} | ||
|
||
func exampleFunc() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you add some comments about this test setup. My understanding is that exampleFunc
mimics code generated by gofail and __fp_ExampleString
mimics fail.go
file. This way you don't need to actually run gofail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a comment for this function and the variable that together mimics the actions of code package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need exampleFunc in this test file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It simulates the triggering of the failpoint. We can achieve the same thing by a combination of "envTerms" and "Acquire()" as seen in failpoint_test.go. I can remove and replace if that is the preferred way.
runtime/termscounter_test.go
Outdated
// This refers to the number of term executions and is | ||
// independent of caller execution frequency. E.g., Run | ||
// failpoint actions only 11 times even if we have a greater | ||
// number of callsite executions (12 in this case) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this explains why run failpoint actions
!= callsite executions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, the reason is to qualify the number of times the failpoint actions happen. So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict. Since the original author is not active I am not sure if this is indeed the reason. Maybe @ahrtr can comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason is to qualify the number of times the failpoint actions happen
Yes, we want to count number of times failpoint i executed.
So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict
Don't understand
Please double check the reason and document it as a comment. We shouldn't merge code that we don't understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that 10*sleep(10)->1*return("abc")
is called a chain
(split on ->
). Gofail will try to execute terms in chain until one succeeds. Terms have mods
: count, percentage, list. In our example first term sleep(10)
will be executed 10 times (count mod) and then second term return("abc")
will be allowed to execute 1 time.
That's why we get 11 executions. Docs aren't great around this behavior and I'm not sure chains are useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict
Don't understand
This is implemented by mods(code). The terms used in the example (10*sleep(10)->1*return("abc")
) have mods in them. You can specify a floating point probabilityand the term actions will be executed only % of callsite executions. If you remove mods i.e., not have int or float in the terms, then the number of term executions will equal callsite executions. This example tests mods and so the number of callsite executions != failpoint actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Don't worry about explaining this to me. Just improve the comment to explain the test case as accurate as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the comment. Please check.
9e448e5
to
f7a29be
Compare
I am investigating failures because of cpu flag in go test. Will update with the fix. |
runtime/termscounter_test.go
Outdated
"go.etcd.io/gofail/runtime" | ||
"testing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"go.etcd.io/gofail/runtime" | |
"testing" | |
"testing" | |
"go.etcd.io/gofail/runtime" |
runtime/termscounter_test.go
Outdated
// This variable mimics the code generated by gofail code package. | ||
// This works in tandem with exampleFunc function. | ||
var __fp_ExampleString *runtime.Failpoint //nolint:stylecheck | ||
|
||
// check if failpoint is initialized as gofail | ||
// tests can clear global variables of runtime packages | ||
func initFP() { | ||
if __fp_ExampleString == nil { //nolint:stylecheck | ||
__fp_ExampleString = runtime.NewFailpoint("ExampleString") //nolint:stylecheck | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why need these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two generic comments:
- As @lavacat mentioned, the design doc (and probably readme) need to be updated.
I will update this in the next checkin.
- Please also consider to setup the e2e test (Add e2e test #40), to verify the functionality end to end instead of just unit test; but of course, can be in a separate PR.
Sure, I will look at creating a e2e setup in a separate PR
|
||
} | ||
|
||
func exampleFunc() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need exampleFunc in this test file?
@@ -41,6 +41,8 @@ type terms struct { | |||
|
|||
// mu protects the state of the terms chain | |||
mu sync.Mutex | |||
// counts executions | |||
counter int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be a field/attribute of Failpoint, and we need to clearly define what does counter exactly mean. There are two options:
- The count of the failpoint execution, no matter the terms are evaluated or not.
- The count of the failpoint execution, only include the case that the terms are really evaluated. For example,
40.0%return(true)
means 40% possibility to returntrue
. If it returns false (falls into the other 60% possibility), then we don't increment thecounter
.
Your current implementation should be the second option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the definition of counter. Regarding making it a field of Failpoint, I have the following concerns
- There are number of similar fields that are defined in terms rather than in failpoint (e.g., fpath, desc), should we standardize all, rather than moving just one field into Failpoint ?
- This will involve changes to parsing code, get Failpoint lock and reference in terms.go.
Should these be addressed as a refactor in a separate PR ?
Two generic comments:
|
Signed-off-by: Ramil Mirhasanov <ramil600@yahoo.com> Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com> Co-authored-by: Marek Siarkowicz <marek.siarkowicz@protonmail.com> Co-authored-by: Benjamin Wang <wachao@vmware.com>
4b31d16
to
4e2ac03
Compare
|
ping @pchan |
I dropped this, will get it done by the end of this month. |
That's great, thanks. Can you take a look into one issue I found in etcd-io/etcd#16776? Status endpoint works for failpoints that were setup using HTTP endpoint. However, they don't seem if they were setup up using environment variables. |
@pchan any progress? @ZhouJianMS would you be interested in looking into this? |
This issue is due to "beforeApplyOneConfChange" not hit. I have tested with "raftBeforeSave" as an environment variable failpoint and it works. |
This again became even more important for etcd-io/etcd#17680 |
This PR is based out of @ramil600 PR #37 and is intended to address etcd-io/etcd#14729.
The current status is to update Ramil's code to run it successfully. Further updates will require will incorporate changes requested in #37 (comment)
cc: @ramil600 @ahrtr @serathius @lavacat