Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test homeserver restarts during a partial state join #378

Closed

Conversation

squahtx
Copy link
Contributor

@squahtx squahtx commented May 24, 2022

Sean Quah added 4 commits May 24, 2022 19:03
…serve requests

Signed-off-by: Sean Quah <seanq@matrix.org>
…hey will change on restart

Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
Comment on lines +222 to +229
HostIP: "127.0.0.1",
HostPort: strconv.Itoa(port1),
},
},
nat.Port("8448/tcp"): []nat.PortBinding{
{
HostIP: "127.0.0.1",
HostIP: "127.0.0.1",
HostPort: strconv.Itoa(port2),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must provide an explicit port, otherwise we'll get a different random port when the container is restarted. Which breaks everything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it break everything?

Copy link
Contributor Author

@squahtx squahtx Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseURL and FedBaseURL of HomeserverDeployment become incorrect because they include the port number. We can fix up those URLs. But any CSAPI clients will continue to use the previous port number after the restart and that's not so easy to fix up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a reasonably easy thing to fix? There's several possible solutions:

  • Add documentation to Restart() which explicitly states that any clients created with this deployment need to be recreated. Or...
  • Add a callback hook to client.CSAPI to update the base URL and get test authors to call that callback with the updated URL. Or...
  • Remember all client.CSAPI instances created via func (d *Deployment) Client(t *testing.T, hsName, userID string) *client.CSAPI and automatically call the callback hook when the deployment is restarted to re-point port numbers.

The last option is the most preferable because:

  • All the complexity is in one place and does not need to be repeated in tests.
  • Test authors don't need to know or care that the port number changes from under them when they call deployment.Restart().

In addition, I would change the function signature of restart from:

func (dep *Deployment) Restart() error

to

func (dep *Deployment) Restart(t *testing.T)

which will fail the test if the restart fails (returns an error).

At present, this PR cannot be accepted because it will introduce flakiness because these port numbers will race: we have to let the underlying OS decide the ports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split the PR and made the changes in #396. Let me know if I've missed anything.

}()

// wait for the state_ids request to arrive
psjResult.AwaitStateIdsRequest(t)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AwaitStateIdsRequest cannot be used twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to drop the second usage rather than try to rejig things.

@squahtx squahtx requested review from a team and erikjohnston May 27, 2022 15:22
@squahtx
Copy link
Contributor Author

squahtx commented May 27, 2022

@matrix-org/dendrite-core The first three commits touch some core complement machinery and might be worth taking a look at?

@squahtx
Copy link
Contributor Author

squahtx commented May 27, 2022

@erikjohnston I made one small change: c0caa5e

@kegsay kegsay self-requested a review May 31, 2022 14:55
@kegsay kegsay self-assigned this May 31, 2022
Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main problem with this lies with the usage of an explicit port. We cannot guarantee this as:

  • Tests in the same file can run in parallel via .Parallel()
  • Tests in different directories can run in parallel (modifiable via go test -p N)

We need to preserve that ability to ensure we can grow the test case library without suffering increasingly long test times, so we cannot add features which would compromise that. I've asked for a bit more information than just "which breaks everything".

Comment on lines +222 to +229
HostIP: "127.0.0.1",
HostPort: strconv.Itoa(port1),
},
},
nat.Port("8448/tcp"): []nat.PortBinding{
{
HostIP: "127.0.0.1",
HostIP: "127.0.0.1",
HostPort: strconv.Itoa(port2),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it break everything?

@@ -391,6 +348,39 @@ func deployImage(
return d, nil
}

// Picks two free ports on localhost. Does not reserve them in any way.
// The returned ports must be used before the next call to `allocateHostPorts`,
// otherwise the same pair of ports may be returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot guarantee that as homeservers are deployed in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I didn't realise that we ran tests in parallel at the time.

// The returned ports must be used before the next call to `allocateHostPorts`,
// otherwise the same pair of ports may be returned.
func allocateHostPorts() (int, int, error) {
localhost_any_port := net.TCPAddr{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no snake_case please. Be consistent with existing code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

internal/docker/deployer.go Show resolved Hide resolved
@squahtx
Copy link
Contributor Author

squahtx commented Jun 14, 2022

This PR is accumulating too many non-test changes, so I'm splitting it into #396 and #397.

@squahtx squahtx closed this Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants