
multi: introduce status server and sub-server manager #442


Closed
wants to merge 17 commits (from the statusServer branch)

Conversation

Member

@ellemouton ellemouton commented Oct 28, 2022

This PR does a few things:

  1. Adds a new subServerMgr that handles the connections to Loop, Pool & Faraday. This will also make it easy for us to add another sub-server in the future. This is also nice because it makes it easier in the next steps to treat these sub-servers, and any errors we may get from them, differently from how we would treat errors from LND or LiTd.
  2. Adds a new Status server that is "always-on". This will allow the UI to query info about the status of any of LiT's sub-servers (including LND, Loop, Pool, Faraday and LiT itself).
  3. Does a big refactor so that Litd never shuts down unless requested to via a shutdown signal. This ensures that the Status server can always be reached.
$ litcli status 

{
        "sub_servers": {
                "faraday": {
                        "running": true,
                        "error": ""
                },
                "lit": {
                        "running": true,
                        "error": ""
                },
                "lnd": {
                        "running": true,
                        "error": ""
                },
                "loop": {
                        "running": true,
                        "error": ""
                },
                "pool": {
                        "running": false,
                        "error": "could not start pool: <something something>"
                }
        }
}
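The sub-server manager from point 1 could be sketched, very roughly, like this. All names here (`SubServer`, `subServerMgr`, `startAll`) are illustrative guesses for the shape of the abstraction, not the PR's actual code:

```go
package main

import "fmt"

// SubServer is a hypothetical sketch of the abstraction the PR describes:
// one interface over Loop, Pool and Faraday, so that adding another
// sub-server later is a matter of implementing this interface.
type SubServer interface {
	Name() string
	Remote() bool
	Start() error
}

// subServerMgr tracks the running/error state that `litcli status` reports.
type subServerMgr struct {
	servers []SubServer
	status  map[string]error // nil error means "running".
}

// startAll starts each sub-server and records its start error (if any)
// instead of tearing down the whole process.
func (m *subServerMgr) startAll() {
	for _, s := range m.servers {
		m.status[s.Name()] = s.Start()
	}
}

// poolServer is a stand-in implementation that fails to start, mirroring
// the example output in the PR description.
type poolServer struct{}

func (p *poolServer) Name() string { return "pool" }
func (p *poolServer) Remote() bool { return false }
func (p *poolServer) Start() error { return fmt.Errorf("could not start pool") }

func main() {
	mgr := &subServerMgr{
		servers: []SubServer{&poolServer{}},
		status:  make(map[string]error),
	}
	mgr.startAll()
	for name, err := range mgr.status {
		fmt.Printf("%s running=%v err=%v\n", name, err == nil, err)
	}
}
```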

TODO:

  • documentation explaining the difference (and reasoning) between the lnd 10009 (gRPC only) and the litd 8443
  • More itests to cover the startup process

Fixes #421
Fixes #420
Fixes #290

@ellemouton ellemouton marked this pull request as draft October 28, 2022 20:37
@ellemouton ellemouton force-pushed the statusServer branch 4 times, most recently from 5d1ca50 to 940e413 on October 31, 2022 17:29
@ellemouton ellemouton changed the title [WIP] multi: introduce status server and sub-server manager multi: introduce status server and sub-server manager Oct 31, 2022
@ellemouton ellemouton marked this pull request as ready for review October 31, 2022 17:29
@ellemouton ellemouton changed the title multi: introduce status server and sub-server manager [WIP] multi: introduce status server and sub-server manager Nov 2, 2022
@ellemouton ellemouton requested a review from guggero November 2, 2022 11:02
@ellemouton
Member Author

@guggero - as discussed offline, I have requested your review here but am mainly looking for approach comments to begin with. And also, no rush on this as I will be afk for the next 2 weeks

Member

@guggero guggero left a comment


Very nice work 💯 This is an important step towards making it possible to unlock lnd from the UI and also to make everything more user friendly.
Will need to do a second pass with a focus on security but looks pretty good for its WIP status!

config.go Outdated
@@ -150,6 +150,9 @@ type Config struct {
LetsEncryptDir string `long:"letsencryptdir" description:"The directory where the Let's Encrypt library will store its key and certificate."`
LetsEncryptListen string `long:"letsencryptlisten" description:"The IP:port on which LiT will listen for Let's Encrypt challenges. Let's Encrypt will always try to contact on port 80. Often non-root processes are not allowed to bind to ports lower than 1024. This configuration option allows a different port to be used, but must be used in combination with port forwarding from port 80. This configuration can also be used to specify another IP address to listen on, for example an IPv6 address."`

TLSCertPath string `long:"tlscertpath" description:"Path to write the self signed TLS certificate for LiT's RPC and REST proxy service (if Let's Encrypt is not used)."`
Member

Do we need to explicitly mention this only applies to the HTTP(S)Listen port (default 8443)? Just to distinguish it more from the 10009 lnd RPC port.

@@ -280,9 +280,9 @@ func (c *Config) lndConnectParams() (string, lndclient.Network, string,
func defaultConfig() *Config {
return &Config{
HTTPSListen: defaultHTTPSListen,
TLSCertPath: DefaultTLSCertPath,
Member

We should also update doc/config-lnd-integrated.md and perhaps explain the difference (and reasoning) between the lnd 10009 (gRPC only) and the litd 8443 (grpc-web, REST, gRPC proxy, UI) a bit more explicitly?
Also this change should be made very explicit in the release notes (not sure where to put the reminder for us).

Member

Yeah good call here, as it would also be a breaking change (CLI/config wise).

rpc_proxy.go Outdated
// isHandling checks if the specified request is something to be handled by lnd
// or any of the attached sub daemons. If true is returned, the call was handled
// by the RPC proxy and the caller MUST NOT handle it again. If false is
// returned, the request was not handled and the caller MUST handle it.
func (p *rpcProxy) isHandling(resp http.ResponseWriter,
req *http.Request) bool {

if !p.hasStarted() {
return false
Member

I'm not sure if returning false here is a good idea, as the request will just be passed to the next handler. Instead I think we should check whether it's a request we would handle (the same conditions as we check below) and return an error page (perhaps 503 service unavailable) if we did and only false if we didn't?

Member Author

good point!

terminal.go Outdated
@@ -1437,6 +1450,20 @@ func (g *LightningTerminal) showStartupInfo() error {
return nil
}

// connectLND sets up LiT's LND connection.
func (g *LightningTerminal) connectLND(bufListener *bufconn.Listener) error {
Member

Could turn this into a function by taking the cfg as a parameter and returning the conn. Then we could place this function and the dialBackend() and dialBufConnBackend() into a different file to reduce the size of this one a bit. Maybe lnd_connection.go or something like that?

Member Author

sounds gooood

@@ -168,21 +166,9 @@ type rpcProxy struct {
}

// Start creates initial connection to lnd.
func (p *rpcProxy) Start() error {
func (p *rpcProxy) Start(lndConn *grpc.ClientConn) error {
Member

👍

// Start the main web server that dispatches requests either to the
// static UI file server or the RPC proxy. This makes it possible to
// unlock lnd through the UI.
if err := g.startMainWebServer(); err != nil {
Member

Since we start the main web server this early, we have to guard g.restHandler from being called if it's nil. Otherwise a REST request coming in too early would panic. Or we move the call to createRESTProxy() to before this call even.

Member

Hmm, no, creating the REST proxy before the main web server is started will probably not work since the REST proxy will attempt to connect to the main web server on creation.

Member

That makes me wonder if we could extract the main web server into its own struct that owns the REST proxy and RPC proxy (and probably also the status server)? Then the start/stop would be a bit more abstracted as well. But not sure how large of a diff that is.

Member Author

good catch!
Ok so it looks like, for now, moving the createRESTProxy() call to before the startMainWebServer call does actually do the trick. Makes sense because it has always been called in that order.

state_server.go Outdated
},
}

options := []grpcweb.Option{
Member

Do we really need to spin up our own grpc-web and gRPC server? Can't we instead have this as a sub component of the rpcProxy that already does all of this for us?

Member Author

yeah good point

Member Author

Just realised why I did it this way initially - the reason is that if we fail to start LND, then we never get to the registration and serving of the status server. So I think we need to re-add this actually

state_server.go Outdated

// gRPC web requests are easy to identify. Send them to the gRPC
// web proxy.
if s.grpcWebProxy.IsGrpcWebRequest(req) ||
Member

Again, it feels like this should just be hooked up in the right order to allow the existing rpcProxy to do this.

"requests, lnd possibly still starting or " +
"syncing")
}
handled, err := g.subServerMgr.ValidateMacaroon(
Member

We'll need to make sure we don't create any authentication bypasses by accident, for example when starting up the individual daemons. Maybe adding a few extra itests around the whole startup process would be a good idea.

Member Author

@ellemouton ellemouton left a comment

thanks for the initial review @guggero!

I have addressed most of your comments.

I just still need to add the extra documentation & also add more itests around the startup process. Will re-request once I have done that.


terminal.go Outdated
if err != nil {
log.Errorf("Could not start subservers: %v", err)
log.Errorf("Could not set up LND clients: %v", err)
Member Author

even for logs? thought it only mattered for error variables


@ellemouton ellemouton changed the title [WIP] multi: introduce status server and sub-server manager multi: introduce status server and sub-server manager Dec 3, 2022
@ellemouton ellemouton force-pushed the statusServer branch 4 times, most recently from cb7f9ab to ec46889 on December 10, 2022 07:12
@lightninglabs-deploy

@ellemouton, remember to re-request review from reviewers when ready

@ellemouton
Member Author

!lightninglabs-deploy mute

@ellemouton
Member Author

ellemouton commented Jan 25, 2023

muting for the time being. Will pick up again once I have more room on my plate.

Although it would be cool to get back to this soon, as I think it could make the addition of taro easier

@@ -280,9 +280,9 @@ func (c *Config) lndConnectParams() (string, lndclient.Network, string,
func defaultConfig() *Config {
return &Config{
HTTPSListen: defaultHTTPSListen,
TLSCertPath: DefaultTLSCertPath,
Member

Yeah good call here, as it would also be a breaking change (CLI/config wise).

rpc_proxy.go Outdated
@@ -165,6 +166,9 @@ type rpcProxy struct {

grpcServer *grpc.Server
grpcWebProxy *grpcweb.WrappedGrpcServer

started bool
Member

Could use an atomic bool alternatively.
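A minimal sketch of that atomic-bool alternative, using `sync/atomic.Bool` (Go 1.19+); the field and method names here are illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rpcProxy sketches a started flag that is safe for concurrent access
// without a mutex, via sync/atomic.Bool.
type rpcProxy struct {
	started atomic.Bool
}

// Start flips the flag; safe to call from any goroutine.
func (p *rpcProxy) Start() { p.started.Store(true) }

// hasStarted reads the flag without locking.
func (p *rpcProxy) hasStarted() bool { return p.started.Load() }

func main() {
	p := &rpcProxy{}
	fmt.Println(p.hasStarted()) // false: zero value of atomic.Bool
	p.Start()
	fmt.Println(p.hasStarted()) // true
}
```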

Member

Still getting thru the diff, but could also use an event type to allow callers to get notifications for when things are fully up vs polling.
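The event-type idea could be sketched with a close-once channel instead of a polled flag: callers block on the channel and all unblock when it is closed. Names are assumptions for illustration:

```go
package main

import "fmt"

// rpcProxy sketches an event-based readiness signal: a channel that is
// closed exactly once when the proxy is fully up.
type rpcProxy struct {
	readyChan chan struct{}
}

func newRpcProxy() *rpcProxy {
	return &rpcProxy{readyChan: make(chan struct{})}
}

// Start signals readiness by closing the channel; every current and
// future receiver unblocks immediately.
func (p *rpcProxy) Start() { close(p.readyChan) }

// WaitUntilReady blocks until Start has been called, with no polling.
func (p *rpcProxy) WaitUntilReady() { <-p.readyChan }

func main() {
	p := newRpcProxy()
	go p.Start()
	p.WaitUntilReady()
	fmt.Println("proxy ready")
}
```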

Member Author

The hasStarted thing is more for the sake of external callers who are making calls to Litd before it has fully started up.

rpc_proxy.go Outdated
log.Infof("Handling gRPC web request: %s", req.URL.Path)
p.grpcWebProxy.ServeHTTP(resp, req)
if !p.hasStarted() {
resp.WriteHeader(http.StatusServiceUnavailable)
Member

Why this vs a gRPC status/error like we do on the lnd side?

rpc_proxy.go Outdated
p.grpcServer.ServeHTTP(resp, req)
if !p.hasStarted() {
resp.WriteHeader(http.StatusServiceUnavailable)
_, _ = resp.Write(make([]byte, 0))
Member

The comment above may apply even more so here. Another route would be an HTTP middleware chain (gRPC package) that checked this field and returned a response if things weren't fully active yet.

Member Author

good idea! will give it a spin

rpc_proxy.go Outdated
@@ -61,7 +56,7 @@ func (e *proxyErr) Unwrap() error {
// component.
func newRpcProxy(cfg *Config, validator macaroons.MacaroonValidator,
superMacValidator session.SuperMacaroonValidator,
permsMgr *perms.Manager, bufListener *bufconn.Listener) *rpcProxy {
permsMgr *perms.Manager) *rpcProxy {
Member

Arg passed but not used anywhere (in this commit at least).

Member Author

this diff removes an arg

subserver_mgr.go Outdated
s.cfg.onStartError(err)
return
}
s.integratedStarted = true
Member

Doesn't need a mutex to protect access?

Member Author

done
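For reference, a mutex-guarded version of such a flag might look like the following; the struct shape is a guess for illustration only:

```go
package main

import "sync"

// subServer sketches a mutex-protected integratedStarted flag, since the
// flag is written by the startup goroutine and read elsewhere.
type subServer struct {
	mu                sync.Mutex
	integratedStarted bool
}

// markStarted sets the flag under the lock.
func (s *subServer) markStarted() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.integratedStarted = true
}

// hasStarted reads the flag under the same lock.
func (s *subServer) hasStarted() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.integratedStarted
}

func main() {
	s := &subServer{}
	s.markStarted()
	_ = s.hasStarted()
}
```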

return
}

go func() {
Member

Should be tracked by a wait group.
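A wait-group-tracked startup goroutine could look roughly like this (illustrative names), so that Stop can wait for any in-flight startup work before returning:

```go
package main

import (
	"fmt"
	"sync"
)

// mgr sketches tracking an async start with a sync.WaitGroup.
type mgr struct {
	wg sync.WaitGroup
}

// startAsync launches start in a goroutine that the wait group tracks.
func (m *mgr) startAsync(start func()) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		start()
	}()
}

// Stop blocks until all tracked goroutines have finished.
func (m *mgr) Stop() { m.wg.Wait() }

func main() {
	m := &mgr{}
	m.startAsync(func() { fmt.Println("sub-server started") })
	m.Stop()
}
```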

faradayConn *grpc.ClientConn
loopConn *grpc.ClientConn
poolConn *grpc.ClientConn
lndConn *grpc.ClientConn
Member

Any reason lnd isn't also managed as a sub-server? I guess since it's sort of the "root" sub-server that all the rest depend on?

Member Author

I guess since it's sort of the "root" sub-server that all the rest depend on?

Yes, exactly. It is handled as a special case (see next commit) since Litd can't really continue operating if the start of, or connection to, LND fails

terminal.go Outdated
lndGrpc *lndclient.GrpcLndServices,
withMacaroonService bool) error {

return poolServer.StartAsSubserver(
Member

Re that comment re converting many of the closures to be an interface, given this is just a pass thru call (same args and return value), a wrapper struct that embeds all the main daemon configs would take care of this nicely.

Member Author

@ellemouton ellemouton Feb 16, 2023

changed to interfaces - turned out nicely I think

Contributor

@positiveblue positiveblue left a comment

Thank you for these changes @ellemouton , they'll make my life easier for integrating the taro daemon 💯

rpc_proxy.go Outdated
@@ -279,17 +298,27 @@ func (p *rpcProxy) isHandling(resp http.ResponseWriter,
if p.grpcWebProxy.IsGrpcWebRequest(req) ||
p.grpcWebProxy.IsGrpcWebSocketRequest(req) {

log.Infof("Handling gRPC web request: %s", req.URL.Path)
p.grpcWebProxy.ServeHTTP(resp, req)
if !p.hasStarted() {
Contributor

Should this function return false if p.hasStarted() is false?

Member Author

See previous discussion on this: #442 (comment)

We don't want to return false here because then Litd will pass it on to other services. We instead want to exit early

rpc_proxy.go Outdated

handled, subServerName := p.subServerMgr.HandledBy(requestURI)

switch {
Contributor

@positiveblue positiveblue Feb 13, 2023

I wonder if we can simplify this logic if it looked like:

// old checkSubSystemStarted now called
func (p *rpcProxy) hasRegisteredMatchingMethod(uri string) error {
        // HandledBy checks for status server, lnd, lit and all the
        // subservers.
        system, err := p.subServerMgr.HandledBy(uri)
        if err != nil {
                return fmt.Errorf("no matching method: %w", err)
        }

        // getSubServerState always returns true for status + the state
        // for the other subsystems.
        started, startErr := p.statusServer.getSubServerState(system)
        if !started {
                return fmt.Errorf("%s is not running: %s", system, startErr)
        }

        return nil
}

Member Author

but we do want to check LiT's other parts (LND and Lit) if the sub-server manager does not handle the request URI. So not sure I follow?

subserver_mgr.go Outdated
// onStartError will be called if there is a startup error (in
// integrated mode) or if the remote connection could not be made in
// remote mode.
onStartError func(err error)
Contributor

Do we plan onStartError, onStartSuccess to do more things in the future beyond setting the status for this subserver? If not we can directly set the status using subServerCfg.name

Member Author

great point - removed

subserver_mgr.go Outdated

// remote must be set if the sub-server is running remotely and so must
// not be started in integrated mode.
remote bool
Contributor

here remote == (remoteCfg != nil)?

Member Author

no - remoteCfg is always set by default.

subserver_mgr.go Outdated

// StartRemoteSubServers creates connections to all the manager's sub-servers
// that are running remotely.
func (s *subServerMgr) StartRemoteSubServers() {
Contributor

The func name can be more accurate if we use something like ConnectRemoteSubServers given that we are not starting anything.

Member Author

good point! done!

In this commit, we remove Lit's dependency on LND's tls cert in
integrated mode. With this commit, even if Lit is started in integrated
mode, it will create its own tls cert to use for serving the webserver.
In this commit, a new LitService proto is added and the StopDaemon
method is added to the service. This is added because, in an upcoming
commit, we want Litd to continue running even if LND cannot be
connected to, and so Lit will no longer shut down if LND's StopDaemon
method is called.
Add a `started` variable to the rpcProxy that is used to indicate if the
proxy is ready to handle requests. This is because currently the
webserver is dependent on the rpcProxy to start, and we want to be able
to start the webserver without being dependent on the rpcProxy so that
it can be used to handle status requests in a future commit. So with
this commit, we can now safely start the webserver earlier on, and then
if requests come through for the rpcProxy, an error will be displayed to
the user.
Remove the responsibility of creating an LND connection from the
rpcProxy and instead let the main LightningTerminal struct handle it.
All the lnd-connection specific functions are also moved into their own
file.
A pure refactor commit. It just splits various parts of the main Run
function out into helper functions in order to more clearly show the
steps taken to set up LiT.
Since all the relevant dependencies have been accounted for, we now move
the starting of the main webserver to earlier on in the Run function.
This is in preparation for later when the webserver will be "always-on"
and will be used to serve status information about the rest of the LiT
subservers.
Add the protos for a new Status server that can be used to query the
status of a number of LiT's subservers.
In this commit, the various IsURI methods of the permissions manager are
removed and replaced with a single IsSubServerURI method that gets
passed both the URI and the subserver that should be checked.
This will greatly simplify implementing a subserver manager in an
upcoming commit.
Define a SubServer interface and a subServer manager
With this commit, we now also track the start-up state of LiT and LiT's
connection to LND. These differ from the other sub-servers (loop, pool &
faraday) because a failure to start LiT or LND is fatal and so should
stop the rest of the start-process, however, we still want the webserver
to continue serving the new Status server so that the UI can query the
start-up status of LND and LiT. So: if any errors occur while
starting/connecting to LND, or while starting any of LiT's other
processes, then we throw an error but we don't kill the main LiT
process. The main process is only killed upon receiving a shutdown
signal.
@ellemouton
Member Author

ellemouton commented Feb 16, 2023

I need to switch away from this for a bit now so just making a note of some issues that the current version still has:

  • The itests seem to not exit
  • We currently construct the autopilot server and session server before getting to the registration of RPC servers. This means that if the setup of those fails, then we can't access the status server to see this. And the reason we ordered it this way is so that LND can register non-nil servers. So a bit of a catch-22 here...
  • I think that to solve the above, we might need to add a new gRPC server that just serves the status server, so that it is not dependent on the start-up of LND. See here

@ellemouton
Member Author

replaced by #516 & #541

@ellemouton ellemouton closed this May 3, 2023
Successfully merging this pull request may close these issues:

  • Show LND/LiT version mismatch error in Web UI
  • Show dependency issues in the UX
  • litd exits with code 0 when it stops due to non-0 lnd exit
5 participants