
[Rest Server] stop/start/genesis endpoints #454

Merged — 9 commits from client-service-api into main on Feb 24, 2022

Conversation

Contributor

@arun-koshy arun-koshy commented Feb 15, 2022

This is the first of a series of PRs that will implement the client service API described as part of the GDC Eng Deliverables.

The endpoints that are included in this PR are:

  • Stop: Stops authorities, deletes configurations & databases
    curl --location --request POST 'localhost:5000/debug/sui/stop'
  • Start: Starts servers with specified genesis configuration
    curl --location --request POST 'localhost:5000/debug/sui/start'
  • Genesis: Creates the genesis configuration. Run this before you start the server.
    curl --location --request POST 'localhost:5000/debug/sui/genesis' \
      --data-raw '{ "num_authorities": 4, "num_objects": 5 }'

Known Issues (for future PRs)

  • Need to refactor out sui_commands::genesis code and share it instead of copying it all into the rest server.
  • Look into using dynamic ports for port allocator

Contributor

@patrickkuo patrickkuo left a comment

I am not familiar with the use case, but IMHO start/stop/genesis shouldn't be exposed by the rest services; start/stop/genesis is for provisioning and starting/stopping the network, so it should be a server-side thing?

@@ -435,3 +436,24 @@ impl Config for NetworkConfig {
&self.config_path
}
}

pub struct PortAllocator {
Contributor

Let's move this to utils.rs?

Collaborator

@sblackshear sblackshear Feb 15, 2022

This utils.rs file is a code smell for me--I think modules should have a descriptive name and only contain functionality related to that name. I would be fine with (e.g.) network_utils, ports.rs, etc., but a vanilla utils is a bit much.

Contributor Author

I could move the Config trait into config.rs and rename utils.rs to network_utils.rs and move PortAllocator there?

Contributor

There is already a network_utils directory, by the way; you could consider having a port_allocator.rs there?

Contributor Author

Works for me, I will move it to the network_utils dir.

@arun-koshy
Contributor Author

I am not familiar with the use case, but IMHO start/stop/genesis shouldn't be exposed by the rest services; start/stop/genesis is for provisioning and starting/stopping the network, so it should be a server-side thing?

Agreed, I should have made note of this, but these endpoints are just for debugging. They will make it easier to restart the network when it's hosted on AWS, especially for anyone outside of our group who may be trialing it before or after GDC.

Post-GDC we can discuss whether we want to decouple the rest server from the network and have the two started separately, OR have the network genesis and start happen during startup of the rest server. The former makes more sense, since it would allow us to make changes to the rest server without having to bring down the network.

Contributor

@huitseeker huitseeker left a comment

Yay dropshot!

Thanks for this PR. Let me know if the long comments help plz :)

Comment on lines 156 to 183
for _ in 0..num_authorities {
    let (address, key_pair) = get_key_pair();
    let info = AuthorityPrivateInfo {
        address,
        key_pair,
        host: "127.0.0.1".to_string(),
        port: match port_allocator.next_port() {
            Some(port) => port,
            None => {
                return Err(HttpError::for_client_error(
                    None,
                    hyper::StatusCode::CONFLICT,
                    String::from(
                        "Could not create authority because there were no free ports",
                    ),
                ))
            }
        },
        db_path: PathBuf::from(format!("./authorities_db/{:?}", address)),
    };
    authority_info.push(AuthorityInfo {
        address,
        host: info.host.clone(),
        base_port: info.port,
    });
    authorities.insert(info.address, 1);
    network_config.authorities.push(info);
}
Contributor

Could we modularize the setup bits that are already in sui::genesis() for this?

Contributor Author

I agree, I should be calling into this code rather than copying the genesis code from sui_commands. I will do that refactoring in a separate PR if that is okay with you?

Comment on lines 406 to 415
fn sync_client_state(client_state: &mut ClientState<AuthorityClient>) -> Option<HttpError> {
    match cb_thread::scope(|scope| {
        scope
            .spawn(|_| {
                // synchronize with authorities
                let rt = Runtime::new().unwrap();
                rt.block_on(async move { client_state.sync_client_state().await })
            })
            .join()
    }) {
Contributor

@huitseeker huitseeker Feb 15, 2022

OK, here's my theory about what happened:

  1. You've set out to write a run-of-the-mill blocking function.
  2. As you've noted, ClientState::sync_client_state takes an exclusive reference to client_state, and you've tried to reproduce this in your function, taking a &mut ClientState<AuthorityClient>. [1]
  3. Then you've tried to spawn a thread to call the sync, but noticed it was requiring a 'static lifetime on the client_state, because it can't guarantee the reference it embarks will have a lifetime sufficient for the arbitrary lifetime of the thread. [2]
  4. So you've looked for an abstraction that limits the lifetime required by the thread, and found crossbeam's scoped threads, which lets you do a short-lived borrow of the authority client: cool!
  5. But then you got stuck on calling an async method from that thread, which makes you create an entirely new runtime.

If I'm correct about this, here are a few points:

  • on 1 & 2: if you're going to run a long-lived process that potentially never stops, it would be OK to actually require a long lifetime from the caller by taking a 'static reference, or taking a copy of the input you require ...
  • ... but asynchronous scoped threads aren't a thing, so your thread::spawn is blocking [3], and what you're looking at is a dedicated thread.
  • ... and you're immediately joining the thread, thereby blocking your main thread on the back end of that dedicated thread. 😅
  • as you can probably guess by now, you can dispense with the threading ceremony and just call this async function in your main thread. And you've already figured out how to do that!
fn sync_client_state(client_state: &mut ClientState<AuthorityClient>) -> Option<HttpError> {
    // synchronize with authorities
    let rt = Runtime::new().unwrap(); // TODO: remove or change to an expect
    let res = rt.block_on(async move { client_state.sync_client_state().await });
    res.err().map(|err| HttpError::for_client_error(
        None,
        hyper::StatusCode::FAILED_DEPENDENCY,
        format!("Sync error: {:?}", err),
    ))
}
  • left as a TODO to the author: do you need a blocking function here? If yes, why? If not, can you make this function shorter?

Footnotes

  1. In general, a commendable instinct, but read on.

  2. The point is you could return from our function, from which the spawned thread is borrowing some data, and the borrow checker is protecting you from a use after free here.

  3. Congrats on not going for the one crate out there that claims to do this, but is actually unsound.

Contributor Author

@arun-koshy arun-koshy Feb 16, 2022

I actually did see what you were describing in your theory above, and after being yelled at by the compiler sufficiently I was able to get it working (which may require some feedback in a future PR). However, the crossbeam scoped thread was to solve the following issue, which only happens at runtime:

thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like block_on) attempted to block the current thread while the thread is being used to drive asynchronous tasks.'

After hours of research online about starting a runtime within a runtime, this is the best I found. The only other option was NOT to start a runtime from within a runtime, which did not seem possible for our use case. But I am open to other approaches to solve this.
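
For reference, a minimal sketch (assuming a multi-threaded tokio runtime, with a generic future standing in for the sync call) of the two usual ways around this panic: await the future directly from the async context, or reuse the runtime you are already inside via block_in_place instead of constructing a nested one.

use tokio::runtime::Handle;

// (1) Preferred: the handler is already async, so just await the future --
//     no new Runtime, no block_on.
async fn sync_in_async(client_fut: impl std::future::Future<Output = ()>) {
    client_fut.await;
}

// (2) If a synchronous call path really must block, reuse the existing
//     runtime instead of building a nested one. block_in_place panics on the
//     current-thread runtime, hence the multi-threaded assumption.
fn sync_blocking(client_fut: impl std::future::Future<Output = ()>) {
    tokio::task::block_in_place(|| Handle::current().block_on(client_fut));
}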

Contributor

@huitseeker huitseeker Feb 16, 2022

Let's forget about runtimes in runtimes and my theories: what are you trying to do?
Step 0:

You've set out to write a run-of-the-mill blocking function.

If this is correct, why?

do you need a blocking function here? If yes, why? if not, can you make this function shorter?

This is the same question as the above.

Contributor Author

You were correct, I was doing too much. Thanks for the simplification/explanation.

Comment on lines 323 to 328
server_context.server_lock.store(true, Ordering::SeqCst);
thread::spawn({
    move || {
        rt.block_on(join_all(handles));
    }
});
Contributor

So here you are already in an asynchronous function, and you launch a thread by hand, pass it a manually-crafted runtime, and make it call an async function. [1] That async function happens to be a composite of the result of the futures you collected into handles, but that's the gist of it.

My point is: this is doing something very specific and extraordinary, what is the intent?

Footnotes

  1. We in general use multi-threaded Tokio runtimes that support scheduling on the OS's available threads for non-blocking tasks.
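
To illustrate the footnote, a sketch (not the PR's code; server_futures is a hypothetical stand-in for the futures collected into handles above): inside an async handler, the runtime that is already running can schedule the joined futures as a task, with no hand-made thread and no second Runtime.

use futures::future::join_all;

// Hand the collected futures to the runtime we are already on; must be
// called from within a tokio runtime. The returned JoinHandle can be stored,
// awaited later, or aborted.
fn spawn_servers(
    server_futures: Vec<impl std::future::Future<Output = ()> + Send + 'static>,
) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        join_all(server_futures).await;
    })
}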

Contributor Author

The intent was to run the authorities in a background thread and ideally then be able to kill that thread (and in turn kill the authorities) when the /sui/stop endpoint is called. I have not been able to figure out how to execute that whole flow properly in Rust.

I was hoping to use server_lock: Arc<AtomicBool> to accomplish this, but I cannot poll that resource for changes within the thread I spawned, because rt.block_on(join_all(handles)) blocks that thread. When I started considering starting another thread to wrap this thread and then poll there, I decided I was crazy, so I punted this until I could get some assistance 😅

Contributor

@huitseeker huitseeker Feb 16, 2022

Getting help when it feels like you're going crazy is the right move.

First, let's simplify the problem by setting aside cancellation for a minute.
You may want to look at tokio tasks (as opposed to bare async futures), and see if (considering my footnote above) they wouldn't correspond to your desires better than an explicit OS thread.

If they do, then let's look at cancellation:

The core idea is the same: use a message-passing channel to transmit your cancel signal, and make a wrapping task embark both the receiving end of the channel and the core task. The wrapping task can listen to both sources of signal (your task returning or the cancel signal) and act accordingly. Here's the doc on select, which includes a specific mention of the cancellation pattern.
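
A minimal sketch of that cancellation pattern (hypothetical names; work stands in for the composite authority future): a wrapping task selects over the core work and the receiving end of a oneshot channel, and the caller keeps the sender as the stop switch.

use tokio::sync::oneshot;

fn spawn_cancellable(
    work: impl std::future::Future<Output = ()> + Send + 'static,
) -> oneshot::Sender<()> {
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        tokio::select! {
            _ = work => { /* the work finished on its own */ }
            _ = cancel_rx => { /* cancel signal received; the work future is dropped */ }
        }
    });
    // A /stop handler would call `cancel_tx.send(())` (or simply drop the sender).
    cancel_tx
}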

Contributor Author

Was able to swap out the previous implementation with tokio tasks, and now I can properly kill the authorities and restart the server from within the endpoints. Thanks for the tips.
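
As a rough sketch of what that can look like (hypothetical type; the actual PR stores its handles in the server context): keep the JoinHandles returned by tokio::spawn, and have the stop endpoint abort them.

use tokio::task::JoinHandle;

struct AuthorityHandles {
    handles: Vec<JoinHandle<()>>,
}

impl AuthorityHandles {
    // Called from the /stop path: cancel every authority task at its next
    // await point and drop the handles.
    fn stop_all(&mut self) {
        for handle in self.handles.drain(..) {
            handle.abort();
        }
    }
}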

sui/src/utils.rs Outdated
Comment on lines 59 to 67
pub fn next_port(&mut self) -> Option<u16> {
    for port in self.next_port..65535 {
        if TcpListener::bind(("127.0.0.1", port)).is_ok() {
            self.next_port = port + 1;
            return Some(port);
        }
    }
    None
}
Contributor

So this is only valuable if you want to specifically reach for deterministic port allocation and are kinda-sorta hoping you'll find a clean-ish port occupation picture at the moment of call. [1] You should probably take a usize argument to let the caller tell you at which port number they'd like to start their port search.

But in general, the Linux API (and we run authorities on Linux) tells you that if you try to bind on port 0, you'll get a dynamic port allocation. This might be simpler?

Footnotes

  1. This might work in a container.
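
A sketch of the port-0 alternative, standard library only (note the OS-assigned port is only reserved while the listener is held open):

use std::net::TcpListener;

fn os_assigned_port() -> std::io::Result<u16> {
    // Binding to port 0 asks the OS to pick any free port; read it back
    // from the listener's local address.
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}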

Contributor Author

I will make a note of this and include it in the refactoring that goes along with modularizing genesis, to reduce the scope of this PR.

@velvia
Contributor

velvia commented Feb 19, 2022

Just a quick comment - for debug routes, wouldn't it be better to put them in a special debug namespace?

@sblackshear
Collaborator

@arun-koshy what's the status of this?

@arun-koshy
Contributor Author

@arun-koshy what's the status of this?

Met with Francois on Friday to go over his comments on the proper ways to do the async portion of this in Rust. Have the pseudocode for it, working on implementing it today. Should have it ready for review soon.

@oxade oxade mentioned this pull request Feb 22, 2022
Comment on lines +81 to +84
genesis_config_path: String::from("genesis.conf"),
wallet_config_path: String::from("wallet.conf"),
network_config_path: String::from("./network.conf"),
authority_db_path: String::from("./authorities_db"),
Contributor

This is brittle and somewhat panic-inducing: IIUC, if something goes wrong we blow up on a started server that the user has hit an endpoint of, not on starting the server itself. I would at least open an issue for the following:

  1. Instances of NetworkConfig, GenesisConfig, etc. should be created in the ServerContext constructor.
  2. That constructor should be where we hit all panics due to invalid or non-existent paths, and where basic validations happen (e.g. empty quorum).
  3. Ideally, those should re-use the config creation logic from the client (which should be made reusable).

Moreover, all of this setup is re-done from scratch on every endpoint hit. Is there a reason why this should happen?
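
A hedged sketch of the shape being suggested, with placeholder types standing in for the crate's GenesisConfig and NetworkConfig: parse and validate configuration once in a constructor, so a bad path fails when the server starts rather than inside whichever handler a user happens to hit.

use std::path::{Path, PathBuf};

// Placeholder config types for illustration only.
struct GenesisConfig { path: PathBuf }
struct NetworkConfig { path: PathBuf }

struct ServerContext {
    genesis_config: GenesisConfig,
    network_config: NetworkConfig,
}

impl ServerContext {
    fn new(genesis: &Path, network: &Path) -> Result<Self, String> {
        // Fail fast at construction time instead of panicking per request.
        for path in [genesis, network] {
            if !path.exists() {
                return Err(format!("missing config file: {}", path.display()));
            }
        }
        Ok(Self {
            genesis_config: GenesisConfig { path: genesis.to_path_buf() },
            network_config: NetworkConfig { path: network.to_path_buf() },
        })
    }
}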

Contributor Author

We are only exposing these endpoints because we plan on hosting instances of the rest server for game developers, which means that if they want to reset the Sui network the rest server is using, they need to be able to do so using the endpoints. In the long term none of this will matter, because we shouldn't be starting the Sui network from within the rest server anyway. I can open an issue discussing the future design of the rest server post-GDC.

Moreover, all of this setup is re-done from scratch on every endpoint hit. Is there a reason why this should happen?

Because the server context is instantiated once here, all other changes to the server context happen via the endpoints. There are checks in the endpoints to see if the network is already running or the configs are already created.

Contributor

if they wanted to reset the SUI network the rest server is using, they need to be able to do so using the endpoints

What's wrong with having, as we seem to, one endpoint that starts the network and calls the constructor of the server's state, and one that tears it down? It would be a contract with our early test developers that nothing interesting is going to happen if they don't hit that particular bootstrap first, and that they'll carry over the network's state if they don't tear it down before a restart.

Then we can have real state, with configuration that contains actual data rather than recreating everything on the fly.

My issue isn't with the re-startability; it's with the multiplication of panic-prone config-reading and scaffolding code in each endpoint handler.

Comment on lines +212 to +253
for authority in &network_config.authorities {
    let server = sui_commands::make_server(
        authority,
        &committee,
        vec![],
        &[],
        network_config.buffer_size,
    )
    .await
    .map_err(|error| {
        custom_http_error(
            StatusCode::CONFLICT,
            format!("Unable to make server: {error}"),
        )
    })?;
    handles.push(async move {
        match server.spawn().await {
            Ok(server) => Ok(server),
            Err(err) => {
                return Err(custom_http_error(
                    StatusCode::FAILED_DEPENDENCY,
                    format!("Failed to start server: {}", err),
                ));
            }
        }
    })
}

let num_authorities = handles.len();
info!("Started {} authorities", num_authorities);

while let Some(spawned_server) = handles.next().await {
    server_context
        .authority_handles
        .lock()
        .unwrap()
        .push(task::spawn(async {
            if let Err(err) = spawned_server.unwrap().join().await {
                error!("Server ended with an error: {}", err);
            }
        }));
}
Contributor

It seems we're maintaining a copy of this in sui_commands, called start_network.

Contributor Author

@arun-koshy arun-koshy Feb 23, 2022

I am handling the async portion of this differently than start_network does. I am holding on to the JoinHandle so that I can abort it later during start. I don't think reusing sui_commands makes sense here, but correct me if I am wrong.

Contributor

Fair enough. Additionally, as we've seen, the pattern of what's going on in start_network and genesis in sui_commands doesn't make sense: the task waits on each launched server to finish (rather than waiting for each server to start) here:
https://github.com/MystenLabs/fastnft/blob/dba291bfac0d9a9811e9a296f3f4712744ef2df6/sui/src/sui_commands.rs#L86-L88
As we've seen, that's not really usable in a task, which should keep a handle rather than block on completion.

Would you and @patrickkuo open an issue to track the refactoring of start_network in a way that:

  • has clear semantics on starting the server,
  • makes this usable in a tokio task (by returning the handle of the SpawnedServer rather than blocking on it; see the sketch below),
  • de-duplicates the code between the present function and sui_commands?
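
A sketch of the shape the second bullet asks for (hypothetical signature; server_futures stands in for the spawned-server futures): start_network would hand back task handles instead of awaiting them, so a caller such as the REST server can store or abort them.

use tokio::task::JoinHandle;

// Must be called from within a tokio runtime; each server future becomes a
// task whose handle the caller keeps.
fn start_network(
    server_futures: Vec<impl std::future::Future<Output = ()> + Send + 'static>,
) -> Vec<JoinHandle<()>> {
    server_futures.into_iter().map(tokio::spawn).collect()
}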

Contributor

@patrickkuo patrickkuo left a comment

Some comments on error handling

@oxade
Contributor

oxade commented Feb 23, 2022

@huitseeker any more thoughts on this PR?

Contributor

@patrickkuo patrickkuo left a comment

Code-wise it looks ok to me, although I still think it's strange to have an endpoint to start/stop the network. Maybe down the line we should give partners access to the binaries so they can start their own network locally, to avoid having this service...

I will let @huitseeker make the final call.

Contributor

@huitseeker huitseeker left a comment

Accepting to unblock. I left comments that I believe will help in controlling the amount of technical debt we incur from this area:

  • making sure the endpoint API looks like what it will in the future, when it does not start the network. That is:

    • Building a ServerContext that contains configurations, rather than strings,
    • Setting up those configurations in one place (e.g. a constructor) to reflect our temporary need for start/stop endpoints
  • making sure we do not maintain two parallel sets of code for network operations, one for the CLI, one for the REST server.

    • I expect this would involve refactoring what's in sui_commands to make better async sense.

    It would be great to see issues tracking this.

@arun-koshy
Contributor Author

arun-koshy commented Feb 24, 2022

Thank you for your great feedback on the PR, @patrickkuo & @huitseeker. I have opened two follow-up issues that have stemmed from this PR: issues #552 and #553.

@arun-koshy arun-koshy merged commit f13ed53 into main Feb 24, 2022
@arun-koshy arun-koshy deleted the client-service-api branch February 24, 2022 18:54
mwtian pushed a commit that referenced this pull request Sep 12, 2022
* fix: turn off logging stderr+out to file

Removing the log file so we can just log to the container
stdout and stderr to be picked up by Loki or other means.

* fix: fixed depends_on to the actual loki service name

* chore: updated README.md about logs

Specifically we changed the container entry.sh script to not
redirect the logs to files.  This enables log scraping directly
from the containers into log collection services like Loki.
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 29, 2022
* fix: turn off logging stderr+out to file

Removing the log file so we can just log to the container
stdout and stderr to be picked up by Loki or other means.

* fix: fixed depends_on to the actual loki service name

* chore: updated README.md about logs

Specifically we changed the container entry.sh script to not
redirect the logs to files.  This enables log scraping directly
from the containers into log collection services like Loki.