
Move flux-sched go bindings out of tree #1138

Open
2 of 3 tasks
vsoch opened this issue Feb 6, 2024 · 7 comments

Comments


vsoch commented Feb 6, 2024

As discussed in Slack, we want to develop the fluxion Go bindings in a separate repository from fluxion. I'm opening an issue here for tracking and discussion outside of Slack, where everything is eventually eaten by the slackbot monster. 👹 The reasons are the following (summarized from Slack):

  • Unlike Python, the Go bindings are not essential for using flux. Having them "packaged together" also doesn't make sense, because a Go developer is typically going to get them directly from GitHub anyway.
  • We can ensure compatibility much more easily with automated testing and synced releases.
  • Right now, we are adding huge complexity by building with CMake and Go together.
  • It's harder to work on the bindings when every change has to land in flux-sched, a core project.
  • We should be using standard Go testing and not custom CMake commands.
  • The situation is going to get worse as time progresses and we need more functionality/testing.
  • It should be possible to develop for fluxion without worrying about Go in the same PR.

This needs to happen sooner rather than later, before there are lots of new Go projects using the bindings. The changes are fairly simple - all we really need is to expose a shared library for fluxion that Go can use, and then hugely simplify the entire current project. The testing will be much easier, as will development. The steps we will pursue are described in the task list below (wanted to try this out)!
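To make the shared-library idea concrete, the Go side could be little more than a thin cgo wrapper along these lines (just a sketch - the library name, C symbols, and wrapper names here are placeholders, not the actual fluxion C interface):

package fluxcli

/*
#cgo LDFLAGS: -lfluxion
// Placeholder C interface assumed to be exported by a fluxion shared library:
void *fluxion_new(void);
void  fluxion_destroy(void *ctx);
*/
import "C"
import "unsafe"

// Cli wraps the opaque C handle so Go callers never touch C types directly.
type Cli struct {
	ctx unsafe.Pointer
}

// NewCli creates a handle backed by the shared library.
func NewCli() *Cli {
	return &Cli{ctx: C.fluxion_new()}
}

// Destroy releases the underlying handle.
func (c *Cli) Destroy() {
	C.fluxion_destroy(c.ctx)
}

With something like that in place, the CMake build would only need to produce the shared library and header, and the Go side could use plain go build / go test.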

Aside from task 1, I'll handle 2-3. And for the repository - I'll want admin permission to write/manage flux-framework/fluxion-go, if one of the fearless leaders of flux could make that for me (ping @grondo @garlick @trws)! No rush, because we will likely need a few days for bullet 1, and of course I can do a lot of development without the repository!

This should be useful and exciting work, and it will help us move forward more easily with Go projects that use fluxion.

Tasks


trws commented Jul 31, 2024

Where does this stand @vsoch?


vsoch commented Jul 31, 2024

The first two bullets are done - we have (and use) fluxion-go here: https://github.com/flux-framework/fluxion-go. That repository also does automated releases synced with flux-sched. The third bullet is not up to me to decide, but either way:

  • If we keep the Go code here, there needs to be a way to ensure changes make it into fluxion-go.
  • If we don't keep the Go code here, then when something is merged here I should be pinged to work on it in Go, if needed.

Pinging @milroy for his thoughts. At least as our current Go projects stand, I'm using flux-sched for the .so builds, but for the actual bindings I'm using fluxion-go.


vsoch commented Aug 4, 2024

@trws can we talk about this at a future meeting? There are several improvements that are warranted for the Go bindings, and I'd rather open issues / do work in one repository. For example:

  • Right now the only way to distinguish "unsatisfiable" is that a job is neither allocated nor reserved, and that requires allowing reservations. The only way to see the actual message is a separate call to flux to get error messages, and that is a bit racy. That second part is an issue in and of itself - it's hard to see what went wrong when it happens, the developer needs special information to call that function, and when they do, they may get errors associated with another job.
  • The objects returned from the Go API tend to be a long list of simple types. That makes sense for C++, but in Go it would be much nicer to have a struct and then attach functions to it (see the sketch after this list).
  • We probably want to consider something that would allow a deferred Close() that calls fluxion.ctx.Destroy() instead of needing to call it explicitly.
  • Importantly, we are ready to add grow/shrink to the bindings - I added event handling to fluxnetes this weekend, meaning that if someone changes the size of their "group thing," we'd want to issue a grow or shrink to fluxion. Right now I can only handle a complete deletion or cleanup.
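On the second point, what I have in mind is roughly this shape (hypothetical names, not the current fluxion-go API - just a sketch):

package fluxion

// Hypothetical sketch: return one struct from a match call instead of a long
// list of scalar values, and attach convenience methods to it.
type MatchResult struct {
	JobID      uint64
	Allocated  bool
	Reserved   bool
	At         int64   // scheduled-at time
	Overhead   float64 // match overhead in seconds
	Allocation string  // the matched R
}

// Satisfied reads better at call sites than checking two booleans.
func (m *MatchResult) Satisfied() bool {
	return m.Allocated || m.Reserved
}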

An additional reason the above is important is that I'd like to do the same for the queue in Go, and then have a fluxion-service, and extend that to a qmanager-service for Go. The reasons are twofold (aside from this being immensely fun and I'm learning a lot):

  1. We could create an environment for doing scheduling experiments. I have a lot of very bad ™️ ideas that I'd like to try.
  2. We can better extend flux to the cloud community, enabling them to build full solutions with our bindings.

Anyhoo, let's chat about it! The second point is really important I think - I see that Flux has a future (a much larger footprint than it has now) in cloud, but we need to enable those developers to have handles to the different components.


trws commented Aug 4, 2024


Definitely we can discuss this. Also, it sounds like the bindings already functionally live somewhere else and should probably be removed from here after a CI update to pull and test them on PRs. Some questions/responses below.

  • Right now the only way to distinguish "unsatisfiable" is that a job is neither allocated nor reserved, and that requires allowing reservations. The only way to see the actual message is a separate call to flux to get error messages, and that is a bit racy. That second part is an issue in and of itself - it's hard to see what went wrong when it happens, the developer needs special information to call that function, and when they do, they may get errors associated with another job.

This should not be true. We have an "allocate_or_satisfiable" op for this; it's what both FCFS and backfill after reservation depth use. Is it missing from the reapi?

  • The objects returned from the Go API tend to be a long list of simple types. That makes sense for C++, but in Go it would be much nicer to have a struct and then attach functions to it.

Not sure why this makes sense for C++. I'm guessing this is actually a side effect of flattening the C++ interface to C and then binding to Go, instead of defining a rich API in C++ and exposing it to Go equivalently. C++ prefers an object just as much as Go does; I'd love to do something about this.

  • We probably want to consider something that would allow a deferred Close() that calls fluxion.ctx.Destroy() instead of needing to call it explicitly.

Does defer fluxion.ctx.Destroy() not work, or is this to conform to a resource management API?

  • Importantly, we are ready to add grow/shrink to the bindings - I added event handling to fluxnetes this weekend, meaning that if someone changes the size of their "group thing," we'd want to issue a grow or shrink to fluxion. Right now I can only handle a complete deletion or cleanup.

Ok, shrink we can definitely handle in fluxion. Grow is a bigger discussion; I'd be happy to see an issue or discussion on that separately. In fact, I think there is one from a looooong time ago if you want to reference or resurrect it.

An additional reason the above is important is that I'd like to do the same for the queue in Go, and then have a fluxion-service, and extend that to a qmanager-service for Go.

You mean you want to be able to call qmanager? I think that would be a good thing FWIW, and should be relatively easy to bind up if we do it right.

The reasons are twofold (aside from this being immensely fun and I'm learning a lot):

  1. We could create an environment for doing scheduling experiments. I have a lot of very bad ™️ ideas that I'd like to try.

  2. We can better extend flux to the cloud community, enabling them to build full solutions with our bindings.

Anyhoo, let's chat about it! The second point is really important I think - I see that Flux has a future (a much larger footprint than it has now) in cloud, but we need to enable those developers to have handles to the different components.

I agree. As a word of caution though, I don't want us to fall into the classic plugin API trap, where we expose so much, so freely, that we lose the ability to change anything (see Firefox pre-web-extension plugins). Still want to do it, but want to be clear about what is and what isn't stable or public API surface.


vsoch commented Aug 4, 2024

This should not be true. We have an "allocate_or_satisfiable" op for this; it's what both FCFS and backfill after reservation depth use. Is it missing from the reapi?

We do have that (and that's what I'm using). I think I'd rather have something more along the lines of an enum, so I can check one thing clearly instead of two. E.g., right now I have to check that the job is not satisfiable by checking that allocated is false and satisfiable is false. At least for the exposed interface, it should be possible to have different states and then do something like:

if status.State == fluxion.Unsatisfiable {}

At least that reads much more nicely in the code, etc.
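Something along these lines is what I mean (hypothetical names - just sketching the interface I'd like, not anything that exists today):

package fluxion

// State collapses the allocated/reserved/satisfiable flags into one value,
// so callers check a single thing.
type State int

const (
	Allocated State = iota
	Reserved
	Unsatisfiable
)

// Status is what a match call could hand back instead of separate booleans.
type Status struct {
	State State
}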

Does defer fluxion.ctx.Destroy() not work, or is this to conform to a resource management API?

It does work! I just didn't know about it. I think it's mostly just the UI - I'd like to expose a Close() that calls it, so people know to use it (what happens if the client isn't destroyed, by the way?). I'm not sure it's needed though, because a direct call is much faster and easy enough to make when something blocking ends - that's what I'm doing now, exposing it like this:

// Close properly closes (destroys) the fluxion client handle
func (fluxion *Fluxion) Close() {
	fluxion.cli.Destroy()
}

and then it is called directly when the gRPC server exits:

err = server.Serve(lis)
if err != nil {
	fmt.Printf("[GRPCServer] failed to serve: %v\n", err)
}
flux.Close()
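For comparison, the defer form trws suggested would just move the cleanup above Serve (same hypothetical wrapper as above):

defer flux.Close() // runs when the surrounding function returns

err = server.Serve(lis)
if err != nil {
	fmt.Printf("[GRPCServer] failed to serve: %v\n", err)
}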

You mean you want to be able to call qmanager? I think that would be a good thing FWIW, and should be relatively easy to bind up if we do it right.

No, that would be too much of a burden on the flux devs. I just want to take what I've built and move it outside of Kubernetes so others can use it for experiments (myself included). I can do that entirely independently without asking for flux devs' time.

I agree. As a word of caution though, I don't want us to fall into the classic plugin API trap, where we expose so much, so freely, that we lose the ability to change anything (see Firefox pre-web-extension plugins). Still want to do it, but want to be clear about what is and what isn't stable or public API surface.

Anything I develop, I will take maintainership responsibility for, and beyond the official Go API for fluxion (which should be stable since it mirrors the C++ fluxion API) I don't plan to make any long-term promises.


vsoch commented Aug 4, 2024

And @trws, one more question for you that I was thinking about just now - does qmanager hold onto some memory of jobs until eternity? I know I can easily see finished job stats, but I don't see that logic in qmanager, so I'm guessing that belongs somewhere with the kvs / job manager? Apologies, I haven't read the code yet - I've been doing nothing most of today and kind of digging that too :P Happy Sunday!


trws commented Sep 4, 2024

Sorry, I lost the thread on this. qmanager does not hold onto jobs persistently, except in certain extreme circumstances where things go wrong enough that we need to know about it. The job-manager in core is what holds onto job data persistently through the KVS, so it gets persisted to sqlite and disk rather than everything being held in memory.
