Find spec/node_type of Kepler node for model selection #231

Closed
sunya-ch opened this issue Feb 22, 2024 · 5 comments
Labels
kind/feature New feature or request

Comments

@sunya-ch
Contributor

What would you like to be added?

Flow to link Kepler-deploying node specification to model selection from Kepler model DB.

Why is this needed?

Problem description

Previously, we had only a single node_type in the pipeline, and we always appended _1 to the trainer name to get the model name. However, with SPECPower and AWS instances, we can now train multiple node_types.

Currently, we have a generate_spec function, implemented in Python on kepler-model-server, to generate the machine spec.
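
For context, a minimal sketch of what such a spec-generation step could look like; the field names and /proc/cpuinfo parsing here are illustrative assumptions, not the exact kepler-model-server implementation:

    # Hypothetical sketch of a generate_spec-style helper; the real
    # kepler-model-server implementation and its field names may differ.
    import json


    def generate_spec() -> dict:
        """Collect a minimal machine spec from /proc/cpuinfo (Linux only)."""
        spec = {"processor": "", "cores": 0, "chips": 0}
        physical_ids = set()
        with open("/proc/cpuinfo") as f:
            for line in f:
                key, _, value = line.partition(":")
                key, value = key.strip(), value.strip()
                if key == "model name" and not spec["processor"]:
                    spec["processor"] = value
                elif key == "processor":
                    spec["cores"] += 1  # counts logical CPUs
                elif key == "physical id":
                    physical_ids.add(value)
        spec["chips"] = len(physical_ids) or 1
        return spec


    if __name__ == "__main__":
        print(json.dumps(generate_spec(), indent=2))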

Idea

The goal is to let Kepler determine its node_type.
The logic of generate_spec may not need to be merged into Kepler itself.
It can run in an init container to generate the spec and save it to a file that is then mounted. The server API may need to be updated to allow including the machine spec in the request used to select the model.

Note that:

  • node_type is per pipeline, determined by node_type_index.json inside the pipeline folder (see the sketch after this list).
  • we can set the default pipeline to spec_benchmark for the acpi value and aws_instance_pipeline for the rapl value.
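
A rough sketch of how that per-pipeline lookup could work, assuming node_type_index.json maps each node_type index to the spec attributes it covers; the file layout and field names here are assumptions:

    # Hypothetical per-pipeline node_type lookup; assumes node_type_index.json
    # maps each node_type index to the machine attributes it covers, e.g.
    #   {"1": {"cores": 48}, "2": {"cores": 96}}
    import json


    def find_node_type(pipeline_dir: str, machine_spec: dict) -> str | None:
        with open(f"{pipeline_dir}/node_type_index.json") as f:
            index = json.load(f)
        # Exact match on core count first; a fallback (e.g. nearest or largest
        # core count) can be layered on top as discussed below.
        for node_type, attrs in index.items():
            if attrs.get("cores") == machine_spec.get("cores"):
                return node_type
        return None
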
@sunya-ch sunya-ch added the kind/feature New feature or request label Feb 22, 2024
@sunya-ch sunya-ch self-assigned this Mar 28, 2024
@sunya-ch
Contributor Author

Now working on adding simple logic to the estimator to discover the core count and find the candidate models built on machines with the same number of cores. If none exist, list the candidates built with the largest number of cores.

The change needed is for ModelRequest to also carry a spec field in the request to the server API.
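
For illustration, the request payload might look like the following; the endpoint, port, and the field names other than spec are assumptions about the server API rather than its confirmed schema:

    # Hypothetical shape of a ModelRequest that carries a machine spec;
    # the endpoint, port, and existing field names are assumptions.
    import json
    import urllib.request

    model_request = {
        "output_type": "AbsPower",
        "source": "rapl",
        "trainer_name": "SGDRegressorTrainer",
        "spec": {"processor": "intel_xeon", "cores": 96},  # proposed new field
    }

    req = urllib.request.Request(
        "http://kepler-model-server:8100/model",  # assumed server-api endpoint
        data=json.dumps(model_request).encode(),
        headers={"Content-Type": "application/json"},
    )
    # response = urllib.request.urlopen(req)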

@sunya-ch
Contributor Author

sunya-ch commented Aug 20, 2024

@vimalk78 @sthaha Let's summarize and discuss the design here.

Objective:
Submit machine_spec to model-server

Use case (see the sketch below):
BM: dynamically generated by reading CPU info and similar sources
VM: statically set (since the CPU info is virtualized)
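
A minimal sketch of that split, assuming the default spec path discussed below and the generate_spec-style helper sketched in the issue description:

    # Hypothetical spec resolution: use the statically mounted file on VMs,
    # fall back to dynamic discovery on bare metal.
    import json
    import os

    MACHINE_SPEC_PATH = "/etc/kepler/models/machine/spec.json"  # assumed default path


    def resolve_machine_spec() -> dict:
        if os.path.exists(MACHINE_SPEC_PATH):  # VM case: statically provided
            with open(MACHINE_SPEC_PATH) as f:
                return json.load(f)
        return generate_spec()  # BM case: generate_spec() from the earlier sketch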

Who sends the model request:

  • kepler (local regressor estimator)
  • kepler-model-server's estimator sidecar

@sunya-ch
Contributor Author

sunya-ch commented Aug 20, 2024

Who generates the spec (for BM case):

  • choice 1: for the local regressor, kepler; for the sidecar estimator, the estimator itself.
  • choice 2: only kepler for both cases

I took choice 1 since we don't have an API to pass the generated spec to the estimator yet (we only have the power_request call for every prediction via the socket). Secondly, the estimator can share the same spec-generation function when the training process makes a query.

How to pass the spec file (for the VM case):
We are planning to pass it via the --machine-spec command-line argument.
The file can be mounted via a ConfigMap key.
The default file path is /etc/kepler/models/machine/spec.json.

          volumeMounts:
            - name: config-machine
              mountPath: /etc/kepler/models/machine
              readOnly: true
      volumes:
        - name: config-machine
          configMap:
            name: kepler-machine-spec
            items:
            - key: m5.metal
              path: spec.json
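
For completeness, the spec.json delivered through that ConfigMap key could look like the following; the field names mirror the earlier sketches and are illustrative assumptions, not a fixed schema:

    # Hypothetical content for the statically provided spec.json; field names
    # mirror the earlier sketches and are assumptions, not a fixed schema.
    import json

    static_spec = {
        "processor": "intel_xeon_platinum_8259cl",  # example value only
        "cores": 96,
        "chips": 2,
    }

    with open("spec.json", "w") as f:
        json.dump(static_spec, f, indent=2)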

@rootfs
Contributor

rootfs commented Aug 21, 2024

choice 1: for the local regressor, kepler; for the sidecar estimator, the estimator itself.

We don't use the sidecar estimator anymore.

@sunya-ch
Contributor Author

sunya-ch commented Sep 3, 2024

@sunya-ch sunya-ch closed this as completed Sep 3, 2024