Job submission slows down on Hetchy #1001

Closed
jameshcorbett opened this issue Feb 4, 2023 · 27 comments
Labels
performance Fluxion performance and scalability

Comments

@jameshcorbett
Member

jameshcorbett commented Feb 4, 2023

[corbett8@hetchy12:Flux]$ for (( i=0; i<50; i++ )); do /usr/bin/time --format="%e" flux mini submit -N1 hostname; done
ƒMbBRgJXJ7Z
0.67
ƒMbBRrMARs5
0.39
ƒMbBS2HscVD
0.38
ƒMbBSGKnpCX
0.55
ƒMbBSfMZf7D
0.89
ƒMbBTD9rFgB
1.24
ƒMbBTusYXbV
1.59
ƒMbBUmcaRzX
1.94
ƒMbBVnFY2TZ
2.29
ƒMbBWwjTLRu
2.64
ƒMbBYGLeE1M
2.99
ƒMbBZjjprXR
3.34
... [snip]
ƒMbCkEwA7P5
11.29
ƒMbCqRWQW9u
11.75
... [snip]
ƒMbDajaVaLX
24.42
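
Once the slowdown sets in, the timings above grow by a nearly constant amount per submission. A minimal Python sketch makes the increment explicit. It is illustrative only: the list is copied from the output above, starting at the fourth submission, and the variable names are made up for this example.

```python
# Elapsed times (seconds) copied from the hetchy output above,
# starting at the fourth submission, where the growth begins.
timings = [0.55, 0.89, 1.24, 1.59, 1.94, 2.29, 2.64, 2.99, 3.34]

# Difference between each submission and the one before it.
deltas = [round(b - a, 2) for a, b in zip(timings, timings[1:])]

# Average growth per submission over this window.
mean_delta = (timings[-1] - timings[0]) / (len(timings) - 1)

print(deltas)
print(f"mean increase per submission: {mean_delta:.3f} s")
```

Over this window each submission takes roughly 0.35 s longer than the previous one. This points to a cost that accumulates with the number of jobs already submitted, not to random jitter.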
jameshcorbett changed the title from "Job _submission_ slows down on Hetchy" to "Job submission slows down on Hetchy" on Feb 4, 2023
@jameshcorbett
Member Author

By contrast, on Corona the timings are very consistent:

[corbett8@corona211:~]$ for (( i=0; i<50; i++ )); do /usr/bin/time --format="%e" flux mini submit -N1 hostname; done
ƒeXEwE86VWf
0.54
ƒeXEwLcfM6F
0.25
ƒeXEwT2nEpo
0.24
ƒeXEwZPw9zf
0.25
ƒeXEwfqY31Z
0.25
ƒeXEwnFevk7
0.24
ƒeXEwtijo3M
0.25
ƒeXEx17NhVZ
0.29
ƒeXEx8ej3dH
0.29
ƒeXExGhj9Pu
0.27
ƒeXExQDbWFH
0.35

@grondo
Contributor

grondo commented Feb 4, 2023

Oh, I didn't quite catch that it was job submission that was slowing down here! That is quite unexpected and should not happen. In general, responses to a job.submit RPC should be very fast -- nothing should block in the submission path. However, I just ran a test on hetchy and one job.submit response took 94 seconds!

We'll have to get to the bottom of this!

@grondo
Contributor

grondo commented Feb 4, 2023

I did notice on rank 0 the broker was at 100% CPU. A perf report shows Fluxion using 98% of the cycles:

-   91.69%     0.00%  flux-broker-0    sched-fluxion-resource.so               ◆
   - 86.96% 00007fffa7d199f2                                                   ▒
        0x7fffa7d13ddd                                                         ▒
        Flux::resource_model::dfu_traverser_t::run                             ▒
        Flux::resource_model::dfu_traverser_t::schedule                        ▒
        Flux::resource_model::detail::dfu_impl_t::select                       ▒
        Flux::resource_model::detail::dfu_impl_t::dom_dfv                      ▒
        Flux::resource_model::detail::dfu_impl_t::dom_exp                      ▒
        Flux::resource_model::detail::dfu_impl_t::explore_statically           ▒
        Flux::resource_model::detail::dfu_impl_t::dom_dfv                      ▒
        Flux::resource_model::detail::dfu_impl_t::dom_slot                     ▒
        Flux::resource_model::detail::dfu_impl_t::explore_statically           ▒
        Flux::resource_model::detail::dfu_impl_t::dom_dfv                      ▒
        Flux::resource_model::multilevel_id_t<Flux::resource_model::fold::less>▒
        ?? (inlined)                                                           ▒
        mod_main (inlined)                                                     ▒
   + 1.33% 0x7fffa7d199f2                                                      ▒
   + 0.57% 0x7fffa7d199f2                                                      ▒
+   91.60%     0.00%  flux-broker-0    sched-fluxion-resource.so               ▒
+   91.58%     0.00%  flux-broker-0    sched-fluxion-resource.so              

So my guess is that the slowness in job submission is due to the feasibility check in the job-validator, and that the root cause is something going very wrong in Fluxion. Doesn't hetchy have some special resources added to its graph with JGF? You may be able to reproduce this issue in a test instance by loading similar fake resources.

Note that the feasibility checks are reasonably fast until you start running jobs. After that, this script reproduces the slowness with just the validator:

# no running jobs
$  for i in `seq 1 10`; do /usr/bin/time --format="%e" sh -c 'flux mini run --dry-run -N1 hostname | flux job-validator --plugins=feasibility,jobspec --jobspec-only' ; done
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48

# with running jobs
[flux@hetchy7:~]$  for i in `seq 1 10`; do /usr/bin/time --format="%e" sh -c 'flux mini run --dry-run -N1 hostname | flux job-validator --plugins=feasibility,jobspec --jobspec-only' ; done
{"errnum": 0}
2.04
{"errnum": 0}
2.65
{"errnum": 0}
3.19
{"errnum": 0}
3.72
{"errnum": 0}
4.24
{"errnum": 0}
9.65

@vsoch
Member

vsoch commented Feb 4, 2023

> We'll have to get to the bottom of this!

To the bottom!

@jameshcorbett
Member Author

> Doesn't hetchy have some special resources added to its graph with JGF? You may be able to reproduce this issue in a test instance by loading similar fake resources.

Rabbit systems in general will, but at the moment Hetchy doesn't. Fluxion doesn't know anything about the rabbits. So that isn't the culprit.

I'll test more when I get back from travel.

@grondo
Contributor

grondo commented Feb 4, 2023

> Rabbit systems in general will, but at the moment Hetchy doesn't. Fluxion doesn't know anything about the rabbits. So that isn't the culprit.

Ah, thanks for that information. If there's nothing special about the hetchy resource graph at the moment, then this issue has the potential to affect any system. I'll transfer this issue to flux-sched, because I'm fairly certain the flux-coral2 bits have nothing to do with the problem (I even removed the two jobtap plugins just to test).

grondo transferred this issue from flux-framework/flux-coral2 on Feb 4, 2023
@grondo
Contributor

grondo commented Feb 4, 2023

A good test may be to try reloading sched-fluxion-qmanager and sched-fluxion-resource to see if the problem goes away. However, we may want to collect as much information from the affected system before doing that to better understand the problem. cc: @trws and @milroy to see if they have any other ideas.

@grondo
Contributor

grondo commented Feb 6, 2023

I started a test instance with the same R as configured on hetchy and could not reproduce the issue, so the cause here isn't the specific configuration of resources. Not sure how to debug the live system.

@garlick
Member

garlick commented Feb 6, 2023

In case it proves useful in recreating this, the current resource state and properties (queue) config is:

$ flux resource list
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free windom         13     1664        0 hetchy[13,16-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down windom          2      256        0 hetchy[14-15]
$ flux resource status
    STATUS NNODES NODELIST
     avail     15 hetchy[13,16-29]
   offline      2 hetchy[14-15]
   exclude      2 hetchy[7,12]
   drained      2 hetchy[14-15]

@grondo
Contributor

grondo commented Feb 6, 2023

This problem is reproducible by collecting some of the config from hetchy:
Note: I've added resource.noverify = true here.

[job-manager]
plugins = [
  { load = "perilog.so" },
  { load = "/opt/lib64/flux/job-manager/plugins/cray_pals_port_distributor.so", conf = { port-min = 11000, port-max = 12000 } },
  { load = "/opt/lib64/flux/job-manager/plugins/dws-jobtap.so" }
]

[policy.limits]
duration = "24h"

[queues.windom]
requires = ["windom"]

[queues.bardpeak]
requires = ["bardpeak"]

[resource]
path = "/etc/flux/system/R"
#exclude = "hetchy[7,12]"
norestrict = true
noverify = true

[sched-fluxion-qmanager]
# easy backfill
queue-policy = "easy"

[sched-fluxion-resource]
# node exclusive starting from low node ids
match-policy = "lonodex"
match-format = "rv1_nosched"

system.R

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-63"
        }
      },
      {
        "rank": "1",
        "children": {
          "core": "0-127"
        }
      },
      {
        "rank": "2-16",
        "children": {
          "core": "0-127"
        }
      },
      {
        "rank": "17-18",
        "children": {
          "core": "0-63",
          "gpu": "0-7"
        }
      }
    ],
    "starttime": 0.0,
    "expiration": 0.0,
    "nodelist": [
      "hetchy[7,12-29]"
    ],
    "properties": {
      "windom": "2-16",
      "bardpeak": "17-18"
    }
  }
}
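
As a cross-check on the R above, the following Python sketch tallies nodes, cores, and GPUs per property. It is a simplified illustration, not Flux code: `expand_idset` is a hypothetical helper that handles only comma-separated ids and ranges, and the `r_lite` and `properties` values are copied by hand from the JSON.

```python
def expand_idset(text):
    """Expand a simple idset such as "0-63" or "2-16" into a list of ints.

    Simplified for illustration: handles comma-separated ids and ranges only.
    """
    ids = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def count(children, key):
    """Number of resources of one type under a node (0 if the type is absent)."""
    return len(expand_idset(children[key])) if key in children else 0


# Copied from the "R_lite" and "properties" entries of system.R above.
r_lite = [
    {"rank": "0", "children": {"core": "0-63"}},
    {"rank": "1", "children": {"core": "0-127"}},
    {"rank": "2-16", "children": {"core": "0-127"}},
    {"rank": "17-18", "children": {"core": "0-63", "gpu": "0-7"}},
]
properties = {"windom": "2-16", "bardpeak": "17-18"}

# Map each rank to its property name; ranks without one fall back to "".
rank_property = {}
for name, ranks in properties.items():
    for rank in expand_idset(ranks):
        rank_property[rank] = name

# Accumulate node, core, and GPU counts per property.
totals = {}
for entry in r_lite:
    ncores = count(entry["children"], "core")
    ngpus = count(entry["children"], "gpu")
    for rank in expand_idset(entry["rank"]):
        key = rank_property.get(rank, "")
        t = totals.setdefault(key, {"nnodes": 0, "ncores": 0, "ngpus": 0})
        t["nnodes"] += 1
        t["ncores"] += ncores
        t["ngpus"] += ngpus

for name, t in totals.items():
    print(f"{name or '(none)':10} {t['nnodes']:6} {t['ncores']:8} {t['ngpus']:8}")
```

The totals (2 nodes and 192 cores with no property, 15 nodes and 1920 cores for windom, 2 nodes with 128 cores and 16 GPUs for bardpeak) agree with the `flux resource list` output from the test instance in the instructions that follow.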

Instructions:

$  FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1
 grondo@hetchy12:~/git/flux-sched$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1
Failed to open drm root directory /sys/class/drm.: No such file or directory
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux module remove sched-fluxion-qmanager
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux module remove sched-fluxion-resource
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux module remove resource
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux config load < conf.toml 
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux kvs put -r resource.R=- </etc/flux/system/R
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux module load resource
Failed to open drm root directory /sys/class/drm.: No such file or directory
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux module load sched-fluxion-resource
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux module load sched-fluxion-qmanager
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux resource list
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
f(s=1,d=0) grondo@hetchy12:~/git/flux-sched$ flux mini submit --queue=windom --cc=1-50 --setattr=exec.test.run_duration=1ms --quiet --watch --progress --jps hostname
PD:0  R:0  CD:50 F:0  │███████████████████████████████████████│100.0%  1.6 job/s

@grondo
Contributor

grondo commented Feb 6, 2023

Getting a similar result from perf:

-   99.93%     0.00%  flux-broker-0  [.] ev_run                                                                                                         ▒
     ev_run                                                                                                                                             ▒
   - ev_run                                                                                                                                             ▒
      - 99.93% ev_invoke_pending                                                                                                                        ▒
         - 99.68% handle_cb                                                                                                                             ▒
            - 99.60% dispatch_message (inlined)                                                                                                         ▒
                 call_handler                                                                                                                           ▒
               - 0x15555542c14d                                                                                                                         ▒
                  - 97.16% 0x155555426bd0                                                                                                               ▒
                       Flux::resource_model::dfu_traverser_t::run                                                                                       ▒
                       Flux::resource_model::dfu_traverser_t::schedule                                                                                  ▒
                     - Flux::resource_model::detail::dfu_impl_t::select                                                                                 ▒
                        - 96.61% Flux::resource_model::detail::dfu_impl_t::dom_dfv                                                                      ▒
                           - 96.50% Flux::resource_model::detail::dfu_impl_t::dom_exp                                                                   ▒
                                Flux::resource_model::detail::dfu_impl_t::explore_statically                                                            ▒
                              - Flux::resource_model::detail::dfu_impl_t::dom_dfv                                                                       ▒
                                 - 95.77% Flux::resource_model::detail::dfu_impl_t::dom_slot                                                            ▒
                                    - 94.17% Flux::resource_model::detail::dfu_impl_t::explore_statically                                               ▒
                                       - 92.22% Flux::resource_model::detail::dfu_impl_t::dom_dfv                                                       ▒
                                          - 86.23% Flux::resource_model::multilevel_id_t<Flux::resource_model::fold::less>::dom_finish_vtx              ▒
                                             - 86.12% ?? (inlined)                                                                                      ▒
                                                  mod_main (inlined)                                                                                    ▒
                                          + 5.04% Flux::resource_model::detail::dfu_impl_t::dom_exp                                                     ▒
                                       + 0.63% Flux::resource_model::detail::evals_t::add                                                               ◆
                                       + 0.51% ?? (inlined)                                                                                             ▒

@ryanday36

When I was looking at flux dmesg on hetchy, I noticed that there were errors from flux cron. When we first brought hetchy up, we mistakenly configured it to run flux accounting in ansible. I never set up the accounting db, and I cleaned up the configs in ansible, but I apparently never cleaned up the accounting cron job on hetchy. That runs:

/bin/bash -c "/bin/flux account update-usage --priority-decay-half-life 1 /var/lib/flux/job-archive.sqlite; /bin/flux account-update-fshare; /bin/flux account-priority-update"

I'm not sure it's relevant to this, but I can imagine the account-priority-update giving fluxion some garbage data when there's no accounting db, so I thought I'd mention it.

I deleted the cron.d/accounting file, but I haven't removed the job using flux cron yet.

@garlick
Member

garlick commented Feb 7, 2023

I don't think that could be it. The accounting scripts communicate with the job-archive database and the mf_priority.so plugin in the job manager. But we're seeing fluxion spending lots of time traversing resource graphs while answering feasibility queries at job submission time. It seems like that wouldn't be affected by say crazy job priorities or even the job manager dealing with an onslaught of messages.

@grondo
Contributor

grondo commented Feb 7, 2023

FYI - I didn't get different results running perf with perf record -g --call-graph=dwarf as noted here to make sure we're getting valid backtraces. (I'm pretty sure I did that in the first place, but wanted to double check)

@grondo
Contributor

grondo commented Feb 7, 2023

Here's a reproducer script, meant to be run from a top-level flux-sched builddir:

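#!/bin/sh
# Unload the scheduler modules first, then the resource module.
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
# Store hetchy's R in the KVS and apply the test configuration.
flux kvs put -r resource.R=- </etc/flux/system/R
flux config load < ./conf.toml
# Reload the modules in the reverse order.
flux module load resource noverify
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
# Record queue, resource, and scheduler module state.
flux queue status
flux resource list
flux module list | grep sched
# Submit 100 copies of a 1 ms test job to the windom queue and report job/s.
flux mini submit --queue=windom --cc=1-100 --setattr=exec.test.run_duration=1ms --quiet --watch --progress --jps hostname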

conf.toml has selected config entries from the hetchy system config:

[job-manager]
plugins = [
  { load = "perilog.so" },
  { load = "/opt/lib64/flux/job-manager/plugins/cray_pals_port_distributor.so", conf = { port-min = 11000, port-max = 12000 } },
  { load = "/opt/lib64/flux/job-manager/plugins/dws-jobtap.so" }
]

[policy.limits]
duration = "24h"

[queues.windom]
requires = ["windom"]

[queues.bardpeak]
requires = ["bardpeak"]

[resource]
path = "/etc/flux/system/R"
#exclude = "hetchy[7,12]"
norestrict = true
noverify = true

[sched-fluxion-qmanager]
# easy backfill
queue-policy = "easy"

[sched-fluxion-resource]
# node exclusive starting from low node ids
match-policy = "lonodex"
match-format = "rv1_nosched"

[tbon]
tcp_user_timeout = "2m"

The reproducer can now be run like:

$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1 ./issue#1001.sh 
 grondo@hetchy12:~/git/flux-sched$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1 ./issue#1001.sh 
Failed to open drm root directory /sys/class/drm.: No such file or directory
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-qmanager    6225784 c4a4497    0  R sched
sched-fluxion-resource   36422384 e68a19e    0  R 
PD:0   R:0   CD:100 F:0   │███████████████████████████████████│100.0%  0.8 job/s

As suggested by @trws, I set the queue-depth to 2 and re-ran the test, which showed a marked improvement:

$ grep queue-depth conf.toml 
queue-params.queue-depth = 2
$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1 ./issue#1001.sh 
Failed to open drm root directory /sys/class/drm.: No such file or directory
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-qmanager    6225784 c4a4497    0  R sched
sched-fluxion-resource   36422384 e68a19e    0  R 
PD:0   R:0   CD:100 F:0   │███████████████████████████████████│100.0% 28.5 job/s

Note that in the conf.toml above, the ingest feasibility validator is not enabled, so this reproducer is not using the scheduler's satisfiability RPC (contrary to what I thought at first).

@milroy
Member

milroy commented Feb 23, 2023

@trws made some good suggestions about further performance improvements for graph traversal during the last Fluxion hackathon. I tested them with the focal flux-core docker image on my laptop.

  1. Baseline (lonodex policy, code base from PR #1007, "Improve Fluxion match performance"):
milroy1@docker-desktop:/usr/src$ ./test.sh 
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T22:38:11.782934Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-resource   37767648 16eaad7    0  R 
sched-fluxion-qmanager    8646664 6d1d3e2    0  R sched
PD:0   R:0   CD:100 F:0   │████████████████████████████████████████████████████████|100.0%  2.5 job/s
  2. Use boost::vecS container instead of boost::listS (used implicitly) for the graph EdgeList (
    using resource_graph_t = boost::adjacency_list<boost::vecS,
    ):
milroy1@docker-desktop:/usr/src$ ./test.sh
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T22:34:21.454179Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-qmanager    8646664 6d1d3e2    0  R sched
sched-fluxion-resource   37739848 88cc2bf    0  R 
PD:0   R:0   CD:100 F:0   │████████████████████████████████████████████████████████│100.0%  2.4 job/s
  3. Use unordered_set and unordered_map instead of the RBtree-based set and map (
    using multi_subsystemsS = std::map<subsystem_t, std::set<std::string>>;
    ):
milroy1@docker-desktop:/usr/src$ ./test.sh 
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T23:47:22.846822Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-qmanager    8646664 6d1d3e2    0  R sched
sched-fluxion-resource   38006288 a02a228    0  R 
PD:0   R:0   CD:100 F:0   │████████████████████████████████████████████████████████│100.0%  1.6 job/s
  4. Use unordered_map instead of map (
    using multi_subsystems_t = std::map<subsystem_t, std::string>;
    ):
milroy1@docker-desktop:/usr/src$ ./test.sh 
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T23:55:15.292809Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-resource   37887904 e1cc9e0    0  R 
sched-fluxion-qmanager    8646664 6d1d3e2    0  R sched
PD:0   R:0   CD:100 F:0   │████████████████████████████████████████████████████████│100.0%  2.0 job/s
  5. Use unordered_map instead of the RBtree-based map (
    using multi_subsystemsS = std::map<subsystem_t, std::set<std::string>>;
    ):
milroy1@docker-desktop:/usr/src$ ./test.sh 
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-23T00:05:26.933375Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-qmanager    8646664 6d1d3e2    0  R sched
sched-fluxion-resource   37832352 9ac6420    0  R 
PD:0   R:0   CD:100 F:0   │████████████████████████████████████████████████████████│100.0%  1.9 job/s
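
To make the comparison easier to read, this Python sketch collects the throughput figures from the five runs above and expresses each as a change relative to the baseline. The labels are shorthand invented for this summary. Each figure comes from a single run, so small differences may be within run-to-run noise.

```python
# Throughput (job/s) reported by each run above, keyed by a short label.
results = {
    "baseline (PR #1007, lonodex)": 2.5,
    "vecS edge list": 2.4,
    "unordered_set + unordered_map (multi_subsystemsS)": 1.6,
    "unordered_map (multi_subsystems_t)": 2.0,
    "unordered_map (multi_subsystemsS)": 1.9,
}

baseline = results["baseline (PR #1007, lonodex)"]

# Percent change in throughput relative to the baseline, rounded to whole percent.
changes = {
    label: round(100 * (rate / baseline - 1)) for label, rate in results.items()
}

for label, pct in changes.items():
    print(f"{label:52} {results[label]:4.1f} job/s  {pct:+4d}%")
```

None of the container changes improves on the baseline in these runs; the measured change ranges from about -4% to about -36%.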

@trws
Member

trws commented Feb 23, 2023

Wow, almost no impact to negative impact. That's unfortunate, but really good to know. I found some things, thanks to help from @grondo, that might be impacting us from the Constraints. Once that bug is squashed we can take another pass on the performance. I'm guessing we're not getting a win from the unordered maps because of the hashing cost, which we could fix with pre-hashed strings or interned strings. It might be worth doing the string rework and then circling back to the data types, even if that's more painful. =/
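
To illustrate the interned-string idea mentioned above, here is a minimal Python sketch. It is not Fluxion code: the `Interner` class and the sample resource-type names are hypothetical, and Python is used only to show the concept. Each distinct string is assigned a small integer once, so later map lookups hash and compare integers instead of full strings.

```python
class Interner:
    """Assign each distinct string a small, stable integer id."""

    def __init__(self):
        self._ids = {}      # string -> id
        self._strings = []  # id -> string

    def intern(self, text):
        # The string is hashed here; callers keep and reuse the returned id.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, ident):
        # Recover the original string, e.g. for logging or output.
        return self._strings[ident]


interner = Interner()
node = interner.intern("node")
core = interner.intern("core")

# Resolve names to ids once, for example while loading the graph ...
path = [interner.intern(n) for n in ["node", "core", "core", "node", "core"]]

# ... then the hot path works with integer keys only.
counts = {node: 0, core: 0}
for ident in path:
    counts[ident] += 1

print({interner.lookup(i): n for i, n in counts.items()})
```

In C++ the analogous change would replace `std::string` keys with an integer or pointer handle. Whether that recovers the cost seen with the unordered containers is something only measurement can confirm.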

@tpatki
Member

tpatki commented Feb 23, 2023

@milroy Are you running this in a docker container?
Have you tried building on bare metal on hetchy to see whether your test gets the same jobs/s, to eliminate any container-related overhead? It may not have any impact, but it may be worth testing.

@milroy
Member

milroy commented Feb 23, 2023

Yes, I've tested the reproducer directly on hetchy. In my tests the reproducer runs faster in the container running on my laptop. Hetchy and the container exhibit similar performance characteristics.

@milroy
Member

milroy commented Feb 23, 2023

Ok, so not faster in the container than on hetchy for current Fluxion master (in fact, faster on hetchy), but really close:
Case 1 above on hetchy:

[milroy1@hetchy12:flux-sched]$ ./tests.sh 
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-resource   36405416 db97e26    0  R 
sched-fluxion-qmanager    8496408 3022bef    0  R sched
PD:0   R:0   CD:100 F:0   │██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│100.0%  2.6 job/s

Case 3 above on hetchy:

[milroy1@hetchy12:flux-sched]$ ./tests.sh 
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free                 2      192        0 hetchy[7,12]
      free windom         15     1920        0 hetchy[13-27]
      free bardpeak        2      128       16 hetchy[28-29]
 allocated                 0        0        0 
      down                 0        0        0 
sched-fluxion-qmanager    8496408 3022bef    0  R sched
sched-fluxion-resource   37031616 df9e2f5    0  R 
PD:0   R:0   CD:100 F:0   │██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│100.0%  1.8 job/s

@milroy
Member

milroy commented Feb 27, 2023

@jameshcorbett can we close this issue? PR #1007 fixes the job slowdown, but it doesn't comprehensively solve Fluxion performance problems.

I can create a new issue to continue the investigation into general performance problems with Fluxion.

@grondo
Contributor

grondo commented Feb 27, 2023

I think this issue still applies since Hetchy (and all our systems) are using node exclusive policy? We're still at 10-20x slowdown. It might be nice to keep all the history in one issue.

However, if you'd really like to create a new issue that's fine with me.

@trws
Member

trws commented Feb 27, 2023

Out of curiosity, has anyone run a perf test on the versions before we noticed this to see if we had a regression on lonodex or if it's something that just showed up? I know the hetchy config and queues are part of it, but if we had low but predictable performance with lonodex before we might not have seen it.

@milroy
Member

milroy commented Mar 6, 2023

> We're still at 10-20x slowdown.

Is the submission of sequential jobs still slowing down on Hetchy? So the example @jameshcorbett gave in the first comment where submission times increase from <1s to 24s is still occurring with *nodex policies?

@grondo
Contributor

grondo commented Jun 26, 2023

Note that, as shown by the results in #1009, this performance issue occurs with or without node-exclusive scheduling when moderate amounts of resources are involved in scheduling (in the examples, 2000 nodes).

@trws
Member

trws commented Jul 31, 2024

This should be addressed at this point. @jameshcorbett, does this still repro for you in any cases?

@jameshcorbett
Member Author

Nope, closing.
