Skip to content

Merge feature/vcpu-runnable to feature/numa8 #6540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 314 commits into
base: feature/numa8
Choose a base branch
from

Conversation

gangj
Copy link
Contributor

@gangj gangj commented Jun 18, 2025

Merge feature/vcpu-runnable as there were some little tweaks for RRD1: #6530

last-genius and others added 30 commits May 7, 2025 14:55
As described in
[#6451](#6451), a xapi
event could prevent update_vm from pulling the latest Xenopsd metadata,
overwriting it with stale information. In case of suspend, this would
make the snapshot unresumable, raising an assert in xenopsd due to
incongruities in memory values.

Instead pull the xenopsd metadata right before updating DB.power_state
in`Xapi_vm_lifecycle.force_state_reset_keep_current_operations`,
eliminating the window for the race.

Closes #6451
Better progress indication, CLI tunables, and the ability to dump raw data to a directory,
that can be used by `ministat`.

Example usage:

```
mkdir /tmp/pool
dune exec ./bench_pool_field.exe --profile=release -- -d /tmp/pool --quota 20
...
dune exec ./bench_pool_field.exe --profile=release -- -d /tmp/pool --quota 20
for i in 'Db.Pool.get_all_records' 'Rpc.t -> pool_t' 'pool_t -> Rpc.t'; do ~/git/ministat/ministat -s "/tmp/pool/${OLD}/${i}.dat" "/tmp/pool/${NEW}/$i.dat" -c 99.5; done
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Write directly to the buffer, instead of writing to a buffer,
returning a string, and writing to a buffer again.

Escaping shows up in performance profiles, because `trusted_on_first_use` field
in the pool contains a JSON, which contains \" characters that need escaping.

Reduces memory allocations from 324.4308 mnw/run to 196.3633 mnw/run.  

Slight change in performance:
```
Db.Pool.get_all_records:
    N           Min           Max        Median           Avg        Stddev
x 384     81858.452     513398.88     87109.044      88314.45     21963.255
+ 389     78006.031     559052.12     83431.429     84780.535     24315.561
Difference at 95.0% confidence
        -3533.91 +/- 3267.84
        -4.00151% +/- 3.63493%
        (Student's t, pooled s = 23176.9)
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
The non-standard S-expression serializer wraps all strings in `'`.
The lexer also ignores every non-escaped character after a `'`, until it sees the next
`'` or escape char.

So it should be safe to avoid escaping `"`.

Unescaping is unchanged: any character can be escaped with `\`,
and it returns it unchanged after removing the escape char, so this preserves backwards
compatibility when loading an old database.

Escaping shows up in performance profiles, because `trusted_on_first_use` field
in the pool contains a JSON, which contains \" characters that needed escaping.
With this change it won't anymore.

`ministat` confirms that there is an improvement:
```
sexpr_of_json_string:
    N           Min           Max        Median           Avg        Stddev
x 806     1438.7128     63663.127     1490.3661     1594.5095     2192.0548
+ 850     911.23529     48528.173     967.25054     1037.7674     1632.6855
Difference at 95.0% confidence
        -556.742 +/- 185.531
        -34.9162% +/- 9.28192%
        (Student's t, pooled s = 1925.34)

str_of_sexp_json:
    N           Min           Max        Median           Avg        Stddev
x 792     1622.9135     49591.388     1719.5377     1786.3412     1702.6472
+ 893     605.37329     3354.8035     626.51812     636.34457     107.24734
Difference at 95.0% confidence
        -1150 +/- 111.92
        -64.3772% +/- 2.26583%
        (Student's t, pooled s = 1169.88)
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
For backwards compatibility the serialized form looks like 20250319T04:16:24Z,
which had to go through several transformations before it was parsed: 2 sscanf, 2 sprintf,
and then the Ptime parser.

Since we're parsing the format with sscanf anyway, add a fastpath that builds a Ptime.t directly
without going through reformatting the string and reparsing it.

This speeds up API replies that contain dates, like pool which contains 'telemetry_next_collection' as a date.

`ministat` confirms:
```
Date.of_iso8601:
    N           Min           Max        Median           Avg        Stddev
x 786      1703.462     98255.061     1796.5826     2031.7502     3858.1277
+ 905      525.1954     73923.347     558.17972     711.02195     2732.7547
Difference at 95.0% confidence
        -1320.73 +/- 315.725
        -65.0045% +/- 10.3523%
        (Student's t, pooled s = 3303.82)

Db.Pool.get_all_records:
    N           Min           Max        Median           Avg        Stddev
x 390     76966.273      498179.5     82995.374     84536.667     21266.749
+ 401     69709.657        546133     74811.568     76379.246     23782.821
Difference at 95.0% confidence
        -8157.42 +/- 3147.12
        -9.64957% +/- 3.57009%
        (Student's t, pooled s = 22577.4)

Rpc.t -> pool_t  :
    N           Min           Max        Median           Avg        Stddev
x 554     16945.375     267477.23     17620.482      18195.16     10648.914
+ 594     11432.375     251226.67     12011.373     12493.367     9824.9986
Difference at 95.0% confidence
        -5701.79 +/- 1184.38
        -31.3369% +/- 5.53746%
        (Student's t, pooled s = 10230.9)

str_of_sexp_json:
    N           Min           Max        Median           Avg        Stddev
x 893     605.37329     3354.8035     626.51812     636.34457     107.24734
+ 911     510.23412     3298.9936     523.73684     532.25916     98.007142
Difference at 95.0% confidence
        -104.085 +/- 9.47756
        -16.3568% +/- 1.36325%
        (Student's t, pooled s = 102.685)
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We called the wrong finally here. In general we mark the backtrace in finally as important
(which involves formatting it, this could be optimized separately).
However Mutex.unlock doesn't raise (in well-behaved code, unless you double unlock), so
we can use Fun.protect instead.

Hashtbl.find in xapi_local_session.ml raises every time in the RBAC checks,
this change avoids the costly backtrace formatting (which was discarded by a try/with later anyway).

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We can use Hashtbl.mem instead of catching the exception from Hashtbl.find.

`ministat` confirms:
```
  local_session_hook  :
    N           Min           Max        Median           Avg        Stddev
x 1057     117.49169     15160.913      123.5201     142.19665     464.12468
+ 1166     38.851635      14365.19     41.619012     57.978805     423.21452
Difference at 95.0% confidence
        -84.2178 +/- 36.8873
        -59.2263% +/- 19.5015%
        (Student's t, pooled s = 443.137)
```

Although there are also some unexplained, but reproducible slowdowns in code that isn't touched by this commit at all:
```
str_of_sexp_json  :
    N           Min           Max        Median           Avg        Stddev
x 911     510.16149     3075.0192     523.44043     531.48578     97.085991
+ 892     605.73462     2951.0948     632.27053     652.10806      95.96889
Difference at 95.0% confidence
        120.622 +/- 8.91245
        22.6953% +/- 1.88103%
        (Student's t, pooled s = 96.5349)
```

Perhaps this is due to code layout changes?

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
read_record was marshaling the maps all the time, and some of these maps can be quite big.
Cache the marshaled form in the row itself. 
This is a trade-off between memory usage and performance
(caching increases global memory usage, but avoids repeated allocations every time get_record is called,
which may decrease memory usage/GC pressure under load).

Eventually we may be able to drop the marshaled form, but we need to change the db_rpc remote protocol for that
(it currently cannot marshal/unmarshal values on its own, because it doesn't have access to the type information).


`ministat` confirms:
```
Db.Pool.get_all_records  :
    N           Min           Max        Median           Avg        Stddev
x 400     71008.769     489829.75     75874.584     76957.497     20852.389
+ 871     51638.867     469264.11     54400.946     55511.462     19845.992
Difference at 95.0% confidence
        -21446 +/- 2387.53
        -27.8674% +/- 2.84131%
        (Student's t, pooled s = 20167.8)

sexpr_of_json_string  :
    N           Min           Max        Median           Avg        Stddev
x 862     819.74359     44758.053     851.53894      915.0725     1496.4717
+ 1688     970.16146      51923.24     1024.1987     1097.3771     1748.4532
Difference at 95.0% confidence
        182.305 +/- 136.827
        19.9224% +/- 15.8188%
        (Student's t, pooled s = 1667.57)

str_of_sexp_json  :
    N           Min           Max        Median           Avg        Stddev
x 1896     605.00587     2932.3931     627.51199     640.73142      89.48524
+ 1812     537.56839     3105.1991     550.25275     558.89761     93.633252
Difference at 95.0% confidence
        -81.8338 +/- 5.89411
        -12.7719% +/- 0.864486%
        (Student's t, pooled s = 91.5357)
```

Slight increase in 'sexpr_of_json_string'.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This doesn't have feature flags, they are optimizations.

The pool object has gained a JSON field recently, which slowed it down
significantly, because the `"` was getting escaped (so the fastpath that
checks whether there are any escapable characters or not couldn't be
triggerred).

The date field is also very slow to handle due to various backward
compatibility format transformations. Add a direct parser for the format
used in memory (it'd be good to skip all these
serializing/deserializing, but that is a bigger change).

Handling the database requires several `finally` calls, which perform a
lot of backtrace formatting, and some code keeps raising exceptions even
on happy paths (e.g. Hashtbl.find). We need to fix those too, but we
should also speed up `finally`.

Xapi_local_session was also raising exceptions all the time, replace
this with Hashtbl.mem
Refix with some review comments addressed.

To join a host into a pool with cluster enabled, the host must have one
and only one IP configured on the joining cluster network.

If not, after the host joinied the pool, GFS2 SR cannot be plugged on
the joined host because an IP is required in the cluster network.

Pool join in this scenario has been blocked in XenCenter, here we will
block it inside xapi.

Signed-off-by: Gang Ji <gang.ji@cloud.com>
…ameter

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Used 'dune format-dune-file dune >dune && mv dune.tmp dune'

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
ocaml-lsp-server's 'Action->Infer interface' was used for this,
and then removed internal/unused entries.

Having an mli file is useful:
* can find dead code more easily if it is clear which functions are internal to a module and which aren't
* the impact of refactoring changes is more obvious (did we have to change an mli at all?)
* makes it easier to understand what a module does

There are some .ml files which only contain types, these are instead renamed to .mli and dune's `modules_without_implementation` feature is used.
Executables get an empty .mli (recent versions of dune already do the equivalent of this, but this makes it more obvious).

Drop code from .ml files that now show up as unused.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…ameter (#6459)

This would eventually lead to compilation errors with newer versions of
GCC, best fix it now by renaming the parameter.
ocaml-lsp-server's 'Action->Infer interface' was used for this, and then
removed internal/unused entries.

Having an mli file is useful:
* can find dead code more easily if it is clear which functions are
internal to a module and which aren't
* the impact of refactoring changes is more obvious (did we have to
change an mli at all?)
* makes it easier to understand what a module does

There are some .ml files which only contain types, these are instead
renamed to .mli and dune's `modules_without_implementation` feature is
used.
Executables get an empty .mli (recent versions of dune already do the
equivalent of this, but this makes it more obvious).

Drop code from .ml files that now show up as unused.

Eventually we can also add some documentation, this will be done in
followup PRs for new code.
On XS9 a previiously available SM type might become unavailable. Make
sure we remove it on upgrade. The previius fix in CA-408048 did not
work.

Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
The current logic in storage_migrate.ml for mirror failure check is
specific to tapdisk, hence multiplex it.

`pre_deactivate_hook` also has something similar to check for mirror
failure, so do something similar there.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Previously this was deleted in commit 1fe6389 as it was not multiplexed,
but looks like we still need to keep it in storage_mux because sm-cli
needs to make rpc calls to storage_mux when trying to list all the
mirrors to make it work properly, due to the fact that the sr plugins
are stored in the address space of the xapi process.

There are other invocations in sm-cli such as `Storage_migrate.start`
which may have similar problems. But I have left them alone as I don't
any reasonable way of calling them from the cli.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Do the same as DATA.MIRROR.list

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
No functional change

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
No functional change

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Use an attribute to change the name of the fields instead of serializing with one name,
and then mapping to the other.

`ministat` confirms an improvement on `rpc_of_event`:
```
    N           Min           Max        Median           Avg        Stddev
x 922     440.79126     46094.975     467.83727     563.79259     1676.0843
+ 1131     51.346821     33407.866     58.715393     115.98006     1105.4178
Difference at 95.0% confidence
        -447.813 +/- 120.966
        -79.4286% +/- 13.1488%
        (Student's t, pooled s = 1390.95)
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Atomic has lower overhead, and can be used for simple situations (incrementing an integer, setting a boolean).

```
Mutex+incr (ns):
 { monotonic-clock per run = 32.272091 (confidence: 32.365531 to 32.170138);
   r² = Some 0.999441 }
Atomic.incr (ns):
 { monotonic-clock per run = 2.938678 (confidence: 2.944486 to 2.933688);
   r² = Some 0.999857 }
```


No functional change

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
No measurable performance change, but this makes it clearer for deadlock detection algorithms
that we are waiting to acquire a lock.
It'll also make it clearer for OCaml 5's thread sanitizer that values are modified with a lock held.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
No functional change

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
The double ref is not needed, replace it with an Atomic.t

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
GabrielBuica and others added 27 commits June 20, 2025 15:58
Instruments `process`/`Consumers` spans of message-switch service in
xenopsd.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Intruments functions in `xapi_xenops.ml` that carry the
traceparent/tracecontext through dbg.

`Debug_info.with_dbg` now accepts setting up attributes for spans.

Also, instruments `Events_from_xenopsd` to capture the event spans:
`subscribe`/`settle`. It's nto straight forward to link them on the same
trace. For now the only way they are connnected is having the same
`message.id` attribute.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Intruments the functions in `xapi_xenops.ml` that carry the
traceparent/tracecontext through `context`.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
This function is called recursively and the context span is only closed
on failure. This makes it hard to read the trace.

Therefore, I am resetting the context tracing each recursion step and now
each recursion step will have its own trace.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
events_from_xenopsd thread is critical as it sync up VM status in case
of bootstorm, this thread is flood as lots of events comes from xenopsd
waiting for process.

During processing of the events, VM/VDI/VBD update_allowed_operations
will be called to refresh the allowed operations.

However, for each ops (start/suspend,etc) for the same object(VM), the
object info is always the same no matter what the ops is. Thus, it is
not necessary to query the object information over and over again.

Disclosure is used to resovle the issue. Query once and the disclosure
will just remember the query result.

The performance test for starting 500 VM on 4 hosts improve around 10%
performance for both XS8 and XS9
1. WLB request will raise `wlb_authentication_failed` when catching
   `Http_client.Http_error`. But actually only error code 401 and 403
   should raise this type exception. For other error code, raise
   `wlb_connection_reset`. Also print the detail error code and
   message.
2. `message_forwarding` raises same error for `Http_request_rejected`
   and `Connection_reset` so we don't know which exception actually
   be raised. Print detailed logs for these 2 exceptions.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
…xenops.ml` (#6527)

As promised, more tracing instrumentation around message switch in
xenopsd.

There is also some instrumentation of `Events_from_xenops` around
waiting/waking up for events. Currently these are not straight forward
linkable as they have different traces, e.g. one is on the vm op and the
other in the xapi event watch thread.

It is now possible to search for for "`subscribe`/`settle` <queue_name>"
spans that have the same attribute value for "messaging.message.id".
This should help debugging.

![image](https://github.com/user-attachments/assets/4a408e97-cab0-4380-bf3a-fd5c3619e126)

![image](https://github.com/user-attachments/assets/f6620cc5-82a0-4d1e-a3ea-6088e79e25e2)
"data/updated" is not read or used anywhere in xenopsd or xapi:
* xapi_guest_agent's last_updated field is just Unix.gettimeofday ().
* xapi_xenops removes "data/updated" from the guest agent state altogether before
checking if it's changed:
```
let ignored_keys = ["data/meminfo_free"; "data/updated"; "data/update_cnt"]
```

So there is no need to watch this path at all.

This greatly reduces unnecessary traffic between xapi and xenopsd, since any VM
with a guest agent would write to data/updated once every 60 seconds, which
would generate a Dynamic.Vm event, making xapi call xenopsd's VM.stat to rescan
the domain's xenstore tree and perform several hypercalls. Almost always, this
would be completely unnecessary as nothing else about the VM would change, but
a lot of work would be done anyhow.

Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
1. WLB request will raise `wlb_authentication_failed` when catching
`Http_client.Http_error`. But actually only error code 401 and 403
should raise this type exception. For other error code, raise
`wlb_connection_reset`. Also print the detail error code and message.
2. `message_forwarding` raises same error for `Http_request_rejected`
and `Connection_reset` so we don't know which exception actually be
raised. Print detailed logs for these 2 exceptions.
Drop the first member of the (X.t * X.state) tuple coming from X.stat
immediately as it's never used. This removes the need for several 'snd info',
'Option.iter (fun (_, state) ->) info' constructs.

Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
Rename the "u" function used across storage and observer_helpers to make
its meaning more obvious and use __FUNCTION__ instead of hardcoding the
function name.

Signed-off-by: Steven Woods <steven.woods@cloud.com>
Some refactoring, but the most important commit is this one-line change
that greatly improves things on its own:

---

xenopsd: Remove `data/updated` from the list of watched paths

"data/updated" is not read or used anywhere in xenopsd or xapi:
* xapi_guest_agent's last_updated field is just Unix.gettimeofday ().
* xapi_xenops removes "data/updated" from the guest agent state
altogether before
checking if it's changed:

```
let ignored_keys = ["data/meminfo_free"; "data/updated"; "data/update_cnt"]
```

So there is no need to watch this path at all.

This greatly reduces unnecessary traffic between xapi and xenopsd, since
any VM
with a guest agent would write to data/updated once every 60 seconds,
which
would generate a Dynamic.Vm event, making xapi call xenopsd's VM.stat to
rescan
the domain's xenstore tree and perform several hypercalls. Almost
always, this
would be completely unnecessary as nothing else about the VM would
change, but
a lot of work would be done anyhow.
Rename the "u" function used across storage and observer_helpers to make
its meaning more obvious and use __FUNCTION__ instead of hardcoding the
function name.
The `ref` parameter within the `location` attribute of the console can
refer to either the VM's ref or the console's own ref. Currently, the
console's location uses the VM's ref by default.

This causes an issue: when executing xs console, the requested
location contains the VM's ref. If the ref points to the VM, xapi will
attempt to use the RFB console (which is graphical) by default, rather
than the VT100 console (which is text-based). As a result, the xs
console command fails to open the console and hangs.

**Solution:**
Update the default ref in the console's `location` to the console's
own ref. With this change, whether accessing the RFB or the VT100
console, the ref in the `location` will always point to the respective
console itself.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Add details on specifying image format for VDI and VM migration.
In particular This revision explains how to choose the destination image
format during VDI creation and migration, including VM migration
scenarios.
Also fixes minor typos in the document.

Signed-off-by: Guillaume <guillaume.thouvenin@vates.tech>
When shutdown a VM, xapi will check SM API version to decide if to
call `post_deactivate_hook`. But if the SR has already been unplugged,
the checking will fail.

Solution:
Check if the dp of the SR still exists. If not, skip the SM API
checking.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Add details on specifying image format for VDI and VM migration. In
particular This revision explains how to choose the destination image
format during VDI creation and migration, including VM migration
scenarios.
Also fixes minor typos in the document.
This reverts commit 9e6fb15

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
On xen versions that don't support this call yet, xenctrlext will simply fail
and continue to behave like before.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This interface is not yet available in xen, so fail before doing the hypercall.
This patch is meant to be reverted on system that provide the new interface for
easily test it.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
These commits allow to minimize the differences between the versions
that use a xen with the NUMA-enabled claim_pages, and the ones that are
not.

Now it's a single patch with 4 lines of code:
- 2 lines to change the C binding for domain_claim_pages
(ocaml/xenopsd/c_stubs/xenctrlext_stubs.c)
- 2 lines to remove the raising of exception to try to use the C binding
with a single NUMA node (ocaml/xenopsd/xc/xenctrlext.ml)
The `ref` parameter within the `location` attribute of the console can
refer to either the VM's ref or the console's own ref. Currently, the
console's location uses the VM's ref by default.

This causes an issue: when executing xe console, the requested location
contains the VM's ref. If the ref points to the VM, xapi will attempt to
use the RFB console (which is graphical) by default, rather than the
VT100 console (which is text-based). As a result, the xs console command
fails to open the console and hangs.

**Solution:**
Update the default ref in the console's `location` to the console's own
ref. With this change, whether accessing the RFB or the VT100 console,
the ref in the `location` will always point to the respective console
itself.
Most of these have been unused for almost 10 years since c4ccc56
("CA-217842: Replace instances of vm_lacks_feature_x with vm_lacks_feature")

Drop them.

Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
Most of these have been unused for almost 10 years since c4ccc56
("CA-217842: Replace instances of vm_lacks_feature_x with
vm_lacks_feature")

Drop them.
When shutdown a VM, xapi will check SM API version to decide if to call
`post_deactivate_hook`. But if the SR has already been unplugged, the
checking will fail.

Solution:
Check if the plugin of the SR still exists. If not, skip the SM API
checking.
@psafont
Copy link
Member

psafont commented Jun 27, 2025

I think it's worth repeating again, I don't think it's worth maintaining a long-lived branch for what is effectively 4 commits. I say this because there are conflicts in the merge

@mg12
Copy link
Member

mg12 commented Jun 27, 2025

after the discussions we had in the last couple of weeks, I think the changes are simple enough for runnable_any and runnable_vcpus that they can be merged directly to master without using this feature branch feature/numa8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.