@liulinC liulinC commented May 9, 2024

No description provided.

minglumlu and others added 30 commits February 27, 2024 19:12
This reverts commit a53e54d.

Signed-off-by: Ming Lu <ming.lu@cloud.com>
This reverts commit 54039f3.

Signed-off-by: Ming Lu <ming.lu@cloud.com>
This reverts commit db91ddf.

Signed-off-by: Ming Lu <ming.lu@cloud.com>
This should only report errors on lines that are changed in a PR, and not block merges for pre-existing bugs.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…389206

CA-389206: Revert more changes in CLI protocol
This allows handling serialization of keys as well as states in
server_interface and the write cache

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This enables xapi-guard to decouple persistence of TPM contents from the xapi
service being online. That is, when xapi is down, the contents of the TPMs are
written to disk, and when xapi is back online the contents are uploaded.

This is needed to protect VMs while xapi is being restarted, usually as part of
an update.

Some properties of the cache:
- The cache is bypassed whenever possible, and is only used as a fallback
  after a write fails.
- The cache is handled by a thread that writes to cache and one that reads from
  it. They communicate through a bounded queue.
- Whenever a TPM content is written to disk, previous versions of it are
  deleted. This helps the reading thread to catch up.
- When the queue has been filled the writer stops adding elements to the queue,
  and the reader will try to flush the queue, and after it will try to flush
  the cache. After this happens both threads will transition to cache bypass
  operation.
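
The properties above can be sketched as a single-threaded toy model. This is not xapi-guard's code: `WriteCache`, `upload`, and the in-memory `on_disk` dict (standing in for files on disk) are hypothetical names, and the real implementation uses two threads rather than an explicit `flush` call.

```python
import queue


class WriteCache:
    """Toy model: try a direct write first, fall back to a bounded queue."""

    def __init__(self, upload, capacity=4):
        self.upload = upload  # hypothetical: pushes contents to xapi, may raise OSError
        self.pending = queue.Queue(maxsize=capacity)  # bounded writer -> reader queue
        self.on_disk = {}  # stands in for the disk: uuid -> latest contents
        self.bypass = True  # start in cache-bypass mode

    def write(self, uuid, contents):
        if self.bypass:
            try:
                self.upload(uuid, contents)
                return "direct"
            except OSError:
                self.bypass = False  # a write failed: fall back to the cache
        # Overwriting the entry drops the previous version for this uuid.
        self.on_disk[uuid] = contents
        try:
            self.pending.put_nowait(uuid)
        except queue.Full:
            pass  # queue full: stop adding, flush() will scan the disk instead
        return "cached"

    def flush(self):
        """Reader side: drain the queue, then whatever is left on disk."""
        while True:
            try:
                uuid = self.pending.get_nowait()
            except queue.Empty:
                break
            if uuid in self.on_disk:  # skip entries already superseded and sent
                self.upload(uuid, self.on_disk[uuid])
                del self.on_disk[uuid]
        for uuid in list(self.on_disk):
            self.upload(uuid, self.on_disk[uuid])
            del self.on_disk[uuid]
        self.bypass = True  # everything is uploaded: return to bypass mode
```

In this sketch an entry is deleted only after its upload succeeds, so a failure during `flush` leaves the remaining contents on disk for a later attempt.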

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This allows passing the UUID directly to the on-disk cache that will be
introduced

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This allows using the persistence function from outside the callback, which
will be useful to thread into the on-disk cache

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Now the process creates a thread to read from disk and push vtpm events to xapi
at its own pace, and integrates the disk-writing part into the callback of the
deprivileged sockets.

Special consideration was taken for the resume, when the deprivileged sockets
and the write-to-cache function need to be integrated in a different way from
the codepath that creates the sockets from the message-switch server.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Because timestamps come from a monotonic clock that depends on boot, files
need to be renamed to ensure that future writes have higher timestamps, are
considered newer, and are uploaded to xapi.

On top of this, it allows reporting remnant temporary files, deleting invalid
files and removing empty directories.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This is needed to be able to disable the disk cache completely, maintaining
previous behaviour if needed.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This is done through the fist point.

Xapi_fist is not used directly because it would need a new opam package,
creating a lot of churn which is currently unwanted.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Exposed GFS2_CAPACITY in the known message types (for the purpose of …
Now all domains' vtpm read requests go through the cache. The read function is
the same as before.

There is no change in behaviour

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This is intended as the start of a new directory structure. Splitting
the python3-only scripts into a new directory gives a simple way to
exclude them from python2 tests and coverage.

Signed-off-by: Steven Woods <steven.woods@citrix.com>
For domains requesting the TPM's contents, xapi-guard returns the contents
in the cache, if they are available from in-flight requests. It falls back to
xapi if that is not possible.

The cache doesn't try to provide any availability for reads, like it does for
writes. This means that if swtpm issues a read request while xapi is offline,
the request will fail, as it did before this change.
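
A minimal sketch of that read path, assuming the in-flight contents are held in a dict keyed by UUID. `read_tpm`, `in_flight` and `read_from_xapi` are hypothetical names, not xapi-guard's API.

```python
def read_tpm(uuid, in_flight, read_from_xapi):
    """Serve a read from in-flight cached contents, else fall back to xapi."""
    if uuid in in_flight:
        return in_flight[uuid]
    # No availability guarantee for reads: if xapi is offline this call
    # raises and the error reaches the caller, as it did before the cache.
    return read_from_xapi(uuid)
```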

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Previously, they were sorted by string order, which in rare cases might lead to
erroneous ordering
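
To illustrate the kind of misordering described, here is a small sketch that assumes the sorted keys are decimal numbers stored as strings. The sample values are made up.

```python
names = ["9", "10", "100"]

# String order compares character by character, so "10" sorts before "9".
by_string = sorted(names)

# Sorting by numeric value gives the intended order.
by_value = sorted(names, key=int)
```

The two orderings only differ when the keys have different numbers of digits, which would explain why the problem shows up rarely.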

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
It is not supported in scheduled runs, and fails.

Use unique key for shellcheck group

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…-gardon

CA-383867: Add local disk cache library for xapi guard
Co-authored-by: Pau Ruiz Safont <psafont@users.noreply.github.com>
Signed-off-by: Török Edwin <edwintorok@users.noreply.github.com>
…info

Add 'threads_per_core' in 'Host.cpu_info'
`xapi_xenpos.ml` -> `xapi_xenops.ml`

Signed-off-by: Luca Zhang <feiya.zhang@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
robhoes and others added 28 commits April 29, 2024 13:39
…racing-export

Install xapi-tracing-export library
Clear all (stale) scheduled assignments for: VM, PCI, VGPU objects on
startup. We were missing the latter two.

Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
…/CA-392163

CA-392163 clear scheduled assignments on startup
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Instrument:

- `forkhelpers.ml`,
- `fecomms.ml`

to create spans around functions when a parent span is supplied as
`tracing`.
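
The idea of creating a span only when a parent span is supplied can be sketched as follows. This is an illustration in Python, not the OCaml `tracing` library's API: `with_span`, the dict-based span and the `sink` list are all hypothetical.

```python
import contextlib
import time


@contextlib.contextmanager
def with_span(name, parent, sink):
    """Record a child span around a block, but only if a parent is given."""
    if parent is None:
        # No parent span supplied: run the block without tracing.
        yield None
        return
    span = {"name": name, "parent": parent["name"], "start": time.monotonic()}
    try:
        yield span
    finally:
        # Close and export the span even if the block raises.
        span["end"] = time.monotonic()
        sink.append(span)
```

A caller would wrap the instrumented function body in `with with_span("name", parent, sink):`, so untraced calls pay almost nothing.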

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Currently the `observe` mode of the `tracing` library does not work
correctly, resulting in the logs being spammed by this warning.

Comment it out so that the logs do not become too big (for the time
being).

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
tests: Allow the alcotest_suite to run
The API call VM.set-has-vendor-device used to be a licensed feature but
it no longer is. As a first step to simplify when Windows VMs do or do not
automatically update device drivers, remove the license checks in
the code. The feature flag still remains in place.

Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
The VCustom value was only used for the has-vendor-device field in a VM
and contained code rather than a simple value. We are removing this.

Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
Remove the function and use VM.create_from_record directly.

Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
…/CP-48195-instrument-forkexecd-client

CP-48195: Instrument client side of `forkexecd`
…e build @check

Bytecode builds for `http_lib` are disabled due to '(modes best)',
and that means that anything that depends on it must have it disabled too to avoid this warning.

Avoids these kinds of warnings:
```
File "_none_", line 1:
Error: Module `Buf_io' is unavailable (required by `Http_svr')
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
…ries

There were 3 modules with conflicting names with compiler libraries: Watch,
Debuginfo and Stats. Debuginfo was renamed, the others' libraries were changed
to be wrapped.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Pinning the libraries runs dune subst, which needs a project name, so define it

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Do not output loglines that are part of the normal operation. Use debug for
them: they are not usually logged, but can be enabled if need be by changing
the loglevel

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
New version got released

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
The query on HTTP endpoint /updates will return the available updates in
JSON format. Prior to the changes in this commit, if a query arrives
when another query is being handled, a "GET_UPDATES_IN_PROGRESS" error
will be returned immediately. This behaviour is not friendly to the GUI
client XenCenter.

In this commit, the behaviour is changed to wait and retry in handling
the query in xapi since the "*_IN_PROGRESS" error is a transient
failure. Tolerating it on the xapi (server) side avoids error handling on
the client side.

With the change, the "GET_UPDATES_IN_PROGRESS" will not be an error
exposed to users any more. Therefore it is removed.
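
A minimal sketch of the wait-and-retry idea on the server side. The names `InProgress` and `with_retry`, the attempt count and the delay are assumptions for illustration, not values taken from xapi.

```python
import time


class InProgress(Exception):
    """Hypothetical stand-in for the transient "*_IN_PROGRESS" failure."""


def with_retry(f, attempts=5, delay=0.01, sleep=time.sleep):
    """Call f, waiting and retrying while it reports the transient failure."""
    for i in range(attempts):
        try:
            return f()
        except InProgress:
            if i == attempts - 1:
                raise  # give up after the last attempt
            sleep(delay)
```

Because the retry happens inside the handler, a client only sees the final result or, in this sketch, the error after all attempts are exhausted.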

Signed-off-by: Ming Lu <ming.lu@cloud.com>
…389319

CA-389319: Wait and retry for GET_UPDATES_IN_PROGRESS
As part of a start, resources are allocated for a VM in "scheduled_to.."
fields. These need to be cleared if the start fails. It turned out that
this was incomplete for PCI slots and those were leaking. This patch
tries to be more systematic about it.
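
The cleanup pattern can be sketched as below, assuming each object (VM, PCI, VGPU) is a dict whose reservation fields share a `scheduled_to` prefix. `start_with_cleanup`, `reserve` and `boot` are hypothetical names, and the real code operates on database records.

```python
def start_with_cleanup(records, reserve, boot):
    """Reserve resources, boot, and clear every reservation if that fails."""
    try:
        reserve(records)  # fills in the scheduled_to_* fields
        boot()
    except Exception:
        # Clear all scheduled_to_* fields on every record, not just the VM's,
        # so that no reservation leaks after a failed start.
        for record in records:
            for key in [k for k in record if k.startswith("scheduled_to")]:
                record[key] = None
        raise
```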

Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
Add a new field `cluster_stack_version` to the cluster datamodel to
track the version of corosync currently in use. This version will
always be set to 3. Also add logic to switch corosync binary and
associated library versions when a cluster is created, if needed.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
This is purely for testing purposes. Normal users are not allowed to create
a corosync2 cluster.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
@liulinC liulinC merged commit 1dc484b into liulinC:master May 9, 2024
liulinC pushed a commit that referenced this pull request May 23, 2024
Backport of 3b52b72

This enables PAM to be used in multithreaded mode (currently XAPI has a global lock around auth).

Using an off-cpu flamegraph I identified that concurrent PAM calls are slow due to a call to `sleep(1)`.
`pam_authenticate` calls `crypt_r` which calls `NSSLOW_Init` which on first use will try to initialize the just `dlopen`-ed library.
If it encounters a race condition it does a `sleep(1)`. This race condition can be quite reliably reproduced when performing a lot of PAM authentications from multiple threads in parallel.

GDB can also be used to confirm this by putting a breakpoint on `sleep`:
```
  #0  __sleep (seconds=seconds@entry=1) at ../sysdeps/unix/sysv/linux/sleep.c:42
  #1  0x00007ffff1548e22 in freebl_RunLoaderOnce () at lowhash_vector.c:122
  #2  0x00007ffff1548f31 in freebl_InitVector () at lowhash_vector.c:131
  #3  NSSLOW_Init () at lowhash_vector.c:148
  #4  0x00007ffff1b8f09a in __sha512_crypt_r (key=key@entry=0x7fffd8005a60 "pamtest-edvint", salt=0x7ffff31e17b8 "dIJbsXKc0",
  #5  0x00007ffff1b8d070 in __crypt_r (key=key@entry=0x7fffd8005a60 "pamtest-edvint", salt=<optimized out>,
  #6  0x00007ffff1dc9abc in verify_pwd_hash (p=p@entry=0x7fffd8005a60 "pamtest-edvint", hash=<optimized out>, nullok=nullok@entry=0) at passverify.c:111
  #7  0x00007ffff1dc9139 in _unix_verify_password (pamh=pamh@entry=0x7fffd8002910, name=0x7fffd8002ab0 "pamtest-edvint", p=0x7fffd8005a60 "pamtest-edvint", ctrl=ctrl@entry=8389156) at support.c:777
  #8  0x00007ffff1dc6556 in pam_sm_authenticate (pamh=0x7fffd8002910, flags=<optimized out>, argc=<optimized out>, argv=<optimized out>) at pam_unix_auth.c:178
  #9  0x00007ffff7bcef1a in _pam_dispatch_aux (use_cached_chain=<optimized out>, resumed=<optimized out>, h=<optimized out>, flags=1, pamh=0x7fffd8002910) at pam_dispatch.c:110
  #10 _pam_dispatch (pamh=pamh@entry=0x7fffd8002910, flags=1, choice=choice@entry=1) at pam_dispatch.c:426
  #11 0x00007ffff7bce7e0 in pam_authenticate (pamh=0x7fffd8002910, flags=flags@entry=1) at pam_auth.c:34
  #12 0x00000000005ae567 in XA_mh_authorize (username=username@entry=0x7fffd80028d0 "pamtest-edvint", password=password@entry=0x7fffd80028f0 "pamtest-edvint", error=error@entry=0x7ffff31e1be8) at xa_auth.c:83
  #13 0x00000000005adf20 in stub_XA_mh_authorize (username=<optimized out>, password=<optimized out>) at xa_auth_stubs.c:42
```

`pam_start` and `pam_end` don't help here, because on `pam_end` the library is `dlclose`-ed, so on the next `pam_authenticate` it will have to go through the initialization code again.
(This initialization code would have belonged in `pam_start`, not `pam_authenticate`, but there are several layers here including a call to `crypt_r`).
Upstream fixed this problem more than 5 years ago by switching to libxcrypt instead.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Christian Lindig <christian.lindig@cloud.com>