Skip to content

Retry instances not running add new Execution list endpoint #793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 19, 2025

Conversation

olethanh
Copy link
Collaborator

@olethanh olethanh commented Apr 10, 2025

Jira ticket: ALEPH-287

This PR improve how Instance are started, making it more reliable and properly cleaning when they fail to start. It also allow to start again the VM if it failed to start the first time.

Changes introduced by this PR:

  1. Decouple instance wait_for_init from allocation. It now return directly after starting the VM controller and run wait_for_init in the background. wait_for_init has been renamed wait_for_boot in VM
  2. Stop the controller when the Instance fail to start. Before if the instance failed to respond to ping the Instance was reported as not running and removed from the list but the Instance systemd controller was still. This tied up resource and prevented the VM from starting again. Also it caused issue with the network.
  3. Introduce a new endpoint /v2/about/executions/list which also include the starting, started, end times etc.. so we can inspect the VM state. and includ non running VM It will allow us in the future to improve the clients
  4. Fix VM inconsistant state calculation between the boot and the end of boot.
  5. Properly clean up ressource if the VM crash and we tried to start it again. Earlier if it was not is_running we just started a new one regardless if it was shutting down or crashed and the resource were not cleaned.
  6. Save debug token inside a protected file so we can still access it if the log rotated
  7. Improve the instances tests:
    • Reduce the problem of contamination of settings between tests
    • ensure we tests in a separat temp dir we don't reuse the cache. Before that the tests missed issue where ressource could not be downloaded because of a coding error but the resource was already cached
    • Better output and debug info

Self proofreading checklist

  • The new code clear, easy to read and well commented.
  • New code does not duplicate the functions of builtin or popular libraries.
  • An LLM was used to review the new code and look for simplifications.
  • New classes and functions contain docstrings explaining what they provide.
  • All new code is covered by relevant tests.
  • Documentation has been updated regarding these changes.
  • Dependencies update in the project.toml have been mirrored in the Debian package build script packaging/Makefile

Changes

In addition to what was listed above, others refactor changes:

  • in tests Setup webapp take the vm_pool argument
  • vm_pool don't take the loop anymore (not needed since Python 3.10)
  • Execution call .enable_and_start and .stop_and_disable, not VmPool
  • .enable_and_start is now async
  • Remove some unused code
  • Better debugging output

How to test

Allocate VM, try allocating VM that fail to start and reallocate again. Kill VM during boot

@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 9 times, most recently from 7694fb3 to f5d096f Compare April 15, 2025 08:41
Copy link

codecov bot commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 80.24194% with 49 lines in your changes missing coverage. Please review.

Project coverage is 65.33%. Comparing base (18b2fb8) to head (56bc4c3).
Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
src/aleph/vm/models.py 63.46% 15 Missing and 4 partials ⚠️
src/aleph/vm/orchestrator/run.py 0.00% 13 Missing ⚠️
src/aleph/vm/pool.py 54.54% 4 Missing and 1 partial ⚠️
src/aleph/vm/orchestrator/supervisor.py 42.85% 4 Missing ⚠️
tests/supervisor/test_qemu_instance.py 95.08% 2 Missing and 1 partial ⚠️
src/aleph/vm/orchestrator/cli.py 0.00% 2 Missing ⚠️
src/aleph/vm/hypervisors/firecracker/microvm.py 0.00% 1 Missing ⚠️
tests/supervisor/test_instance.py 92.85% 1 Missing ⚠️
tests/supervisor/views/test_operator.py 93.75% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #793      +/-   ##
==========================================
+ Coverage   64.88%   65.33%   +0.45%     
==========================================
  Files          79       79              
  Lines        7133     7220      +87     
  Branches      601      602       +1     
==========================================
+ Hits         4628     4717      +89     
- Misses       2304     2309       +5     
+ Partials      201      194       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch from f5d096f to f38053f Compare April 15, 2025 09:46
@olethanh olethanh marked this pull request as ready for review April 15, 2025 10:07
@olethanh
Copy link
Collaborator Author

I will rework the commit cleanly and update the desc but for me this is ready for review

@olethanh olethanh requested a review from nesitor April 15, 2025 10:07
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 2 times, most recently from 50c0f9b to 637df1a Compare April 15, 2025 12:54
@olethanh olethanh changed the title Ol aleph 287 retry instance Retry instances not running add new Execution list endpoint Apr 15, 2025
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 7 times, most recently from b4f1398 to d50c807 Compare April 22, 2025 13:48
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 2 times, most recently from e5ec35a to 9f1acdd Compare April 24, 2025 07:46
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch from 82b671e to db9d8b6 Compare May 19, 2025 09:41
olethanh added 16 commits May 19, 2025 12:03
Wait for boot, clean up  ressources if boot failed
Add a new executions list endpoint with more information on the state of the VM to be used by client
so we can better inform them on their instance state

This new endpoint list all executions in pool running or not
(but terminated VM are removed from the pool)
Handle VM that are stopping and starting before creating a new one
Prevent the case when the VM was not stopped when running execution.stop() directly instead of VmPool.stop_vm()
Simplify code, add warning
Change sig of enable_and_start to async

Adapt Firecracker instance test
- Rename to test_create_firecracker_instance
- Use mocker to patch() settings so it doesn't contamine other tests
- Ensure it ping properly to confirm it is working
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch from db9d8b6 to 56bc4c3 Compare May 19, 2025 10:04
@olethanh olethanh merged commit 8eebffa into main May 19, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants