tree: initial changes to support Fedora STI/STR #368
Conversation
See the following for some details: https://fedoraproject.org/wiki/CI/Tests https://fedoraproject.org/wiki/CI/Standard_Test_Roles These changes allow the `improved-sanity-test` playbook to be invoked using the same pattern as described in the docs above. The notable change is the value of `hosts:`, which will try to find a target under the `subjects` group (which is how the inventory file is built by the STI). Defaulting to `all` retains backward compatibility with the original way of running the playbook.
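For illustration, the inventory the STI tooling builds places the booted test machine in a `subjects` group, roughly shaped like the sketch below (host name, address, port, and user are placeholders; the real inventory is generated dynamically rather than written as a file like this):

```yaml
# Illustrative shape only: the STI-provided inventory exposes a 'subjects'
# group containing the booted test machine. All values are placeholders.
subjects:
  hosts:
    atomic-test-subject:
      ansible_host: 127.0.0.3
      ansible_port: 2222
      ansible_user: root
```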
Additional details: https://fedoraproject.org/wiki/CI/Tests https://fedoraproject.org/wiki/CI/Standard_Test_Roles When running any of the 'a-h-t' tests against an 'atomic' test subject as defined by the Standard Test Interface, the harness boots a VM using `qemu` and forwards port 22 to a non-standard port (typically I've seen 2222). The side effect of this is that the non-standard port is always up and open unless the VM has been torn down; this does *not* change when the host is rebooted. This behavior invalidates our original methodology of waiting for the port to go down and then come back up. I borrowed the idea of checking the `bootid` from the `rpm-ostree` tests. I also had to add the ability to configure the `delay` on the `wait_for`, because occasionally the VM booted by the STI wouldn't be ready within the default 30s. Thankfully, this approach is backwards compatible with the previous ways we have run the playbooks. I'm still not 100% certain this is going to work all the time, but it is a step in that direction.
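A minimal sketch of that boot-id idea (task names and variable names here are assumptions, not the exact tasks from the role): record the boot id before the reboot, then poll until the host reports a different one, since the forwarded port never appears to go down.

```yaml
# Sketch only, not the exact tasks from the reboot role.
- name: capture the current boot id
  command: cat /proc/sys/kernel/random/boot_id
  register: pre_reboot_boot_id

- name: reboot the host without waiting for the command to return
  shell: sleep 2 && systemctl reboot
  async: 1
  poll: 0
  become: true

- name: wait until the host reports a different boot id
  command: cat /proc/sys/kernel/random/boot_id
  register: post_reboot_boot_id
  until: post_reboot_boot_id.stdout != pre_reboot_boot_id.stdout
  retries: 30
  delay: 10
  # Note: if the connection itself drops mid-check, this retry alone may not
  # cover it; the review comments below discuss that remaining race.
```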
If you want to test this, this was my approach:
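Roughly, the invocation pattern documented on the wiki pages linked above looks like the following (the image path is a placeholder, and the exact commands used for this run may have differed):

```sh
# Roughly the documented STI invocation pattern; the qcow2 path is a placeholder.
export TEST_SUBJECTS=/path/to/Fedora-Atomic.qcow2
export TEST_ARTIFACTS=$PWD/artifacts
export ANSIBLE_INVENTORY=/usr/share/ansible/inventory
ansible-playbook --tags=atomic tests.yml
```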
Nice work so far. It worked for me.
It is working for me. I see boot/console logs for the atomic host under test in artifacts/, so I think the last remaining piece is to get the rest of the test logs under artifacts/.
I think we can add a tests.yml like the one below as well.

[root@amd-pike-01 atomic-host-tests]# less tests.yml
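A minimal sketch of such a tests.yml, assuming it simply wraps the existing playbook (the path below is a guess at the repo layout, not the actual file from this comment):

```yaml
# Sketch only: a top-level tests.yml that the STI tooling could invoke.
# The playbook path below is a guess at the repository layout.
- import_playbook: tests/improved-sanity-test/main.yml
```

With something along those lines in place, the STI tooling can discover and run the suite with its usual ansible-playbook invocation.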
@mgahagan73 I've been experimenting with that; I'm thinking I'll have to make further changes to support it.
I did find through more testing that I had to declare cli_wf_delay as a var and not an environment variable to avoid failures due to the VM not being reachable. After my test runs, I'm seeing that artifacts/ gets created with the boot log for the atomic host and the failure log being generated. My failure log gets created as an empty file when everything passes. This is all with the ansible 2.4 rpm shipped with F27, so maybe we are seeing differences between versions. I know the README mentions 2.2 is the version the atomic host playbooks were developed for.

[root@amd-pike-01 ~]# ansible --version
@mgahagan73 I'm running the playbook from inside a container, and it has a different Ansible version than the F27 rpm. I was seeing the log from the VM in the artifacts directory, but wasn't seeing any test log. I might just merge these changes as-is and then iterate towards fuller STI support.
I'm only seeing a test log when I get a failure. Setting cli_wf_delay to a low value seems to be a pretty easy way to induce a failure, which gave me a test log based on whatever AHT_RESULT_FILE is set to. I can't seem to redefine the artifacts variable from either the playbook or the ansible command line, which I suspect is an issue with standard-test-roles.
roles/reboot/tasks/main.yml
Outdated
become: false
# I'm not sure the retries are even necessary, but I'm keeping them in
I started coding this a bit before too, but I think the problem is we actually need to retry both the ssh connection and the boot_id check. The way to think of this is: it was still racy before the STR approach, since we were relying on the shutdown check actually finding the ssh port closed, but we could (if the guest rebooted fast enough) have it always appear open. So we need to loop over both tasks. This led me to https://stackoverflow.com/questions/30785281/one-loop-over-multiple-ansible-tasks
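Roughly the shape of the idea from that Stack Overflow thread (the file name, attempt count, and guard variable are assumptions): move both tasks into one file, guard them with a "done" flag, and include that file in a loop so the pair gets retried as a unit.

```yaml
# Sketch of the loop-over-an-include idea. 'confirm_reboot.yml' is a
# hypothetical file holding both the wait_for task and the boot id check,
# each guarded by "when: not (reboot_confirmed | default(false))" and
# setting reboot_confirmed: true once the boot id has changed.
- name: retry the ssh check and the boot id check as one unit
  include_tasks: confirm_reboot.yml
  with_sequence: count=10
```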
@cgwalters Did you have to do this in any of the ostree tests you've been working on?
What's the idea for the implementation? Break out those tasks into a separate file and loop over the include of the file?
@mike-nguyen found this post, which discusses the problem of rebooting a Vagrant box when the SSH port is forwarded to a non-standard port (much like the STI does):
https://blog.jeffli.me/blog/2017/05/19/a-pitfall-of-vagrant-when-using-ansibles-wait-for-module/
TL;DR: use `search_regex` with the `wait_for` module.
I'm going to try to play with this approach sometime today.
The AHT_RESULT_FILE only captures the task/role that failed. It does not capture/contain the output of the test run.
Doh! Embarrassing when you forget how your own tests work... So there is going to be some additional effort required to capture the entire log. Not sure if we want to do it natively as part of the callback plugin or just wrap the invocation of the playbook. Like I said before, I think I'd prefer to get this PR merged and iterate in other PRs towards a more full-fledged adoption of the STI.
@mike-nguyen found this post that talks about a problem similar to the one we are having with hosts started via the Fedora STI, specifically when the standard SSH port is forwarded to a non-standard port: https://blog.jeffli.me/blog/2017/05/19/a-pitfall-of-vagrant-when-using-ansibles-wait-for-module/ Using the `search_regex` option of `wait_for` proved useful in this scenario, as we can leave the default delay of 30s intact and don't have to bump it up (which would effectively just sleep the whole execution for that amount of time).
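A sketch of that kind of task in the reboot role's terms (`real_ansible_port` is an assumed variable; `real_ansible_host` and `timeout` come from the set_fact shown in the diff further down): waiting for the SSH banner means the task only returns once sshd inside the guest is actually answering, rather than as soon as qemu accepts a connection on the forwarded port.

```yaml
# Sketch: wait for the SSH banner rather than just an open port, since qemu
# keeps the forwarded port open even while the guest is still booting.
# 'real_ansible_port' is an assumed variable name.
- name: wait for sshd to answer on the (possibly forwarded) port
  wait_for:
    host: "{{ real_ansible_host }}"
    port: "{{ real_ansible_port | default(22) }}"
    search_regex: OpenSSH
    timeout: "{{ timeout }}"
  delegate_to: localhost
  become: false
```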
The last commit I made appears to have improved the reliability of the reboot role.
roles/reboot/tasks/main.yml
Outdated
@@ -17,6 +17,21 @@
- set_fact:
    real_ansible_host: "{{ ansible_host }}"
    timeout: "{{ cli_reboot_timeout | default('120') }}"
    wf_delay: "{{ cli_wf_delay | default('30') }}"
I just tested this with no cli_wf_delay passed in and it worked nicely! Is there still a need to be able to pass in a wf_delay now that we have the search_regex?
Aside: I actually tested with cli_wf_delay=10 and it still worked well, so the regex is proving useful.
I'm not opposed to dropping the option to configure the delay.
LGTM. As planned, I will merge this and then subsequent STI work can be done in another PR.
PR #368 added a check for a boot_id to confirm reboots in the reboot role. This caused a race condition in the rpm_ostree_install and rpm_ostree_uninstall roles when using the reboot flag. The reboot role has an option to not perform a reboot and just check that the system comes down and back up. This was leveraged by the rpm_ostree_install and rpm_ostree_uninstall roles to fire the respective commands with the -r flag. The -r flag causes a reboot to occur when the command executes, but the command has to be run asynchronously (it won't return when the system goes down, which would cause Ansible to fail). When rpm_ostree_install/rpm_ostree_uninstall is called with the -r flag, it calls the reboot role. If the reboot role can execute before the reboot occurs, it will succeed. If the system goes down before the reboot role can grab the boot_id, it will fail. This PR modifies the reboot_flag test to not use the reboot role. It copies most of the logic from the reboot role into the test itself. Since this was a corner case for the -r flag, I felt it was an appropriate exception to not re-use the role.
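For reference, a sketch of the asynchronous pattern the -r flag requires (the package variable name is a placeholder): the command triggers the reboot itself and never returns normally, so Ansible must not wait for it.

```yaml
# Sketch: fire rpm-ostree with its reboot flag without waiting for a return
# code, since the host goes down before the command can exit cleanly.
# 'package_to_install' is a placeholder variable name.
- name: install the package and reboot in one step
  command: rpm-ostree install -r {{ package_to_install }}
  async: 60
  poll: 0
  become: true
```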
These two commits include the initial changes to try to support the Fedora Standard Test Interface and Standard Test Roles.
The first commit makes small changes to the `improved-sanity-test` so that it can be invoked by the Fedora STI. The second commit changes the methodology of the `reboot` role to handle the way the STI boots up an Atomic Host as a test subject.

There are more details in the individual commit messages and I encourage you to read them.
(I'm not 100% confident about getting these merged in, so marking it as WIP for now)