tree: initial changes to support Fedora STI/STR #368

miabbott · 2018-03-23T21:26:23Z

These two commits include the initial changes to try to support the Fedora Standard Test Interface and Standard Test Roles.

The first commit makes small changes to the improved-sanity-test so that it can be invoked by the Fedora STI.

The second commit changes the methodology of the reboot role to handle the way the STI boots up an Atomic Host as a test subject.

There are more details in the individual commit messages and I encourage you to read them.

(I'm not 100% confident about getting these merged in, so marking it as WIP for now)

See the following for some details: https://fedoraproject.org/wiki/CI/Tests https://fedoraproject.org/wiki/CI/Standard_Test_Roles These changes allow the `improved-sanity-test` playbook to be invoked using the same pattern as described in the docs above. The notable change is the value to `hosts:` which will try to find a target under the `subjects` group (which is how the inventory file is built via the STI). But defaulting to `all` retains backward compatibility with the original way of running the playbook.

Additional details: https://fedoraproject.org/wiki/CI/Tests https://fedoraproject.org/wiki/CI/Standard_Test_Roles When running any of the 'a-h-t' tests against an 'atomic' test subject as defined by the Standard Test Interface, the harness will boot a VM using `qemu` and fowards port 22 to a non-standard port (typically I've seen 2222). The side-effect of this is that the non-standard port is always up and open, unless the VM has been torn down. This does *not* include when the host is rebooted. This behavior causes our original methodology of waiting for the port to go down and then come back up to be invalid. I borrowed the idea of checking the `bootid` from the `rpm-ostree` tests. I also had to add in the ability to configure the `delay` on the `wait_for` because occasionally the VM booted by the STI wouldn't be ready within the default 30s. Thankfully, this approach is backwards compatible with the previous ways we have run the playbooks. I'm still not 100% certain this is going to work all the time, but it is a step in that direction.

miabbott · 2018-03-26T13:22:03Z

If you want to test this, this was my approach:

On your workstation, install oci-kvm-hook. This will allow you to boot VMs from within a container.
Run a Fedora 27 container
In the container, install the following:

# dnf -y install ansible git python2-dnf libselinux-python qemu rsync standard-test-roles

Grab a copy of the latest Fedora AH qcow2 - https://getfedora.org/atomic_qcow2_latest
Setup env vars:

# export ANSIBLE_INVENTORY=$(test -e inventory && echo inventory || echo /usr/share/ansible/inventory)
# export TEST_SUBJECTS=/path/to/Fedora-Atomic-27-20180314.0.x86_64.qcow2

Checkout the PR
Run the playbook:

# ansible-playbook -v --tags atomic tests/improved-sanity-test/main.yml -e cli_wf_delay=60

mike-nguyen · 2018-03-26T19:17:09Z

Nice work so far. It worked for me

mgahagan73 · 2018-03-27T15:34:21Z

It is working for me, I see boot/console logs for the atomic host under test under artifacts/ so I think the last remaining piece is to get the rest of the test logs under artifacts.

mgahagan73 · 2018-03-28T03:26:24Z

I think we can add a tests.yml like the one below as well.

root@amd-pike-01 atomic-host-tests]# less tests.yml

include: tests/improved-sanity-test/main.yml
environment:
TEST_ARTIFACTS: ./artifacts
AHT_RESULT_FILE: "{{ TEST_ARTIFACTS }}/improved-sanity-test.log"
cli_wf_delay: 60
tags:
atomic

miabbott · 2018-03-28T16:13:48Z

@mgahagan73 I've been experimenting with that tests.yml file, but it doesn't seem like the environment settings are getting passed up to the include.

I'm thinking I'll have to make further changes to improved-sanity-test/main.yml itself to use those variables.

mgahagan73 · 2018-03-28T16:31:12Z

I did find through more testing I had to declare cli_wf_delay as a var and not an environment variable to avoid failures due to the VM not being reachable. After my test runs I'm seeing the artifacts/ gets created with the boot log for the atomic host and the failure log being generated. My failure log gets created as an empty file when everything passes. This is all with the ansible 2.4 rpm shipped with F27 so maybe we are seeing differences between versions. I know the README mentions 2.2 is the version the atomic host playbooks were developed for.

[root@amd-pike-01 ~]# ansible --version
ansible 2.4.3.0
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.14 (default, Mar 14 2018, 13:36:31) [GCC 7.3.1 20180303 (Red Hat 7.3.1-5)]
[root@amd-pike-01 ~]# rpm -q ansible
ansible-2.4.3.0-1.fc27.noarch

miabbott · 2018-03-28T18:42:12Z

@mgahagan73 I'm running the playbook from inside a container and it has ansible-2.4.3.0-1.fc27.noarch installed.

I was seeing the log from the VM in the artifacts directory, but wasn't seeing any test log.

I might just merge these changes as is and then iterate towards the tests/tests.yml paradigm in a future PR.

mgahagan73 · 2018-03-28T19:11:59Z

I'm only seeing a test log when I get a failure. Setting cli_wf_delay to a low value seems to be a pretty easy way to induce a failure which gave me a test log based on whatever AHT_RESULT_FILE is set to. I can't seem to redefine the artifacts variable from either in the playbook or on the ansible command line which I suspect to be an issue with standard-test-roles.

cgwalters · 2018-03-28T19:15:19Z

roles/reboot/tasks/main.yml

  become: false

+# I'm not sure the retries are even necessary, but I'm keeping them in


I started coding this too a bit before but I think the problem is we actually need to retry both the ssh connection and the boot_id check. The way to think of this is: it was still racy before the STR approach since we were relying on the shutdown actually finding the ssh port closed, but we could (if the guest rebooted fast enough) have it always appear open. So we need to loop over both tasks. This led me to https://stackoverflow.com/questions/30785281/one-loop-over-multiple-ansible-tasks

@cgwalters Did you have to do this in any of the ostree tests you've been working on?

What's the idea for the implementation? Break out those tasks into a separate file and loop over the include of the file?

@mike-nguyen found this post which discusses the problem of rebooting a Vagrant box and the SSH port is forwarded to a non-standard port (much like the STI does):

https://blog.jeffli.me/blog/2017/05/19/a-pitfall-of-vagrant-when-using-ansibles-wait-for-module/

TL;DR - use search_regex with the wait_for module

I'm going to try to play with this approach sometime today.

mike-nguyen · 2018-03-28T19:50:16Z

I'm only seeing a test log when I get a failure. Setting cli_wf_delay to a low value seems to be a pretty easy way to induce a failure which gave me a test log based on whatever AHT_RESULT_FILE is set to. I can't seem to redefine the artifacts variable from either in the playbook or on the ansible command line which I suspect to be an issue with standard-test-roles.

The AHT_RESULT_FILE only captures the task/role that failed. It does not capture/contain the output of the test run.

miabbott · 2018-03-28T20:09:14Z

The AHT_RESULT_FILE only captures the task/role that failed. It does not contain contain the output of the test run.

Doh! Embarrassing when you forget how your own tests work...

So there is going to be some additional effort required to capture the entire log. Not sure if we want to do it natively as part of the callback plugin or just wrap the invocation of ansible_playbook in a shell script to capture the stdout. Any thoughts here?

Like I said before, I think I'd prefer to get this PR merged and iterate in other PRs towards a more full-fledged adoption of the STI.

@mike-nguyen

@mike-nguyen found this post that talks about a similar problem we are having with hosts started via the Fedora STI, specifically when the standard SSH port is forwarded to a non-standard port: https://blog.jeffli.me/blog/2017/05/19/a-pitfall-of-vagrant-when-using-ansibles-wait-for-module/ Using the `search_regex` function of `wait_for` proved to be useful in this scenario, as we can leave the default delay of 30s intact and don't have to bump that up. (Which is effectively just sleeping the whole execution for that amount of time.)

miabbott · 2018-04-02T17:20:47Z

The last commit I made appears to have improved the reliability of the reboot role, so I'm lifting the WIP label for this set of changes. As stated previously, I'll work on additional changes required to support the STI.

mike-nguyen · 2018-04-02T17:24:19Z

roles/reboot/tasks/main.yml

@@ -17,6 +17,21 @@
 - set_fact:
    real_ansible_host: "{{ ansible_host }}"
    timeout: "{{ cli_reboot_timeout | default('120') }}"
+    wf_delay: "{{ cli_wf_delay | default('30') }}"


I just tested this with the no cli_wf_delay passed in and it worked nicely! Is there still a need to be able to pass in a wf_delay with the regex_search?

Aside: I actually tested with cli_wf_delay=10 and it still worked well, so the regex is proving useful.

I'm not opposed to dropping out the option to configure the delay.

mike-nguyen · 2018-04-02T18:44:33Z

LGTM, As planned I will merge this and then subsequent STI work can be in another PR.

PR projectatomic#368 added a check for a boot_id to confirm reboots in the reboot role. This caused a race condition in the rpm_ostree_install and rpm_ostree_uninstall roles when using the reboot flag. The reboot role has an option to not perform a reboot and just check that the system comes down and back up. This was leverage by rpm_ostree_install and rpm_ostree_uninstall roles to fire the the respective commands with the -r flag. The -r flag causes a reboot to occur when the command executes but requires to be run asychronously (the command won't return when the system goes down and will cause Ansible to fail). When rpm_ostree_install/rpm_ostree_uninstall was called with the -r flag it calls the reboot role. If the reboot role can execute before the reboot occurs, it will successfully execute. If the system goes down before the reboot role can grab the boot_id, it will fail. This PR modifies the reboot_flag test to not use the reboot role. It copies most of the logic from the reboot role into the test itself. Since this was a corner case for the -r flag, I felt like it was an appropriate exception to not re-use the role.

PR #368 added a check for a boot_id to confirm reboots in the reboot role. This caused a race condition in the rpm_ostree_install and rpm_ostree_uninstall roles when using the reboot flag. The reboot role has an option to not perform a reboot and just check that the system comes down and back up. This was leverage by rpm_ostree_install and rpm_ostree_uninstall roles to fire the the respective commands with the -r flag. The -r flag causes a reboot to occur when the command executes but requires to be run asychronously (the command won't return when the system goes down and will cause Ansible to fail). When rpm_ostree_install/rpm_ostree_uninstall was called with the -r flag it calls the reboot role. If the reboot role can execute before the reboot occurs, it will successfully execute. If the system goes down before the reboot role can grab the boot_id, it will fail. This PR modifies the reboot_flag test to not use the reboot role. It copies most of the logic from the reboot role into the test itself. Since this was a corner case for the -r flag, I felt like it was an appropriate exception to not re-use the role.

miabbott added 2 commits March 23, 2018 20:54

miabbott added the wip label Mar 23, 2018

cgwalters reviewed Mar 28, 2018

View reviewed changes

miabbott removed the wip label Apr 2, 2018

miabbott changed the title ~~[WIP] tree: initial changes to support Fedora STI/STR~~ tree: initial changes to support Fedora STI/STR Apr 2, 2018

mike-nguyen reviewed Apr 2, 2018

View reviewed changes

fixup! reboot: use search_regex to improve resiliency

13181ac

mike-nguyen merged commit 401d786 into projectatomic:master Apr 2, 2018

		become: false

		# I'm not sure the retries are even necessary, but I'm keeping them in

tree: initial changes to support Fedora STI/STR #368

tree: initial changes to support Fedora STI/STR #368

Uh oh!

Conversation

miabbott commented Mar 23, 2018

Uh oh!

miabbott commented Mar 26, 2018

Uh oh!

mike-nguyen commented Mar 26, 2018

Uh oh!

mgahagan73 commented Mar 27, 2018

Uh oh!

mgahagan73 commented Mar 28, 2018

Uh oh!

miabbott commented Mar 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgahagan73 commented Mar 28, 2018

Uh oh!

miabbott commented Mar 28, 2018

Uh oh!

mgahagan73 commented Mar 28, 2018

Uh oh!

cgwalters Mar 28, 2018

Choose a reason for hiding this comment

Uh oh!

miabbott Mar 28, 2018

Choose a reason for hiding this comment

Uh oh!

miabbott Mar 29, 2018

Choose a reason for hiding this comment

Uh oh!

mike-nguyen commented Mar 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

miabbott commented Mar 28, 2018

Uh oh!

miabbott commented Apr 2, 2018

Uh oh!

mike-nguyen Apr 2, 2018

Choose a reason for hiding this comment

Uh oh!

miabbott Apr 2, 2018

Choose a reason for hiding this comment

Uh oh!

mike-nguyen commented Apr 2, 2018

Uh oh!

Uh oh!

miabbott commented Mar 28, 2018 •

edited

Loading

mike-nguyen commented Mar 28, 2018 •

edited

Loading