This repository was archived by the owner on Feb 7, 2023. It is now read-only.

tree: initial changes to support Fedora STI/STR #368

Merged
merged 4 commits into projectatomic:master
Apr 2, 2018

Conversation

miabbott
Collaborator

These two commits include the initial changes to try to support the Fedora Standard Test Interface and Standard Test Roles.

The first commit makes small changes to the improved-sanity-test so that it can be invoked by the Fedora STI.

The second commit changes the methodology of the reboot role to handle the way the STI boots up an Atomic Host as a test subject.

There are more details in the individual commit messages and I encourage you to read them.

(I'm not 100% confident about getting these merged in, so marking it as WIP for now)

See the following for some details:

https://fedoraproject.org/wiki/CI/Tests
https://fedoraproject.org/wiki/CI/Standard_Test_Roles

These changes allow the `improved-sanity-test` playbook to be invoked
using the same pattern as described in the docs above.

The notable change is the value of `hosts:`, which will try to find a
target under the `subjects` group (which is how the inventory file is
built via the STI).  Defaulting to `all` retains backward
compatibility with the original way of running the playbook.
Additional details:

https://fedoraproject.org/wiki/CI/Tests
https://fedoraproject.org/wiki/CI/Standard_Test_Roles
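
The exact diff isn't quoted here, but as a rough sketch of the general
technique (the `target_hosts` extra-var name below is made up for
illustration, not taken from the playbook), a templated `hosts:` value
with a default of `all` keeps the old invocation working while letting
an STI-style wrapper point the play at the `subjects` group:

    - name: Improved Sanity Test
      # Sketch only: a wrapper can pass -e target_hosts=subjects, while a
      # plain `ansible-playbook main.yml` run still targets "all".
      hosts: "{{ target_hosts | default('all') }}"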

When running any of the 'a-h-t' tests against an 'atomic' test subject
as defined by the Standard Test Interface, the harness boots a VM
using `qemu` and forwards port 22 to a non-standard port (typically
I've seen 2222).  The side effect of this is that the non-standard
port stays up and open until the VM is torn down; it does *not* go
down when the host is rebooted.

This behavior invalidates our original methodology of waiting for the
port to go down and then come back up.

I borrowed the idea of checking the `bootid` from the `rpm-ostree`
tests.  I also had to add in the ability to configure the `delay` on
the `wait_for` because occasionally the VM booted by the STI wouldn't
be ready within the default 30s.  Thankfully, this approach is
backwards compatible with the previous ways we have run the playbooks.

I'm still not 100% certain this is going to work all the time, but it
is a step in that direction.
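
For reference, a rough sketch of what that ends up looking like (task
wording and retry counts here are illustrative, not the role's exact
tasks; `real_ansible_host`, `timeout`, and `wf_delay` are facts set at
the top of the role):

    - name: record the boot_id before rebooting
      command: cat /proc/sys/kernel/random/boot_id
      register: before_boot_id

    - name: reboot the host
      shell: sleep 2 && systemctl reboot
      async: 60
      poll: 0
      ignore_errors: true

    - name: wait for ssh on the (possibly forwarded) port
      wait_for:
        host: "{{ real_ansible_host }}"
        port: "{{ ansible_port | default(22) }}"
        delay: "{{ wf_delay }}"
        timeout: "{{ timeout }}"
      delegate_to: localhost
      become: false

    - name: confirm the boot_id actually changed
      command: cat /proc/sys/kernel/random/boot_id
      register: after_boot_id
      until: after_boot_id.stdout != before_boot_id.stdout
      retries: 6
      delay: 10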
@miabbott miabbott added the wip label Mar 23, 2018
@miabbott
Collaborator Author

If you want to test this, this was my approach:

  1. On your workstation, install oci-kvm-hook. This will allow you to boot VMs from within a container.
  2. Run a Fedora 27 container (see the example command below this list).
  3. In the container, install the following:

# dnf -y install ansible git python2-dnf libselinux-python qemu rsync standard-test-roles

  4. Grab a copy of the latest Fedora AH qcow2 - https://getfedora.org/atomic_qcow2_latest
  5. Set up env vars:

# export ANSIBLE_INVENTORY=$(test -e inventory && echo inventory || echo /usr/share/ansible/inventory)
# export TEST_SUBJECTS=/path/to/Fedora-Atomic-27-20180314.0.x86_64.qcow2

  6. Check out the PR
  7. Run the playbook:

# ansible-playbook -v --tags atomic tests/improved-sanity-test/main.yml -e cli_wf_delay=60
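
The container invocation itself isn't spelled out above; for step 2, a minimal sketch, assuming docker with oci-kvm-hook on the host and the repo checkout bind-mounted into the container:

# docker run -it --rm -v $(pwd):/workdir:z registry.fedoraproject.org/fedora:27 /bin/bash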

@mike-nguyen
Collaborator

Nice work so far. It worked for me

@mgahagan73

It is working for me. I see boot/console logs for the atomic host under test in artifacts/, so I think the last remaining piece is to get the rest of the test logs under artifacts/ as well.

@mgahagan73

I think we can add a tests.yml like the one below as well.

[root@amd-pike-01 atomic-host-tests]# less tests.yml

- include: tests/improved-sanity-test/main.yml
  environment:
    TEST_ARTIFACTS: ./artifacts
    AHT_RESULT_FILE: "{{ TEST_ARTIFACTS }}/improved-sanity-test.log"
    cli_wf_delay: 60
  tags:
    - atomic
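
For reference, the harness would then be pointed at the wrapper roughly the same way as in the Fedora CI docs linked earlier (with ANSIBLE_INVENTORY and TEST_SUBJECTS exported as in the steps above):

# ansible-playbook --tags=atomic tests.yml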

@miabbott
Collaborator Author

miabbott commented Mar 28, 2018

@mgahagan73 I've been experimenting with that tests.yml file, but it doesn't seem like the environment settings are getting passed up to the include.

I'm thinking I'll have to make further changes to improved-sanity-test/main.yml itself to use those variables.
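
For illustration, one pattern I've seen in the standard-test-roles examples is to read those values from the controller's environment inside the playbook itself, roughly like this (the variable names here are just examples):

    vars:
      artifacts: "{{ lookup('env', 'TEST_ARTIFACTS') | default('./artifacts', true) }}"
      result_file: "{{ lookup('env', 'AHT_RESULT_FILE') | default(artifacts + '/improved-sanity-test.log', true) }}"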

@mgahagan73

Through more testing I did find I had to declare cli_wf_delay as a var, not an environment variable, to avoid failures due to the VM not being reachable. After my test runs I'm seeing that artifacts/ gets created with the boot log for the atomic host and the failure log is generated; my failure log gets created as an empty file when everything passes. This is all with the ansible 2.4 rpm shipped with F27, so maybe we are seeing differences between versions. I know the README mentions 2.2 is the version the atomic host playbooks were developed for.

[root@amd-pike-01 ~]# ansible --version
ansible 2.4.3.0
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.14 (default, Mar 14 2018, 13:36:31) [GCC 7.3.1 20180303 (Red Hat 7.3.1-5)]
[root@amd-pike-01 ~]# rpm -q ansible
ansible-2.4.3.0-1.fc27.noarch

@miabbott
Collaborator Author

@mgahagan73 I'm running the playbook from inside a container and it has ansible-2.4.3.0-1.fc27.noarch installed.

I was seeing the log from the VM in the artifacts directory, but wasn't seeing any test log.

I might just merge these changes as is and then iterate towards the tests/tests.yml paradigm in a future PR.

@mgahagan73

I'm only seeing a test log when I get a failure. Setting cli_wf_delay to a low value seems to be a pretty easy way to induce a failure, which gave me a test log based on whatever AHT_RESULT_FILE is set to. I can't seem to redefine the artifacts variable either in the playbook or on the ansible command line, which I suspect is an issue with standard-test-roles.

become: false

# I'm not sure the retries are even necessary, but I'm keeping them in
Member

I started coding this a bit before too, but I think the problem is that we actually need to retry both the ssh connection and the boot_id check. The way to think of this is: it was still racy before the STR approach, since we were relying on the shutdown check actually finding the ssh port closed, but (if the guest rebooted fast enough) it could always appear open. So we need to loop over both tasks. This led me to https://stackoverflow.com/questions/30785281/one-loop-over-multiple-ansible-tasks

Collaborator Author

@cgwalters Did you have to do this in any of the ostree tests you've been working on?

What's the idea for the implementation? Break out those tasks into a separate file and loop over the include of the file?
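
i.e., a rough sketch of that pattern (file and fact names made up here; `before_boot_id` registered before the reboot, as now):

    # main task list: retry the whole "ssh up + boot_id changed" check
    - include: reboot_check.yml
      with_sequence: count=10

    # reboot_check.yml: becomes a no-op once the reboot has been confirmed
    # (unreachable-host handling omitted for brevity)
    - block:
        - name: wait for ssh on the forwarded port
          wait_for:
            host: "{{ real_ansible_host }}"
            port: "{{ ansible_port | default(22) }}"
            timeout: 30
          delegate_to: localhost
          become: false
          ignore_errors: true

        - name: fetch the current boot_id
          command: cat /proc/sys/kernel/random/boot_id
          register: current_boot_id
          ignore_errors: true

        - name: mark the reboot as confirmed once the boot_id changes
          set_fact:
            reboot_confirmed: true
          when:
            - current_boot_id.stdout is defined
            - current_boot_id.stdout != before_boot_id.stdout
      when: reboot_confirmed is not defined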

Collaborator Author

@mike-nguyen found this post, which discusses the problem of rebooting a Vagrant box whose SSH port is forwarded to a non-standard port (much like the STI does):

https://blog.jeffli.me/blog/2017/05/19/a-pitfall-of-vagrant-when-using-ansibles-wait-for-module/

TL;DR - use search_regex with the wait_for module

I'm going to try to play with this approach sometime today.

@mike-nguyen
Collaborator

mike-nguyen commented Mar 28, 2018

> I'm only seeing a test log when I get a failure. Setting cli_wf_delay to a low value seems to be a pretty easy way to induce a failure which gave me a test log based on whatever AHT_RESULT_FILE is set to. I can't seem to redefine the artifacts variable from either in the playbook or on the ansible command line which I suspect to be an issue with standard-test-roles.

The AHT_RESULT_FILE only captures the task/role that failed. It does not capture/contain the output of the test run.

@miabbott
Collaborator Author

> The AHT_RESULT_FILE only captures the task/role that failed. It does not capture/contain the output of the test run.

Doh! Embarrassing when you forget how your own tests work...

So there is going to be some additional effort required to capture the entire log. Not sure if we want to do it natively as part of the callback plugin or just wrap the invocation of ansible-playbook in a shell script to capture the stdout. Any thoughts here?

Like I said before, I think I'd prefer to get this PR merged and iterate in other PRs towards a more full-fledged adoption of the STI.
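
(For the shell-script option, even something as simple as this would capture the full run; the log path is only an example:)

# ansible-playbook -v --tags atomic tests/improved-sanity-test/main.yml 2>&1 | tee artifacts/improved-sanity-test.log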

@mike-nguyen found this post that talks about a similar problem we are
having with hosts started via the Fedora STI, specifically when the
standard SSH port is forwarded to a non-standard port:

https://blog.jeffli.me/blog/2017/05/19/a-pitfall-of-vagrant-when-using-ansibles-wait-for-module/

Using the `search_regex` option of `wait_for` proved to be useful in
this scenario, as we can leave the default delay of 30s intact and
don't have to bump it up (which would effectively just sleep the whole
execution for that amount of time).
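
In other words, the relevant task ends up looking roughly like this
(the regex just needs to match sshd's banner):

    - name: wait for sshd to answer on the forwarded port
      wait_for:
        host: "{{ real_ansible_host }}"
        port: "{{ ansible_port | default(22) }}"
        search_regex: OpenSSH
        timeout: "{{ timeout }}"
      delegate_to: localhost
      become: false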
@miabbott miabbott removed the wip label Apr 2, 2018
@miabbott miabbott changed the title [WIP] tree: initial changes to support Fedora STI/STR tree: initial changes to support Fedora STI/STR Apr 2, 2018
@miabbott
Collaborator Author

miabbott commented Apr 2, 2018

The last commit I made appears to have improved the reliability of the reboot role, so I'm lifting the WIP label for this set of changes. As stated previously, I'll work on additional changes required to support the STI.

@@ -17,6 +17,21 @@
- set_fact:
    real_ansible_host: "{{ ansible_host }}"
    timeout: "{{ cli_reboot_timeout | default('120') }}"
    wf_delay: "{{ cli_wf_delay | default('30') }}"
Collaborator

I just tested this with no cli_wf_delay passed in and it worked nicely! Is there still a need to be able to pass in a wf_delay with the search_regex?

Collaborator Author

Aside: I actually tested with cli_wf_delay=10 and it still worked well, so the regex is proving useful.

I'm not opposed to dropping the option to configure the delay.

@mike-nguyen
Collaborator

LGTM. As planned, I will merge this and then subsequent STI work can be in another PR.

@mike-nguyen mike-nguyen merged commit 401d786 into projectatomic:master Apr 2, 2018
mike-nguyen added a commit to mike-nguyen/atomic-host-tests that referenced this pull request May 22, 2018
PR projectatomic#368 added a check for a boot_id to confirm reboots in the reboot
role.  This caused a race condition in the rpm_ostree_install and
rpm_ostree_uninstall roles when using the reboot flag.

The reboot role has an option to not perform a reboot and just check
that the system comes down and back up.  This was leveraged by the
rpm_ostree_install and rpm_ostree_uninstall roles to fire the
respective commands with the -r flag.  The -r flag causes a reboot to
occur when the command executes, but the command must be run
asynchronously (it won't return when the system goes down and would
cause Ansible to fail).  When rpm_ostree_install/rpm_ostree_uninstall
is called with the -r flag, it calls the reboot role.  If the reboot
role can execute before the reboot occurs, it will successfully
execute.  If the system goes down before the reboot role can grab the
boot_id, it will fail.

This PR modifies the reboot_flag test to not use the reboot role.  It
copies most of the logic from the reboot role into the test itself.
Since this was a corner case for the -r flag, I felt like it was an
appropriate exception to not re-use the role.
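
(For context, the asynchronous fire-and-forget pattern the -r flag needs looks roughly like this; the package variable is illustrative:)

    - name: install a package and reboot via the -r flag
      command: rpm-ostree install {{ package }} -r
      async: 60
      poll: 0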
miabbott pushed a commit that referenced this pull request May 22, 2018