Update documentation for deploying new GPU FYI test bots.

Worked with zmo@ to develop this documentation along with the example CLs. Also added top-level links to the GPU docs and cleaned up some other now-incorrect documentation. More cleanups are still needed. Bug: 838970 Tbr: zmo@chromium.org Change-Id: I40e07adf864974a645490627f5a055ce686237a3 Reviewed-on: https://chromium-review.googlesource.com/1041273 Commit-Queue: Kenneth Russell <kbr@chromium.org> Reviewed-by: Kenneth Russell <kbr@chromium.org> Cr-Commit-Position: refs/heads/master@{#555661}
marcosholgado · May 3, 2018 · 9618add · 9618add
1 parent 60347d6
commit 9618add
Show file tree

Hide file tree

Showing 2 changed files with 127 additions and 98 deletions.
diff --git a/docs/README.md b/docs/README.md
@@ -163,6 +163,16 @@ used when committed.
     built, with associated desciptions.
 *   [IPC Fuzzer](ipc_fuzzer.md) - Fuzz testing of Chromium IPC interfaces.
 
+### GPU-related docs
+*   [GPU Pixel Wrangling](gpu/pixel_wrangling.md) - Instructions for GPU
+    pixel wrangling (the GPU sheriffing rotation).
+*   [Debugging GPU related code](gpu/debugging_gpu_related_code.md) - Hints for
+    debugging GPU- and graphics-related code.
+*   [GPU Testing](gpu/gpu_testing.md) - Description of Chromium's GPU testing
+    infrastructure.
+*   [GPU Bot Details](gpu/gpu_testing_bot_details.md) - In-depth description of
+    how the bots are maintained.
+
 ### Misc Linux-Specific Docs
 *   [Linux Proxy Config](linux_proxy_config.md) - Network proxy sources on Linux
 *   [Debugging SSL on Linux](linux_debugging_ssl.md) - Tips on debugging SSL

diff --git a/docs/gpu/gpu_testing_bot_details.md b/docs/gpu/gpu_testing_bot_details.md
@@ -1,6 +1,6 @@
 # GPU Bot Details
 
-This PAGE describes in detail how the GPU bots are set up, which files affect
+This page describes in detail how the GPU bots are set up, which files affect
 their configuration, and how to both modify their behavior and add new bots.
 
 [TOC]
@@ -20,15 +20,16 @@ browser.
 The GPU bots are utilized on the [chromium.gpu] and [chromium.gpu.fyi]
 waterfalls, and various tryservers, as described in [Using the GPU Bots].
 
-[chromium.gpu]: https://build.chromium.org/p/chromium.gpu/console
-[chromium.gpu.fyi]: https://build.chromium.org/p/chromium.gpu.fyi/console
+[chromium.gpu]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console
+[chromium.gpu.fyi]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console
 [Using the GPU Bots]: gpu_testing.md#Using-the-GPU-Bots
 
-The vast majority of the hardware for the bots lives in the Chrome-GPU Swarming
-pool. The waterfall bots are simply virtual machines which spawn Swarming tasks
-with the appropriate tags to get them to run on the desired GPU and operating
-system type. So, for example, the [Win10 Release (NVIDIA)] bot is actually a
-virtual machine which spawns all of its jobs with the Swarming parameters:
+All of the physical hardware for the bots lives in the Swarming pool, and most
+of it in the Chrome-GPU Swarming pool. The waterfall bots are simply virtual
+machines which spawn Swarming tasks with the appropriate tags to get them to run
+on the desired GPU and operating system type. So, for example, the [Win10
+Release (NVIDIA)] bot is actually a virtual machine which spawns all of its jobs
+with the Swarming parameters:
 
 [Win10 Release (NVIDIA)]: https://ci.chromium.org/buildbot/chromium.gpu/Win10%20Release%20%28NVIDIA%29/?limit=200
 
@@ -239,44 +240,7 @@ This is described in [Adding new tests to the GPU bots].
 
 ### How to add a new bot
 
-The first decision point when adding a new GPU bot is whether it is a one-off
-piece of hardware, or one which is expected to be scaled up at some point. If
-it's a one-off piece of hardware, it can be added to the chromium.gpu.fyi
-waterfall as a non-swarmed test machine. If it's expected to be scaled up at
-some point, the hardware should be added to the swarming pool. These two
-scenarios are described in more detail below.
-
-#### How to add a new, non-swarmed, physical bot to the chromium.gpu.fyi waterfall
-
-1.  Work with the Chrome Infrastructure Labs team to get the hardware deployed
-    so it can talk to the chromium.gpu.fyi master.
-1.  Create a CL in the build workspace which:
-    1.  Add the new machine to
-        [`masters/master.chromium.gpu.fyi/builders.pyl`][master.chromium.gpu.fyi/builders.pyl].
-    1.  Add the new machine to
-        [`scripts/slave/recipe_modules/chromium_tests/chromium_gpu_fyi.py`][chromium_gpu_fyi.py].
-        Set the `enable_swarming` property to `False`.
-    1.  Retrain recipe expectations
-        (`scripts/slave/recipes.py --use-bootstrap test train`) and add the
-        newly created JSON file(s) corresponding to the new machines to your CL.
-1.  Create a CL in the Chromium workspace to:
-    1.  Add the new machine to
-        [`src/content/test/gpu/generate_buildbot_json.py`][generate_buildbot_json.py].
-        Make sure to set the `swarming` property to `False`.
-    1.  If the machine runs GN, add a description to
-        [`src/tools/mb/mb_config.pyl`][mb_config.pyl].
-1.  Once the build workspace CL lands, use go/bug-a-trooper (or contact kbr@)
-    to schedule a restart of the chromium.gpu.fyi waterfall. This is only
-    necessary when modifying files under the masters/ directory. A reboot of
-    the machine may be needed once the waterfall has been restarted in order to
-    make it connect properly.
-1.  The CLs from (2) and (3) can land in either order, though it is preferable
-    to land the Chromium-side CL first so that the machine knows what tests to
-    run the first time it boots up.
-
-[master.chromium.gpu.fyi/builders.pyl]: https://chromium.googlesource.com/chromium/tools/build/+/master/masters/master.chromium.gpu.fyi/builders.pyl
-
-#### How to add a new swarmed bot to the chromium.gpu.fyi waterfall
+#### How to add a new tester bot to the chromium.gpu.fyi waterfall
 
 When deploying a new GPU configuration, it should be added to the
 chromium.gpu.fyi waterfall first. The chromium.gpu waterfall should be reserved
@@ -290,37 +254,63 @@ In order to add Release and Debug waterfall bots for a new configuration,
 experience has shown that at least 4 physical machines are needed in the
 swarming pool. The reason is that the tests all run in parallel on the Swarming
 cluster, so the load induced on the swarming bots is higher than it would be
-for a non-swarmed bot that executes its tests serially.
+if the tests were run strictly serially.
 
-With these prerequisites, these are the steps to add a new swarmed bot.
-(Actually, pair of bots -- Release and Debug.)
+With these prerequisites, these are the steps to add a new (swarmed) tester bot.
+(Actually, pair of bots -- Release and Debug. If deploying just one or the
+other, ignore the other configuration.) These instructions assume that you are
+reusing one of the existing builders, like [`GPU FYI Win Builder`][GPU FYI Win
+Builder].
 
 1.  Work with the Chrome Infrastructure Labs team to get the (minimum 4)
     physical machines added to the Swarming pool. Use
     [chromium-swarm.appspot.com] or `src/tools/swarming_client/swarming.py bots`
     to determine the PCI IDs of the GPUs in the bots. (These instructions will
     need to be updated for Android bots which don't have PCI buses.)
+
     1.  Make sure to add these new machines to the Chrome-GPU Swarming pool by
         creating a CL against [`configs/chromium-swarm/bots.cfg`][bots.cfg] in
-        the [infradata/config] workspace.
+        the [infradata/config] (Google internal) workspace. Here is an [example
+        CL](https://chrome-internal-review.googlesource.com/524420).
+
 1.  File a Chrome Infrastructure Labs ticket requesting 2 virtual machines for
     the testers. These need to match the OS of the physical machines and
     builders because of limitations in the scripts which transfer builds from
     the builder to the tester; see [this feature
     request](http://crbug.com/581953). For example, if you're adding a "Windows
     7 CoolNewGPUType" tester, you'll need 2 Windows VMs.
-1.  Once the VMs are ready, create a CL in the build workspace which:
-    1.  Adds the new VMs as the Release and Debug bots in
-        [`master.chromium.gpu.fyi/builders.pyl`][master.chromium.gpu.fyi/builders.pyl].
-    1.  Adds the new VMs to [`chromium_gpu_fyi.py`][chromium_gpu_fyi.py]. Make
-        sure to set the `enable_swarming` and `serialize_tests` properties to
-        `True`. Double-check the `parent_buildername` property for each. It
-        must match the Release/Debug flavor of the builder.
-    1.  Retrain recipe expectations
-        (`scripts/slave/recipes.py --use-bootstrap test train`) and add the
-        newly created JSON file(s) corresponding to the new machines to your CL.
-1.  Create a CL in the Chromium workspace which:
-    1.  Adds the new machine to
+
+1.  Once the VMs are ready, create a CL in the
+    [`infradata/config`][infradata/config] (Google internal) workspace which
+    does the following. Here's an [example
+    CL](https://chrome-internal-review.googlesource.com/619497).
+    1.  Adds two new "bot_group" blocks in the Chromium GPU FYI section of
+        [`configs/chromium-swarm/bots.cfg`][bots.cfg], one for the Release bot
+        and one for the Debug bot. Copy the closest configuration you can find
+        -- for example, Windows, Android, etc.
+    1.  Get this reviewed and landed. This step associates the VM with the bot's
+        name on the waterfall.
+
+1. Create a CL in the [`tools/build`][tools/build] workspace which does the
+   following. Here's an [example CL](https://chromium-review.googlesource.com/1041145).
+    1.  Adds the new VMs to [`chromium_gpu_fyi.py`][chromium_gpu_fyi.py] in
+        `scripts/slave/recipe_modules/chromium_tests/`. Make sure to set the
+        `serialize_tests` property to `True`. This is specified for waterfall
+        bots, but not trybots, and helps avoid overloading the physical
+        hardware. Double-check the `BUILD_CONFIG` and `parent_buildername`
+        properties for each. They must match the Release/Debug flavor of the
+        builder, like `GPU FYI Win Builder` vs. `GPU FYI Win Builder (dbg)`.
+    1.  Get this reviewed and landed. This step tells the Chromium recipe about
+        the newly-deployed waterfall bot. In the next CL we will generate the
+        JSON files which run tests on the bot.
+    1.  It used to be necessary to retrain recipe expectations
+        (`scripts/slave/recipes.py --use-bootstrap test train`). This doesn't
+        appear to be necessary any more, but it's something to watch out for if
+        your CL fails presubmit for some reason.
+
+1.  Create a CL in the Chromium workspace which does the following. Here's an
+    [example CL](https://chromium-review.googlesource.com/1041164).
+    1.  Adds the new machines to
         `src/content/test/gpu/generate_buildbot_json.py`.
         1.  The swarming dimensions are crucial. These must match the GPU and
             OS type of the physical hardware in the Swarming pool. This is what
@@ -339,19 +329,42 @@ With these prerequisites, these are the steps to add a new swarmed bot.
             `Windows-2008ServerR2-SP1` (the Win7-like flavor running in our
             data center). Similarly, the Win8 bots had to have a very precise
             OS description (`Windows-2012ServerR2-SP0`).
-    1.  If the machine runs GN, adds a description to
-        [`src/tools/mb/mb_config.pyl`][mb_config.pyl].
-1.  Once the tools/build CL lands, use go/bug-a-trooper (or contact kbr@) to
-    schedule a restart of the chromium.gpu.fyi waterfall. This is only
-    necessary when modifying files under the masters/ directory. A reboot of
-    the VMs may be needed once the waterfall has been restarted in order to
-    make them connect properly.
-1.  The CLs from (3) and (4) can land in either order, though it is preferable
-    to land the Chromium-side CL first so that the machine knows what tests to
-    run the first time it boots up.
-
-[bots.cfg]: https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/bots.cfg
-[infradata/config]: https://chrome-internal.googlesource.com/infradata/config/
+        1.  If you're deploying a new bot that's similar to another existing
+            configuration, please search around in the file for references to
+            the other bot's name and see if your new bot needs to be added to
+            any exclusion lists. For example, some of the tests don't run on
+            certain Win bots because of missing OpenGL extensions.
+        1.  Run this script to regenerate
+            `src/testing/buildbot/chromium.gpu.fyi.json`.
+    1. Updates [`cr-buildbucket.cfg`][cr-buildbucket.cfg]:
+        *   Add the two new machines (Release and Debug) inside the
+            luci.chromium.ci bucket. This sets up storage for the builds in the
+            system. Use the appropriate mixin; for example, "win-gpu-fyi-ci" has
+            already been set up for Windows GPU FYI bots on the waterfall.
+    1. Updates [`luci-scheduler.cfg`][luci-scheduler.cfg]:
+        *   Add new "job" blocks for your new Release and Debug test bots. They
+            should go underneath the builder which triggers them (like "GPU Win
+            FYI Builder"), in alphabetical order. Make sure the "id" and
+            "builer" entries match. This job block should use the acl_sets
+            "triggered-by-parent-builders", because it's triggered by the
+            builder, and not by changes to the git repository.
+    1. Updates [`luci-milo.cfg`][luci-milo.cfg]:
+        *   Add new "builders" blocks for your new testers (Release and Debug)
+            on the [`chromium.gpu.fyi`][chromium.gpu.fyi] console. Look at the
+            short names and categories and try to come up with a reasonable
+            organization.
+    1.  If you were adding a new builder, you would need to also add the new
+        machine to [`src/tools/mb/mb_config.pyl`][mb_config.pyl].
+1. After the Chromium-side CL lands it will take some time for all of the
+   configuration changes to be picked up by the system. The bot might be in a
+   purple state until that's happened, but it will recover.
+
+[bots.cfg]:            https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/bots.cfg
+[infradata/config]:    https://chrome-internal.googlesource.com/infradata/config/
+[cr-buildbucket.cfg]:  https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/cr-buildbucket.cfg
+[luci-milo.cfg]:  https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/luci-milo.cfg
+[luci-scheduler.cfg]:  https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/luci-scheduler.cfg
+[GPU FYI Win Builder]: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/GPU%20FYI%20Win%20Builder
 
 #### How to start running tests on a new GPU type on an existing try bot
 
@@ -444,34 +457,40 @@ to run the new driver on one of the waterfalls for a day or two to make sure
 the tests are running reliably green before rolling out the driver update
 everywhere. To do this:
 
-1.  Work with the Chrome Infrastructure Labs team to deploy a single,
-    non-swarmed, physical machine on the chromium.gpu.fyi waterfall running the
-    new driver. The OS and GPU should exactly match the configuration you
-    intend to upgrade. See
-    [How to add a new, non-swarmed, physical bot to the chromium.gpu.fyi waterfall](#How-to-add-a-new_non-swarmed_physical-bot-to-the-chromium_gpu_fyi-waterfall).
-2.  Hopefully, the new machine will pass the pixel tests. If it doesn't, then
+1.  Make sure that all of the current Swarming jobs for this OS and GPU
+    configuration are targeted at the "stable" version of the driver in
+    `src/testing/gpu/generate_buildbot_json.py`.
+1.  File a `Build Infrastructure` bug, component `Infra>Labs`, to have ~4 of the
+    physical machines already in the Swarming pool upgraded to the new version
+    of the driver.
+1.  If an "experimental" version of this bot doesn't yet exist, follow the
+    instructions above for [How to add a new tester bot to the chromium.gpu.fyi
+    waterfall](#How-to-add-a-new-tester-bot-to-the-chromium_gpu_fyi-waterfall)
+    to deploy one.
+1.  Have this experimental bot target the new version of the driver in
+    `src/testing/gpu/generate_buildbot_json.py`.
+1.  Hopefully, the new machine will pass the pixel tests. If it doesn't, then
     unfortunately, it'll be necessary to follow the instructions on
     [updating the pixel tests] to temporarily suppress the failures on this
     particular configuration. Keep the time window for these test suppressions
     as narrow as possible.
-3.  Watch the new machine for a day or two to make sure it's stable.
-4.  When it is, ask the Chrome Infrastructure Labs team to roll out the driver
-    update across all of the similarly configured bots in the swarming pool.
-5.  If necessary, update pixel test expectations and remove the suppressions
+1.  Watch the new machine for a day or two to make sure it's stable.
+1.  When it is, update `src/testing/gpu/generate_buildbot_json.py` to use the
+    "gpu trigger script" functionality to select *either* the stable *or* the
+    new driver version on the stable version of the bot. See [this
+    CL](https://chromium-review.googlesource.com/882344) for an example, though
+    that CL was targeting a different OS version rather than driver version.
+1.  After that lands, ask the Chrome Infrastructure Labs team to roll out the
+    driver update across all of the similarly configured bots in the swarming
+    pool.
+1.  If necessary, update pixel test expectations and remove the suppressions
     added above.
-6.  Prepare and land a CL removing the temporary machine from the
-    chromium.gpu.fyi waterfall. Request a waterfall restart.
-7.  File a ticket with the Chrome Infrastructure Labs team to reclaim the
-    temporary machine.
-
-Note that with recent improvements to Swarming, in particular [this
-RFE](https://github.com/luci/luci-py/issues/253) and others, these steps are no
-longer strictly necessary – it's possible to target Swarming jobs at a
-particular driver version. If
-[`generate_buildbot_json.py`][generate_buildbot_json.py] were improved to be
-more specific about the driver version on the various bots, then the machines
-with the new drivers could simply be added to the Swarming pool, and this
-process could be a lot simpler. Patches welcome. :)
+1.  Remove the alternate swarming dimensions for the stable bot from
+    `generate_buildbot_json.py`, locking it to the new driver version.
+
+Note that we leave the experimental bot in place. We could reclaim it, but it
+seems worthwhile to continuously test the "next" version of graphics drivers as
+well as the current stable ones.
 
 [updating the pixel tests]: https://www.chromium.org/developers/testing/gpu-testing/#TOC-Updating-and-Adding-New-Pixel-Tests-to-the-GPU-Bots