layout | title | nav_order | parent |
---|---|---|---|
default | Taskcluster | 1 | Orchestrators |
Taskcluster is Mozilla's task execution framework. It powers Firefox CI and provides access to hybrid cloud workers (GCP or on-prem), which improves scalability and observability compared to Snakemake.
We use Taskcluster taskgraph to define the DAG (Directed Acyclic Graph) of the pipeline steps.
- Create a new branch in the git repo and push it. A separate branch lets you experiment with the code and keeps the caches from being invalidated if you need to restart training after new changes have landed in the main branch.
- Go to GitHub CI for the commit you want to run training for and find the Decision Task
- Go to CI and press "View task in Taskcluster". Make sure you are authenticated in the TC interface. Authentication is required to run tasks, but tasks that are already running can be viewed without it.
- In the TC interface, navigate to the parent Task Group
- Press "Train" in the 3-dot menu for actions
- Copy a config prepared in advance and press "train". See the example TC config here. You can find directions on how to configure training in the Model training guide.
- Look at the scheduled tasks. They should be visible under the Train action.
- Press any task. Here you can look at the logs and artifacts produced by the task.
- Navigate to the parent Task Group again (it is a different one from the Task Group of the Train action). Here you can see all the scheduled tasks in a more convenient interface with filtering.
Quite often you need to rerun the pipeline after making fixes or when a task fails.
It is possible to manually cancel a task with the Cancel task action.
After the fixes are implemented, push again and restart the pipeline with the same procedure as described in the "Running training" section.
Depending on the fixes, some steps might already be cached from the previous run. For example, if only a config setting that affects the last task was changed, or if nothing changed at all, the pipeline might restart from the failed or cancelled step.
Warning: even a slight refactoring of the upstream steps can invalidate the caches for the whole pipeline, so take care when changing upstream code while experimenting with the later stages of the pipeline.
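The caching behaviour above can be pictured with a small model. This is only an illustrative sketch, not the taskgraph implementation: it assumes a task's cache key is a hash of its own definition combined with the cache keys of its upstream tasks. The pipeline, the task names other than merge-corpus, and the helper functions are hypothetical.

```python
import hashlib

# Hypothetical three-step pipeline: each task lists its upstream dependencies.
PIPELINE = {
    "clean-corpus": [],
    "merge-corpus": ["clean-corpus"],
    "train": ["merge-corpus"],
}


def cache_digest(task, definitions, pipeline=PIPELINE):
    """Hash a task's own definition together with its upstream digests."""
    h = hashlib.sha256(definitions[task].encode("utf-8"))
    for dep in sorted(pipeline[task]):
        h.update(cache_digest(dep, definitions, pipeline).encode("ascii"))
    return h.hexdigest()


def reusable(old, new):
    """List the tasks whose digest is identical in both runs."""
    return [t for t in PIPELINE if cache_digest(t, old) == cache_digest(t, new)]


before = {"clean-corpus": "clean v1", "merge-corpus": "merge v1", "train": "epochs=10"}
# A change to the last step only: upstream digests are unchanged.
late_change = dict(before, train="epochs=20")
# A refactoring of the first step: every downstream digest changes too.
early_change = dict(before, **{"clean-corpus": "clean v2"})

print(reusable(before, late_change))   # ['clean-corpus', 'merge-corpus']
print(reusable(before, early_change))  # []
```

Under this assumption, a change to the last step leaves the earlier steps reusable, while a change to the first step leaves nothing to reuse.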
Change target-stage: all in the training config to a stage that corresponds to another TC step.
For example, to download, clean and merge the training corpus use:
target-stage: merge-corpus
that corresponds to stage: merge-corpus in /taskcluster/ci/merge-corpus/kind.yml:
tasks:
  merge-corpus:
    label: merge-corpus-{src_locale}-{trg_locale}
    description: merge corpus for {src_locale}-{trg_locale}
    attributes:
      dataset-category: train
      stage: merge-corpus
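As a rough mental model of what a target stage does, the sketch below selects the tasks whose stage attribute matches the target, plus everything upstream of them in the DAG. It is a simplification under stated assumptions and not the actual taskgraph selection code: the task labels, the stage names other than merge-corpus, and the function name are hypothetical.

```python
# Hypothetical task graph: label -> (stage attribute, upstream labels).
TASKS = {
    "dataset-en-ru": ("dataset", []),
    "clean-corpus-en-ru": ("clean-corpus", ["dataset-en-ru"]),
    "merge-corpus-en-ru": ("merge-corpus", ["clean-corpus-en-ru"]),
    "train-teacher-en-ru": ("train-teacher", ["merge-corpus-en-ru"]),
}


def tasks_for_target_stage(target_stage, tasks=TASKS):
    """Return the tasks matching the target stage plus everything upstream."""
    if target_stage == "all":
        return sorted(tasks)
    selected = set()
    # Start from the tasks whose stage attribute matches the target.
    stack = [label for label, (stage, _) in tasks.items() if stage == target_stage]
    while stack:
        label = stack.pop()
        if label not in selected:
            selected.add(label)
            # Walk the dependency edges back towards the start of the pipeline.
            stack.extend(tasks[label][1])
    return sorted(selected)


print(tasks_for_target_stage("merge-corpus"))
```

In this model, the downstream training task is left out when the target is merge-corpus, while "all" schedules every task.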
Taskcluster allows authorized users to run so-called interactive tasks. These tasks allow users to gain a shell in the same environment that a pipeline step runs in. This can often be useful for quicker debugging or testing of ideas.
To start an interactive task, follow these steps:
- Go to the task you want an interactive version of, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/DZvVQ-VUTPSyPBBS13Bwfg
- Click the "Edit" button in the three dots menu
- Click "Edit" on the modal that pops up
- Click the "Interactive" toggle in the top left
- Reduce the maxRunTime to a best guess at how long you'll need the task and worker running. (We pay for every minute a worker runs, so workers should not be left running unattended, e.g. overnight.)
- Adjust the payload to simply run bash and sleep (instead of a full pipeline step). For docker-worker tasks use something like:
command:
  - bash
  - '-c'
  - 'sleep 7200'
For generic-worker tasks (those needing a GPU), use:
command:
  - - bash
    - '-c'
    - 'sleep 7200'
(docker-worker tasks have an image section in the payload)
- Click "Create Task"
After a few minutes you should be able to get a shell (a link will show up in the tab when it's ready).
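The payload adjustment from the steps above can also be expressed as a small helper, for instance when preparing the edited task definition outside the UI. This is a minimal sketch, assuming the payload is available as a Python dictionary and that maxRunTime is given in seconds. The function name is hypothetical, and the check for an image section follows the rule of thumb given above.

```python
import copy


def make_sleep_payload(payload, seconds=7200):
    """Return a copy of a task payload that only sleeps for `seconds`."""
    new_payload = copy.deepcopy(payload)
    sleep_cmd = ["bash", "-c", f"sleep {seconds}"]
    if "image" in new_payload:
        # docker-worker: the command is a flat list of arguments.
        new_payload["command"] = sleep_cmd
    else:
        # generic-worker: the command is a list of commands, each itself a list.
        new_payload["command"] = [sleep_cmd]
    # Assumption: maxRunTime is in seconds. Never raise it above the old value.
    new_payload["maxRunTime"] = min(new_payload.get("maxRunTime", seconds), seconds)
    return new_payload


docker_payload = {"image": "example-image", "command": ["run-step"], "maxRunTime": 86400}
generic_payload = {"command": [["run-step"]], "maxRunTime": 3600}

print(make_sleep_payload(docker_payload))
print(make_sleep_payload(generic_payload, seconds=1800))
```

The helper does not set the "Interactive" toggle, which still has to be enabled in the task editor.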