Onboarding
Repositories can onboard the Issue Labeler by configuring the necessary settings in the repository and creating the necessary GitHub workflows using the reference examples below. Once the workflows are merged into the repository, only a few steps are necessary to train and promote the models for predictions to begin. The entire process typically takes less than 2 hours for very large repositories.
Using the Issue Labeler's GitHub Actions requires some settings in GitHub to be configured before onboarding can be completed.
Starting from the repository to be onboarded, navigate to Settings > Actions > General (https://github.com/org/repo/settings/actions).
These are the only settings required for running the Issue Labeler workflows.
- Choose: Allow enterprise, and select non-enterprise, actions and reusable workflows
- Enable: Allow actions created by GitHub
- If onboarding a repository outside the dotnet organization, enable: Allow specified actions and reusable workflows, adding `dotnet/issue-labeler/*` to the allow list
- Click Save
While unrelated to the Issue Labeler, it is recommended to select Require approval for all external contributors. If a pull request from an external contributor prompts for approval of its workflow runs, the PR's code should be thoroughly reviewed before approving the run: there are security implications to consider, and it is not typical for a pull request to require such approval unless the PR is expected to introduce new GitHub workflows.
Reference documentation:
- Approving workflow runs from public forks - GitHub Docs
- Security hardening for GitHub Actions - GitHub Docs
- Keeping your GitHub Actions and workflows secure Part 1: Preventing pwn requests | GitHub Security Lab
While unrelated to the Issue Labeler, it is recommended to disable Allow GitHub Actions to create and approve pull requests unless the repository has explicitly configured a workflow for that purpose.
With the required GitHub Actions settings configured, the Issue Labeler can be onboarded by adding the following workflow files into your repository. This is entirely self-service.
This single workflow is manually triggered from the Actions page. Training can be scoped to issues, pull requests, or both (default). The 'Download Data', 'Train Model', and 'Test Model' steps can be run individually, or all steps can be run (default).
When using the defaults to process all steps for both issues and pull requests, the single workflow run will do all the work necessary to prepare a repository for predicting labels on issues and pull requests. Repositories with around 100,000 issues/pulls typically complete the training process in about 2 hours.
Configuration changes to be made to the reference example:
- `env: LABEL_PREFIX`: Change the value to match the area label naming convention for the repository. The prefix must end in something other than a letter or number (see the sketch below).
By default, the workflow will save the new data and models into staged
slots within the cache. The approach of training new models into a staged
slot enables the new model to be tested without disrupting ongoing labeling in the repository. Once a new model is confirmed to meet expectations, it can be promoted to 'ACTIVE'.
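As an illustration of the LABEL_PREFIX change, here is a minimal sketch of the env block from the training workflow below, adjusted for a hypothetical repository whose area labels are named like area/networking. The `area/` prefix is a placeholder and not part of the reference example.

```yaml
# Sketch only: env block adjusted for a hypothetical repository whose
# area labels look like "area/networking".
env:
  CACHE_KEY: ${{ inputs.cache_key_suffix }}
  REPOSITORY: ${{ inputs.repository || github.repository }}
  LABEL_PREFIX: "area/" # ends in '/', satisfying the "not a letter or number" requirement
  THRESHOLD: "0.40"
  LIMIT: ${{ inputs.limit }}
  PAGE_SIZE: ${{ inputs.page_size }}
  PAGE_LIMIT: ${{ inputs.page_limit }}
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from training data
```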
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Train the Issues and Pull Requests models for label prediction

name: "Labeler: Training"

on:
  workflow_dispatch:
    inputs:
      type:
        description: "Issues or Pull Requests"
        type: choice
        required: true
        default: "Both"
        options:
          - "Both"
          - "Issues"
          - "Pull Requests"
      steps:
        description: "Training Steps"
        type: choice
        required: true
        default: "All"
        options:
          - "All"
          - "Download Data"
          - "Train Model"
          - "Test Model"
      repository:
        description: "The org/repo to download data from. Defaults to the current repository."
      limit:
        description: "Max number of items to download for training/testing the model (newest items are used). Defaults to the max number of pages times the page size."
        type: number
      page_size:
        description: "Number of items per page in GitHub API requests. Defaults to 100 for issues, 25 for pull requests."
        type: number
      page_limit:
        description: "Maximum number of pages to download for training/testing the model. Defaults to 1000 for issues, 4000 for pull requests."
        type: number
      cache_key_suffix:
        description: "The cache key suffix to use for staged data/models (use 'ACTIVE' to bypass staging). Defaults to 'staged'."
        required: true
        default: "staged"

env:
  CACHE_KEY: ${{ inputs.cache_key_suffix }}
  REPOSITORY: ${{ inputs.repository || github.repository }}
  LABEL_PREFIX: "area-"
  THRESHOLD: "0.40"
  LIMIT: ${{ inputs.limit }}
  PAGE_SIZE: ${{ inputs.page_size }}
  PAGE_LIMIT: ${{ inputs.page_limit }}
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from training data

jobs:
  download-issues:
    if: ${{ contains(fromJSON('["Both", "Issues"]'), inputs.type) && contains(fromJSON('["All", "Download Data"]'), inputs.steps) }}
    runs-on: ubuntu-latest
    permissions:
      issues: read
    steps:
      - name: "Download Issues"
        uses: dotnet/issue-labeler/download@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}

  download-pulls:
    if: ${{ contains(fromJSON('["Both", "Pull Requests"]'), inputs.type) && contains(fromJSON('["All", "Download Data"]'), inputs.steps) }}
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    steps:
      - name: "Download Pull Requests"
        uses: dotnet/issue-labeler/download@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}

  train-issues:
    if: ${{ always() && contains(fromJSON('["Both", "Issues"]'), inputs.type) && contains(fromJSON('["All", "Train Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.download-issues.result) }}
    runs-on: ubuntu-latest
    permissions: {}
    needs: download-issues
    steps:
      - name: "Train Model for Issues"
        uses: dotnet/issue-labeler/train@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          data_cache_key: ${{ env.CACHE_KEY }}
          model_cache_key: ${{ env.CACHE_KEY }}

  train-pulls:
    if: ${{ always() && contains(fromJSON('["Both", "Pull Requests"]'), inputs.type) && contains(fromJSON('["All", "Train Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.download-pulls.result) }}
    runs-on: ubuntu-latest
    permissions: {}
    needs: download-pulls
    steps:
      - name: "Train Model for Pull Requests"
        uses: dotnet/issue-labeler/train@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          data_cache_key: ${{ env.CACHE_KEY }}
          model_cache_key: ${{ env.CACHE_KEY }}

  test-issues:
    if: ${{ always() && contains(fromJSON('["Both", "Issues"]'), inputs.type) && contains(fromJSON('["All", "Test Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.train-issues.result) }}
    runs-on: ubuntu-latest
    permissions:
      issues: read
    needs: train-issues
    steps:
      - name: "Test Model for Issues"
        uses: dotnet/issue-labeler/test@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}

  test-pulls:
    if: ${{ always() && contains(fromJSON('["Both", "Pull Requests"]'), inputs.type) && contains(fromJSON('["All", "Test Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.train-pulls.result) }}
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    needs: train-pulls
    steps:
      - name: "Test Model for Pull Requests"
        uses: dotnet/issue-labeler/test@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}
```
This workflow will promote issue and/or pull request models into the ACTIVE
cache slot to be used by predictions. The approach of training new models into a staged
slot enables the new model to be tested without disrupting ongoing labeling in the repository. Once a new model is confirmed to meet expectations, it can be promoted.
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Promote a model from staging to 'ACTIVE', backing up the currently 'ACTIVE' model

name: "Labeler: Promotion"

on:
  # Dispatched via the Actions UI, promotes the staged models from
  # a staged slot into the prediction environment
  workflow_dispatch:
    inputs:
      issues:
        description: "Issues: Promote Model"
        type: boolean
        required: true
      pulls:
        description: "Pulls: Promote Model"
        type: boolean
        required: true
      staged_key:
        description: "The cache key suffix to use for promoting a staged model to 'ACTIVE'. Defaults to 'staged'."
        required: true
        default: "staged"
      backup_key:
        description: "The cache key suffix to use for backing up the currently active model. Defaults to 'backup'."
        default: "backup"

permissions:
  actions: write

jobs:
  promote-issues:
    if: ${{ inputs.issues }}
    runs-on: ubuntu-latest
    steps:
      - name: "Promote Model for Issues"
        uses: dotnet/issue-labeler/promote@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          staged_key: ${{ inputs.staged_key }}
          backup_key: ${{ inputs.backup_key }}

  promote-pulls:
    if: ${{ inputs.pulls }}
    runs-on: ubuntu-latest
    steps:
      - name: "Promote Model for Pull Requests"
        uses: dotnet/issue-labeler/promote@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          staged_key: ${{ inputs.staged_key }}
          backup_key: ${{ inputs.backup_key }}
```
Predict labels for issues as they are opened in the repository. This workflow can also be triggered manually to perform labeling in bulk using ranges of issue numbers.
Configuration changes to be made to the reference example:
- `env: LABEL_PREFIX`: Change the value to match the area label naming convention for the repository. The prefix must end in something other than a letter or number.
- `env: DEFAULT_LABEL`: Update the value, or remove the line if the repository does not use a default label when no area can be predicted (see the sketch below).
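For example, here is a minimal sketch of the env block from the workflow below for a hypothetical repository that falls back to an untriaged label when no area can be predicted. The label name is a placeholder, not part of the reference example.

```yaml
# Sketch only: env block adjusted for a hypothetical repository whose
# fallback label is "untriaged"; delete DEFAULT_LABEL entirely if no
# fallback label is used.
env:
  ALLOW_FAILURE: ${{ github.event_name == 'workflow_dispatch' }}
  LABEL_PREFIX: "area-"
  THRESHOLD: 0.40
  DEFAULT_LABEL: "untriaged"
  EXCLUDED_AUTHORS: ""
```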
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Predict labels for Issues using a trained model

name: "Labeler: Predict (Issues)"

on:
  # Only automatically predict area labels when issues are first opened
  issues:
    types: opened

  # Allow dispatching the workflow via the Actions UI, specifying ranges of numbers
  workflow_dispatch:
    inputs:
      issues:
        description: "Issue Numbers (comma-separated list of ranges)."
        required: true
      cache_key:
        description: "The cache key suffix to use for restoring the model. Defaults to 'ACTIVE'."
        required: true
        default: "ACTIVE"

env:
  # Do not allow failure for jobs triggered automatically (as this causes red noise on the workflows list)
  ALLOW_FAILURE: ${{ github.event_name == 'workflow_dispatch' }}
  LABEL_PREFIX: "area-"
  THRESHOLD: 0.40
  DEFAULT_LABEL: "needs-area-label"
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from predictions

jobs:
  predict-issue-label:
    # Do not automatically run the workflow on forks outside the 'dotnet' org
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository_owner == 'dotnet' }}
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - name: "Restore issues model from cache"
        id: restore-model
        uses: dotnet/issue-labeler/restore@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: issues
          fail-on-cache-miss: ${{ env.ALLOW_FAILURE }}
          quiet: true

      - name: "Predict issue labels"
        id: prediction
        if: ${{ steps.restore-model.outputs.cache-hit == 'true' }}
        uses: dotnet/issue-labeler/predict@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          issues: ${{ inputs.issues || github.event.issue.number }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          default_label: ${{ env.DEFAULT_LABEL }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}
        continue-on-error: ${{ !env.ALLOW_FAILURE }}
```
Predict labels for pull requests as they are opened in the repository. This workflow can also be triggered manually to perform labeling in bulk using ranges of pull request numbers.
Configuration changes to be made to the reference example:
- `env: LABEL_PREFIX`: Change the value to match the area label naming convention for the repository. The prefix must end in something other than a letter or number.
- `env: DEFAULT_LABEL`: Update the value, or remove the line if the repository does not use a default label when no area can be predicted.
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Predict labels for Pull Requests using a trained model

name: "Labeler: Predict (Pulls)"

on:
  # Per the following documentation:
  # https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#pull_request_target
  #
  # The `pull_request_target` event runs in the context of the base of the pull request, rather
  # than in the context of the merge commit, as the `pull_request` event does. This prevents
  # execution of unsafe code from the head of the pull request that could alter the repository
  # or steal any secrets you use in your workflow. This event allows your workflow to do things
  # like label or comment on pull requests from forks.
  #
  # Only automatically predict area labels when pull requests are first opened
  pull_request_target:
    types: opened

  # Allow dispatching the workflow via the Actions UI, specifying ranges of numbers
  workflow_dispatch:
    inputs:
      pulls:
        description: "Pull Request Numbers (comma-separated list of ranges)."
        required: true
      cache_key:
        description: "The cache key suffix to use for restoring the model. Defaults to 'ACTIVE'."
        required: true
        default: "ACTIVE"

env:
  # Do not allow failure for jobs triggered automatically (this can block PR merge)
  ALLOW_FAILURE: ${{ github.event_name == 'workflow_dispatch' }}
  LABEL_PREFIX: "area-"
  THRESHOLD: 0.40
  DEFAULT_LABEL: "needs-area-label"
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from predictions

jobs:
  predict-pull-label:
    # Do not automatically run the workflow on forks outside the 'dotnet' org
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository_owner == 'dotnet' }}
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: "Restore pulls model from cache"
        id: restore-model
        uses: dotnet/issue-labeler/restore@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: pulls
          fail-on-cache-miss: ${{ env.ALLOW_FAILURE }}
          quiet: true

      - name: "Predict pull labels"
        id: prediction
        if: ${{ steps.restore-model.outputs.cache-hit == 'true' }}
        uses: dotnet/issue-labeler/predict@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          pulls: ${{ inputs.pulls || github.event.number }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          default_label: ${{ env.DEFAULT_LABEL }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}
        continue-on-error: ${{ !env.ALLOW_FAILURE }}
```
Restores the prediction models from cache, failing if any of the cache entries is missing. This workflow should be called on a daily cron schedule.
Configuration changes to be made to the reference example:
- `cron`: Change the minute and/or hour values (and the comment) to an arbitrary time, as recommended by GitHub (see the sketch below).
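For example, here is a minimal sketch of the schedule trigger with a different arbitrary time; 07:23 UTC is a placeholder, and any minute/hour combination can be used for your repository.

```yaml
# Sketch only: an alternative arbitrary schedule for the workflow below.
on:
  schedule:
    - cron: "23 7 * * *" # 07:23 every day (arbitrary time daily)
```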
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Regularly restore the prediction models from cache to prevent cache eviction

name: "Labeler: Cache Retention"

# For more information about GitHub's action cache limits and eviction policy, see:
# https://docs.github.com/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy

on:
  schedule:
    - cron: "42 18 * * *" # 18:42 every day (arbitrary time daily)

  workflow_dispatch:
    inputs:
      cache_key:
        description: "The cache key suffix to use for restoring the model from cache. Defaults to 'ACTIVE'."
        required: true
        default: "ACTIVE"

env:
  CACHE_KEY: ${{ inputs.cache_key || 'ACTIVE' }}

jobs:
  restore-cache:
    # Do not automatically run the workflow on forks outside the 'dotnet' org
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository_owner == 'dotnet' }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        type: ["issues", "pulls"]
    steps:
      - uses: dotnet/issue-labeler/restore@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: ${{ matrix.type }}
          cache_key: ${{ env.CACHE_KEY }}
          fail-on-cache-miss: true
```
To train the issue and pull models, navigate to the Actions page for the repository and select Labeler: Training in the list of workflows on the left.
A blue banner will be displayed indicating, "This workflow has a workflow_dispatch
event trigger." Click Run workflow. Leaving all of the inputs on their defaults will conduct the entire download/train/test process for both issues and pull requests.
Click the Run workflow button to start the training process. Progress can be monitored from the workflow run's details page.
Once the workflow completes, the result will be a pair of models saved into the GitHub Action Cache using a 'staged' cache key suffix. There will also be data files saved into the GitHub Action Cache, also using the 'staged' cache key suffix.
Within the workflow run's summary, results from the download, train, and test steps will be presented as those jobs complete for both issues and pull requests. The test summaries show data capturing the prediction accuracy against existing data in the repository, with notes about whether the results are considered favorable.
The results show:
- Matches: The predicted label matches the existing label, including when no prediction is made and there is no existing label. Correct prediction.
- Mismatches: The predicted label does not match the existing label. Incorrect prediction.
- No Prediction: No prediction was made, but the existing item had a label. Incorrect prediction.
- No Existing Label: A prediction was made, but there was no existing label. Incorrect prediction.
If the Matches percentage is at least 65% and the Mismatches percentage is less than 10%, the model testing is considered favorable.
If your repository's results are less favorable than 65% Matches, it is recommended that you review your existing issues' and pulls' labels to ensure they are labeled accurately. After refining the labels, the Labeler: Training workflow can be re-run to review the new results.
When re-running training, either delete the existing 'staged' entries from GitHub's Action Cache, or use a new cache key on the subsequent runs. If retraining is run while conflicting cache entries exist, the job summary will provide guidance.
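If you choose to delete the staged entries, the GitHub CLI's gh cache commands can be used. The following is a hypothetical helper workflow, not part of the Issue Labeler reference examples; it assumes you identify the staged cache keys from the listing output and pass them back in for deletion.

```yaml
# Hypothetical helper workflow (not part of the Issue Labeler reference examples).
# Lists the repository's Action Cache entries so the 'staged' keys can be identified,
# then deletes any keys passed in via the 'keys' input.
name: "Labeler: Delete Staged Cache Entries"

on:
  workflow_dispatch:
    inputs:
      keys:
        description: "Space-separated cache keys to delete (leave empty to only list entries)."
        required: false

permissions:
  actions: write # required to list and delete Action Cache entries

jobs:
  clean-cache:
    runs-on: ubuntu-latest
    steps:
      - name: "List cache entries"
        run: gh cache list --repo "$GITHUB_REPOSITORY" --limit 100
        env:
          GH_TOKEN: ${{ github.token }}

      - name: "Delete requested cache entries"
        if: ${{ inputs.keys != '' }}
        run: |
          for key in $KEYS; do
            gh cache delete "$key" --repo "$GITHUB_REPOSITORY"
          done
        env:
          KEYS: ${{ inputs.keys }}
          GH_TOKEN: ${{ github.token }}
```

Alternatively, a different cache_key_suffix value (for example, a hypothetical 'staged-2') can be supplied when re-running Labeler: Training, with the same suffix passed as staged_key when promoting.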
Once models are trained with favorable results, they can be promoted into the ACTIVE cache entries to be consumed by the prediction workflows. From the Actions page, select Labeler: Promotion in the list of workflows on the left.
A blue banner will be displayed indicating, "This workflow has a workflow_dispatch event trigger." Click Run workflow. The checkboxes for Issues: Promote Model and Pulls: Promote Model are unchecked by default. Check both boxes and click Run workflow to promote the models trained and staged above into immediate use by the prediction workflows.
The promotion workflow offers the ability to create a backup of any existing 'ACTIVE' models. If needed, the promotion workflow can promote from the 'backup' key suffix back into 'ACTIVE'.
The cache retention workflow that was added is configured to run on a daily schedule, ensuring that the trained models are restored from cache at least once daily to prevent cache evictions after 7 days of no use.
It is recommended to manually run the cache retention workflow after onboarding to test the workflow in your repository.
From the Actions page, select Labeler: Cache Retention from the list of workflows on the left. Choose Run workflow and click the Run workflow button.
The Labeler: Predict (Issues) and Labeler: Predict (Pulls) workflows can be invoked manually through GitHub's Actions page, and they will also run automatically when new issues and pull requests are opened.
When running manually, a comma-separated list of number ranges can be entered, or the field can be left empty to run prediction over all issues/pulls that do not have an appropriate label. After onboarding, if there are issues or pulls that have not already been labeled, these workflows can be run to fill in those gaps and test the results of the Issue Labeler over new issues/pulls.
When running bulk prediction jobs, be aware that GitHub's API rate limit applies and can cause requests for downloading issues/pulls and updating labels to fail. The job may then fail outright or be delayed while a back-off retry strategy is applied. Expect to process about 2,000 issues or pull requests per hour before the rate limit is reached.