Onboarding
Repositories can onboard the Issue Labeler by configuring the necessary settings in the repository and creating the necessary GitHub workflows using the reference examples below. Once the workflows are merged into the repository, only a few steps are necessary to train and promote the models for predictions to begin. The entire process typically takes less than 2 hours for very large repositories.
Using the Issue Labeler's GitHub Actions requires some settings in GitHub to be configured before onboarding can be completed.
Starting from the repository to be onboarded, navigate to Settings > Actions > General (https://github.com/org/repo/settings/actions).
These are the only settings required for running the Issue Labeler workflows.
- Choose: Allow enterprise, and select non-enterprise, actions and reusable workflows
- Enable: Allow actions created by GitHub
- If onboarding a repository outside the dotnet organization, enable: Allow specified actions and reusable workflows, adding `dotnet/issue-labeler/*` to the allow list
- Click Save
While unrelated to the Issue Labeler, it is recommended to select Require approval for all external contributors. If a pull request from an external contributor prompts for approval of its workflow runs, the PR's code should be thoroughly reviewed before approving the run: there are security implications to consider, and it is not typical for a pull request to require such approval unless the PR is expected to introduce new GitHub workflows.
Reference documentation:
- Approving workflow runs from public forks - GitHub Docs
- Security hardening for GitHub Actions - GitHub Docs
- Keeping your GitHub Actions and workflows secure Part 1: Preventing pwn requests | GitHub Security Lab
While unrelated to the Issue Labeler, it is recommended to disable Allow GitHub Actions to create and approve pull requests unless the repository has explicitly configured a workflow for that purpose.
With the required GitHub Actions settings configured, the Issue Labeler can be onboarded by adding the following workflow files into your repository. This is entirely self-service.
This single workflow is manually triggered from the Actions page. Training can be scoped to issues, pull requests, or both (default). The 'Download Data', 'Train Model', and 'Test Model' steps can be run individually, or all steps can be run (default).
When using the defaults to process all steps for both issues and pull requests, the single workflow run will do all the work necessary to prepare a repository for predicting labels on issues and pull requests. Repositories with around 100,000 issues/pulls typically complete the training process in about 2 hours.
Configuration changes to be made to the reference example:
- `env: LABEL_PREFIX`: Change the value to match the area label naming convention for the repository. The prefix must end in something other than a letter or number (see the sketch below).
By default, the workflow will save the new data and models into staged
slots within the cache. The approach of training new models into a staged
slot enables the new model to be tested without disrupting ongoing labeling in the repository. Once a new model is confirmed to meet expectations, it can be promoted to 'ACTIVE'.
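As an illustration of the LABEL_PREFIX change, here is a minimal sketch of the env block from the training workflow below, adjusted for a hypothetical repository whose area labels are named like area/networking. The `area/` prefix is a placeholder and not part of the reference example.

```yaml
# Sketch only: env block adjusted for a hypothetical repository whose
# area labels look like "area/networking".
env:
  CACHE_KEY: ${{ inputs.cache_key_suffix }}
  REPOSITORY: ${{ inputs.repository || github.repository }}
  LABEL_PREFIX: "area/" # ends in '/', satisfying the "not a letter or number" requirement
  THRESHOLD: "0.40"
  LIMIT: ${{ inputs.limit }}
  PAGE_SIZE: ${{ inputs.page_size }}
  PAGE_LIMIT: ${{ inputs.page_limit }}
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from training data
```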
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Train the Issues and Pull Requests models for label prediction

name: "Labeler: Training"

on:
  workflow_dispatch:
    inputs:
      type:
        description: "Issues or Pull Requests"
        type: choice
        required: true
        default: "Both"
        options:
          - "Both"
          - "Issues"
          - "Pull Requests"
      steps:
        description: "Training Steps"
        type: choice
        required: true
        default: "All"
        options:
          - "All"
          - "Download Data"
          - "Train Model"
          - "Test Model"
      repository:
        description: "The org/repo to download data from. Defaults to the current repository."
      limit:
        description: "Max number of items to download for training/testing the model (newest items are used). Defaults to the max number of pages times the page size."
        type: number
      page_size:
        description: "Number of items per page in GitHub API requests. Defaults to 100 for issues, 25 for pull requests."
        type: number
      page_limit:
        description: "Maximum number of pages to download for training/testing the model. Defaults to 1000 for issues, 4000 for pull requests."
        type: number
      cache_key_suffix:
        description: "The cache key suffix to use for staged data/models (use 'ACTIVE' to bypass staging). Defaults to 'staged'."
        required: true
        default: "staged"

env:
  CACHE_KEY: ${{ inputs.cache_key_suffix }}
  REPOSITORY: ${{ inputs.repository || github.repository }}
  LABEL_PREFIX: "area-"
  THRESHOLD: "0.40"
  LIMIT: ${{ inputs.limit }}
  PAGE_SIZE: ${{ inputs.page_size }}
  PAGE_LIMIT: ${{ inputs.page_limit }}
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from training data

jobs:
  download-issues:
    if: ${{ contains(fromJSON('["Both", "Issues"]'), inputs.type) && contains(fromJSON('["All", "Download Data"]'), inputs.steps) }}
    runs-on: ubuntu-latest
    permissions:
      issues: read
    steps:
      - name: "Download Issues"
        uses: dotnet/issue-labeler/download@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}

  download-pulls:
    if: ${{ contains(fromJSON('["Both", "Pull Requests"]'), inputs.type) && contains(fromJSON('["All", "Download Data"]'), inputs.steps) }}
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    steps:
      - name: "Download Pull Requests"
        uses: dotnet/issue-labeler/download@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}

  train-issues:
    if: ${{ always() && contains(fromJSON('["Both", "Issues"]'), inputs.type) && contains(fromJSON('["All", "Train Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.download-issues.result) }}
    runs-on: ubuntu-latest
    permissions: {}
    needs: download-issues
    steps:
      - name: "Train Model for Issues"
        uses: dotnet/issue-labeler/train@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          data_cache_key: ${{ env.CACHE_KEY }}
          model_cache_key: ${{ env.CACHE_KEY }}

  train-pulls:
    if: ${{ always() && contains(fromJSON('["Both", "Pull Requests"]'), inputs.type) && contains(fromJSON('["All", "Train Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.download-pulls.result) }}
    runs-on: ubuntu-latest
    permissions: {}
    needs: download-pulls
    steps:
      - name: "Train Model for Pull Requests"
        uses: dotnet/issue-labeler/train@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          data_cache_key: ${{ env.CACHE_KEY }}
          model_cache_key: ${{ env.CACHE_KEY }}

  test-issues:
    if: ${{ always() && contains(fromJSON('["Both", "Issues"]'), inputs.type) && contains(fromJSON('["All", "Test Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.train-issues.result) }}
    runs-on: ubuntu-latest
    permissions:
      issues: read
    needs: train-issues
    steps:
      - name: "Test Model for Issues"
        uses: dotnet/issue-labeler/test@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}

  test-pulls:
    if: ${{ always() && contains(fromJSON('["Both", "Pull Requests"]'), inputs.type) && contains(fromJSON('["All", "Test Model"]'), inputs.steps) && contains(fromJSON('["success", "skipped"]'), needs.train-pulls.result) }}
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    needs: train-pulls
    steps:
      - name: "Test Model for Pull Requests"
        uses: dotnet/issue-labeler/test@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          cache_key: ${{ env.CACHE_KEY }}
          repository: ${{ env.REPOSITORY }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          limit: ${{ env.LIMIT }}
          page_size: ${{ env.PAGE_SIZE }}
          page_limit: ${{ env.PAGE_LIMIT }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}
```
This workflow will promote issue and/or pull request models into the ACTIVE
cache slot to be used by predictions. The approach of training new models into a staged
slot enables the new model to be tested without disrupting ongoing labeling in the repository. Once a new model is confirmed to meet expectations, it can be promoted.
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Promote a model from staging to 'ACTIVE', backing up the currently 'ACTIVE' model

name: "Labeler: Promotion"

on:
  # Dispatched via the Actions UI, promotes the staged models from
  # a staged slot into the prediction environment
  workflow_dispatch:
    inputs:
      issues:
        description: "Issues: Promote Model"
        type: boolean
        required: true
      pulls:
        description: "Pulls: Promote Model"
        type: boolean
        required: true
      staged_key:
        description: "The cache key suffix to use for promoting a staged model to 'ACTIVE'. Defaults to 'staged'."
        required: true
        default: "staged"
      backup_key:
        description: "The cache key suffix to use for backing up the currently active model. Defaults to 'backup'."
        default: "backup"

permissions:
  actions: write

jobs:
  promote-issues:
    if: ${{ inputs.issues }}
    runs-on: ubuntu-latest
    steps:
      - name: "Promote Model for Issues"
        uses: dotnet/issue-labeler/promote@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "issues"
          staged_key: ${{ inputs.staged_key }}
          backup_key: ${{ inputs.backup_key }}

  promote-pulls:
    if: ${{ inputs.pulls }}
    runs-on: ubuntu-latest
    steps:
      - name: "Promote Model for Pull Requests"
        uses: dotnet/issue-labeler/promote@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: "pulls"
          staged_key: ${{ inputs.staged_key }}
          backup_key: ${{ inputs.backup_key }}
```
Predict labels for issues as they are opened in the repository. This workflow can also be triggered manually to perform labeling in bulk using ranges of issue numbers.
Configuration changes to be made to the reference example:
- `env: LABEL_PREFIX`: Change the value to match the area label naming convention for the repository. The prefix must end in something other than a letter or number.
- `env: DEFAULT_LABEL`: Update the value, or remove the line if the repository does not use a default label when no area can be predicted (see the sketch below).
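For example, here is a minimal sketch of the env block from the workflow below for a hypothetical repository that falls back to an untriaged label when no area can be predicted. The label name is a placeholder, not part of the reference example.

```yaml
# Sketch only: env block adjusted for a hypothetical repository whose
# fallback label is "untriaged"; delete DEFAULT_LABEL entirely if no
# fallback label is used.
env:
  ALLOW_FAILURE: ${{ github.event_name == 'workflow_dispatch' }}
  LABEL_PREFIX: "area-"
  THRESHOLD: 0.40
  DEFAULT_LABEL: "untriaged"
  EXCLUDED_AUTHORS: ""
```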
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Predict labels for Issues using a trained model

name: "Labeler: Predict (Issues)"

on:
  # Only automatically predict area labels when issues are first opened
  issues:
    types: opened

  # Allow dispatching the workflow via the Actions UI, specifying ranges of numbers
  workflow_dispatch:
    inputs:
      issues:
        description: "Issue Numbers (comma-separated list of ranges)."
        required: true
      cache_key:
        description: "The cache key suffix to use for restoring the model. Defaults to 'ACTIVE'."
        required: true
        default: "ACTIVE"

env:
  # Do not allow failure for jobs triggered automatically (as this causes red noise on the workflows list)
  ALLOW_FAILURE: ${{ github.event_name == 'workflow_dispatch' }}
  LABEL_PREFIX: "area-"
  THRESHOLD: 0.40
  DEFAULT_LABEL: "needs-area-label"
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from predictions

jobs:
  predict-issue-label:
    # Do not automatically run the workflow on forks outside the 'dotnet' org
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository_owner == 'dotnet' }}
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - name: "Restore issues model from cache"
        id: restore-model
        uses: dotnet/issue-labeler/restore@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: issues
          fail-on-cache-miss: ${{ env.ALLOW_FAILURE }}
          quiet: true

      - name: "Predict issue labels"
        id: prediction
        if: ${{ steps.restore-model.outputs.cache-hit == 'true' }}
        uses: dotnet/issue-labeler/predict@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          issues: ${{ inputs.issues || github.event.issue.number }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          default_label: ${{ env.DEFAULT_LABEL }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}
        continue-on-error: ${{ !env.ALLOW_FAILURE }}
```
Predict labels for pull requests as they are opened in the repository. This workflow can also be triggered manually to perform labeling in bulk using ranges of pull request numbers.
Configuration changes to be made to the reference example:
- `env: LABEL_PREFIX`: Change the value to match the area label naming convention for the repository. The prefix must end in something other than a letter or number.
- `env: DEFAULT_LABEL`: Update the value, or remove the line if the repository does not use a default label when no area can be predicted.
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Predict labels for Pull Requests using a trained model

name: "Labeler: Predict (Pulls)"

on:
  # Per the following documentation:
  # https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#pull_request_target
  #
  # The `pull_request_target` event runs in the context of the base of the pull request, rather
  # than in the context of the merge commit, as the `pull_request` event does. This prevents
  # execution of unsafe code from the head of the pull request that could alter the repository
  # or steal any secrets you use in your workflow. This event allows your workflow to do things
  # like label or comment on pull requests from forks.
  #
  # Only automatically predict area labels when pull requests are first opened
  pull_request_target:
    types: opened

  # Allow dispatching the workflow via the Actions UI, specifying ranges of numbers
  workflow_dispatch:
    inputs:
      pulls:
        description: "Pull Request Numbers (comma-separated list of ranges)."
        required: true
      cache_key:
        description: "The cache key suffix to use for restoring the model. Defaults to 'ACTIVE'."
        required: true
        default: "ACTIVE"

env:
  # Do not allow failure for jobs triggered automatically (this can block PR merge)
  ALLOW_FAILURE: ${{ github.event_name == 'workflow_dispatch' }}
  LABEL_PREFIX: "area-"
  THRESHOLD: 0.40
  DEFAULT_LABEL: "needs-area-label"
  EXCLUDED_AUTHORS: "" # Comma-separated list of authors to exclude from predictions

jobs:
  predict-pull-label:
    # Do not automatically run the workflow on forks outside the 'dotnet' org
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository_owner == 'dotnet' }}
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: "Restore pulls model from cache"
        id: restore-model
        uses: dotnet/issue-labeler/restore@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: pulls
          fail-on-cache-miss: ${{ env.ALLOW_FAILURE }}
          quiet: true

      - name: "Predict pull labels"
        id: prediction
        if: ${{ steps.restore-model.outputs.cache-hit == 'true' }}
        uses: dotnet/issue-labeler/predict@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          pulls: ${{ inputs.pulls || github.event.number }}
          label_prefix: ${{ env.LABEL_PREFIX }}
          threshold: ${{ env.THRESHOLD }}
          default_label: ${{ env.DEFAULT_LABEL }}
          excluded_authors: ${{ env.EXCLUDED_AUTHORS }}
        env:
          GITHUB_TOKEN: ${{ github.token }}
        continue-on-error: ${{ !env.ALLOW_FAILURE }}
```
Restores the prediction models from cache, failing if any of the cache entries is missing. This workflow should be called on a daily cron schedule.
Configuration changes to be made to the reference example:
- `cron`: Change the minute and/or hour values (and the comment) to an arbitrary time, as recommended by GitHub (see the sketch below).
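For example, here is a minimal sketch of the schedule trigger with a different arbitrary time; 07:23 UTC is a placeholder, and any minute/hour combination can be used for your repository.

```yaml
# Sketch only: an alternative arbitrary schedule for the workflow below.
on:
  schedule:
    - cron: "23 7 * * *" # 07:23 every day (arbitrary time daily)
```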
```yaml
# Workflow template imported and updated from:
# https://github.com/dotnet/issue-labeler/wiki/Onboarding
#
# Regularly restore the prediction models from cache to prevent cache eviction

name: "Labeler: Cache Retention"

# For more information about GitHub's action cache limits and eviction policy, see:
# https://docs.github.com/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy

on:
  schedule:
    - cron: "42 18 * * *" # 18:42 every day (arbitrary time daily)

  workflow_dispatch:
    inputs:
      cache_key:
        description: "The cache key suffix to use for restoring the model from cache. Defaults to 'ACTIVE'."
        required: true
        default: "ACTIVE"

env:
  CACHE_KEY: ${{ inputs.cache_key || 'ACTIVE' }}

jobs:
  restore-cache:
    # Do not automatically run the workflow on forks outside the 'dotnet' org
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository_owner == 'dotnet' }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        type: ["issues", "pulls"]
    steps:
      - uses: dotnet/issue-labeler/restore@46125e85e6a568dc712f358c39f35317366f5eed # v2.0.0
        with:
          type: ${{ matrix.type }}
          cache_key: ${{ env.CACHE_KEY }}
          fail-on-cache-miss: true
```
To train the issue and pull models, navigate to the Actions page for the repository and select Labeler: Training in the list of workflows on the left.
A blue banner will be displayed indicating, "This workflow has a workflow_dispatch
event trigger." Click Run workflow. Leaving all of the inputs on their defaults will conduct the entire download/train/test process for both issues and pull requests.
Click the Run workflow button to start the training process. Progress can be monitored from the workflow run's details page.
Once the workflow completes, the result will be a pair of models saved into the GitHub Action Cache using a 'staged' cache key suffix. There will also be data files saved into the GitHub Action Cache, also using the 'staged' cache key suffix.
Within the workflow run's summary, results from the download, train, and test steps will be presented as those jobs complete for both issues and pull requests. The test summaries show data capturing the prediction accuracy against existing data in the repository, with notes about whether the results are considered favorable.
The results show:
- Matches: The predicted label matches the existing label, including when no prediction is made and there is no existing label. Correct prediction.
- Mismatches: The predicted label does not match the existing label. Incorrect prediction.
- No Prediction: No prediction was made, but the existing item had a label. Incorrect prediction.
- No Existing Label: A prediction was made, but there was no existing label. Incorrect prediction.
If the Matches percentage is at least 65% and the Mismatches percentage is less than 10%, the model testing is considered favorable.
If your repository's results are less favorable than 65% Matches, it is recommended that you review your existing issues' and pulls' labels to ensure they are labeled accurately. After refining the labels, the Labeler: Training workflow can be re-run to review the new results.
When re-running training, either delete the existing 'staged' entries from GitHub's Action Cache, or use a new cache key on the subsequent runs. If retraining is run while conflicting cache entries exist, the job summary will provide guidance.
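If you choose to delete the staged entries, the GitHub CLI's gh cache commands can be used. The following is a hypothetical helper workflow, not part of the Issue Labeler reference examples; it assumes you identify the staged cache keys from the listing output and pass them back in for deletion.

```yaml
# Hypothetical helper workflow (not part of the Issue Labeler reference examples).
# Lists the repository's Action Cache entries so the 'staged' keys can be identified,
# then deletes any keys passed in via the 'keys' input.
name: "Labeler: Delete Staged Cache Entries"

on:
  workflow_dispatch:
    inputs:
      keys:
        description: "Space-separated cache keys to delete (leave empty to only list entries)."
        required: false

permissions:
  actions: write # required to list and delete Action Cache entries

jobs:
  clean-cache:
    runs-on: ubuntu-latest
    steps:
      - name: "List cache entries"
        run: gh cache list --repo "$GITHUB_REPOSITORY" --limit 100
        env:
          GH_TOKEN: ${{ github.token }}

      - name: "Delete requested cache entries"
        if: ${{ inputs.keys != '' }}
        run: |
          for key in $KEYS; do
            gh cache delete "$key" --repo "$GITHUB_REPOSITORY"
          done
        env:
          KEYS: ${{ inputs.keys }}
          GH_TOKEN: ${{ github.token }}
```

Alternatively, a different cache_key_suffix value (for example, a hypothetical 'staged-2') can be supplied when re-running Labeler: Training, with the same suffix passed as staged_key when promoting.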
Once models are trained with favorable results, they can be promoted into the ACTIVE cache entries to be consumed by the prediction workflows. From the Actions page, select Labeler: Promotion in the list of workflows on the left.
A blue banner will be displayed indicating, "This workflow has a workflow_dispatch event trigger." Click Run workflow. The checkboxes for Issues: Promote Model and Pulls: Promote Model are unchecked by default. Check both boxes and click Run workflow to promote the models trained and staged above into immediate use by the prediction workflows.
The promotion workflow offers the ability to create a backup of any existing 'ACTIVE' models. If needed, the promotion workflow can promote from the 'backup' key suffix back into 'ACTIVE'.
The cache retention workflow that was added is configured to run on a daily schedule, ensuring that the trained models are restored from cache at least once daily to prevent cache evictions after 7 days of no use.
It is recommended to manually run the cache retention workflow after onboarding to test the workflow in your repository.
From the Actions page, select Labeler: Cache Retention from the list of workflows on the left. Choose Run workflow and click the Run workflow button.
The Labeler: Predict (Issues) and Labeler: Predict (Pulls) workflows can be invoked manually through GitHub's Actions page, and they will also run automatically when new issues and pull requests are opened.
When running manually, a comma-separated list of number ranges can be entered, or the field can be left empty to run prediction over all issues/pulls that do not have an appropriate label. After onboarding, if there are issues or pulls that have not already been labeled, these workflows can be run to fill in those gaps and test the results of the Issue Labeler over new issues/pulls.
When running bulk prediction jobs, be aware that GitHub's API rate limit applies and can cause requests for downloading issues/pulls and updating labels to fail. The job may then fail outright or be delayed while a back-off retry strategy is applied. Expect to process about 2,000 issues or pull requests per hour before the rate limit is reached.