Here is a checklist of all the different services used by the HUD. Ask @janeyx99 or @suo for help getting access to these services.
- ClickHouse: primary data and metrics backend.
- Vercel: hosting the website. If you are a metamate, make a post like this in the Open Source - Support group to get access to Vercel.
- Sematext: log drain for our Vercel instance.
- AWS: data pipelines for populating ClickHouse, Lambda, S3, etc.
- Install `yarn`, which we use for package and project management.
- Install the required dependencies for the project: `yarn install`
- Set up your `.env.local` file with various keys and permissions. Follow the instructions in `.env.example`.
- Run the development server: `yarn dev`
Open http://localhost:3000 with your browser to see the result! Any edits you
make to the code will be reflected immediately in the browser.

You can find additional `yarn` commands in `package.json` under the `scripts`
section, such as `yarn test` to run the test suite.
We use Next.js as our framework. To learn more about Next.js, take a look at the following resources:
- Next.js Documentation - learn about Next.js features and API.
- Learn Next.js - an interactive Next.js tutorial.
To run tests, first make sure you're in the `torchci` folder, and then:

- To run all tests: `yarn test`
- To run all tests in a specific file: `yarn test <path-to-file>`
  - e.g. `yarn test test/autoLabelBot.test.ts`
- To run a specific test in a specific file: `yarn test <path-to-file> -t "<part-of-test-name>"`
  - e.g. `yarn test test/autoLabelBot.test.ts -t "triage"`
  - Note: this runs every test whose name contains the string you entered.
The easiest way to develop probot actions is to use `nock` to mock out
interactions with the GitHub API and develop completely locally. If you do
need real webhooks, follow these instructions to configure a repo to send
webhooks to a Smee proxy, which will then forward them to your local server.
We use Vercel as our deployment platform. Pushes to `main` and any other
branches will automatically be deployed to Vercel; check out the bot comments
on your PR for links to view the deployment.
Logs for the Vercel instance can be found in Sematext.
If you are familiar with the old Rockset setup: ClickHouse does not have
versioned query lambdas. Instead, queries are defined in `clickhouse_queries/`,
and HUD sends the entire query text to ClickHouse, the same way Rockset did
for queries not defined using a query lambda.
Each query should have a folder in `clickhouse_queries/` with two files: one
containing the query, and the other containing a JSON dictionary with a
dictionary `params`, mapping parameters to their types, and a list `tests` of
sample values for the query.
To edit the query, only these files need to be changed. The change will be reflected immediately in your local development and in the Vercel preview when you submit your PR.
If you want to test your query in ClickHouse Cloud's console, you need to copy the query text into the console. If you make changes, you will have to copy the query back into the file.
To get access to ClickHouse Cloud's console, please see here.
An example `params.json` file with params and tests:
```json
{
  "params": {
    "A": "DateTime64(3)"
  },
  "tests": [
    {"A": "2024-01-01 00:00:00.000"},
    {"A": "2024-01-07 00:00:00.000"},
    {"A": "2025-01-01 00:00:00.000"},
    {"A": {"from_now": 0}}
  ]
}
```
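As a sanity check before submitting a PR, it can help to verify that every entry in `tests` matches the declared `params`. The sketch below is not part of HUD; the function name and error messages are illustrative, and it operates on an already-parsed object rather than reading the file from disk.

```typescript
// Illustrative validator (not part of HUD): check that each test entry in a
// parsed params.json provides exactly the parameters declared in `params`.
interface ParamsFile {
  params: Record<string, string>; // parameter name -> ClickHouse type
  tests: Record<string, unknown>[]; // sample values for the query
}

function validateParamsFile(file: ParamsFile): string[] {
  const errors: string[] = [];
  const declared = Object.keys(file.params);
  file.tests.forEach((test, i) => {
    // Every declared parameter must have a sample value.
    for (const name of declared) {
      if (!(name in test)) {
        errors.push(`test ${i} is missing a value for parameter "${name}"`);
      }
    }
    // No test should set a parameter that was never declared.
    for (const name of Object.keys(test)) {
      if (!declared.includes(name)) {
        errors.push(`test ${i} sets undeclared parameter "${name}"`);
      }
    }
  });
  return errors;
}
```

For the example file above, `validateParamsFile` would return an empty list, since every test supplies a value for `A` and nothing else.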
A test can set a parameter to be a dictionary with the field `from_now` to get
a dynamic timestamp, where the entry is the difference from now in days. For
example, `from_now: 0` is now and `from_now: -7` would be 7 days in the past.
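The `from_now` resolution described above can be sketched as follows. This is an illustrative helper, not the actual HUD implementation; the function name and the exact output format are assumptions based on the `DateTime64(3)` literals in the example.

```typescript
// Sketch: resolve a `from_now` day offset into a concrete timestamp string
// shaped like the DateTime64(3) literals above ("YYYY-MM-DD hh:mm:ss.mmm").
function resolveFromNow(daysFromNow: number, now: Date = new Date()): string {
  const ts = new Date(now.getTime() + daysFromNow * 24 * 60 * 60 * 1000);
  // toISOString() yields "YYYY-MM-DDThh:mm:ss.mmmZ"; strip the T and Z.
  return ts.toISOString().replace("T", " ").replace("Z", "");
}
```

With this, `{"A": {"from_now": -7}}` would behave like a fixed timestamp seven days before the test run.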
Code is in `test-infra/tools/torchci/check_alerts.py`. It queries HUD, filters
out pending jobs, and then checks whether there are 2 consecutive SHAs that
have the same failing job. If there are, it will either create a new GitHub
issue or update the existing one.
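The core rule above can be sketched as a small function. Note the real logic lives in `check_alerts.py` (Python); this TypeScript sketch only illustrates the "same job failing on 2 consecutive SHAs" check, and the types and names are hypothetical.

```typescript
// Illustrative sketch of the alerting rule: given job results for a sequence
// of commits (ordered newest to oldest, pending jobs already filtered out),
// return the jobs that failed on two consecutive SHAs.
interface CommitJobs {
  sha: string;
  failedJobs: Set<string>;
}

function jobsFailingTwiceInARow(commits: CommitJobs[]): Set<string> {
  const alerting = new Set<string>();
  for (let i = 0; i + 1 < commits.length; i++) {
    for (const job of commits[i].failedJobs) {
      // The same job name failing on this SHA and the next one triggers an alert.
      if (commits[i + 1].failedJobs.has(job)) {
        alerting.add(job);
      }
    }
  }
  return alerting;
}
```

A job that fails on only one SHA (e.g. a flaky failure) would not appear in the result, which matches the 2-consecutive-SHAs rule above.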
A Meta-internal Butterfly bot rule triggers when the task is created or updated, assigning the task to the oncall so the DevX team is notified.
Butterfly bot links:
- When a new alert is created
- When pytorch/pytorch failures are edited
- When flaky test detector bot alerts are edited
If you ever need to modify deployment settings such as the OAuth callbacks or domain names, there are a few places where you need to change them. Here's a list:
- DNS Registry/Certificates (contact the OSS team)
- Environment Variables
- OAuth Project / OAuth Project Local
- Domain Management