Here is a checklist of all the different services used by the HUD. Ask @janeyx99 or @suo for help getting access to these services.
- ClickHouse: primary data and metrics backend.
- Vercel: hosting the website. If you are a metamate, make a post like this in the Open Source - Support group to get access to Vercel.
- Sematext: log drain for our Vercel instance.
- AWS: data pipelines for populating ClickHouse, Lambda, S3, etc.
- Install `yarn`, which we use for package and project management.
- Install the required dependencies for the project: `yarn install`
- Set up your `.env.local` file with various keys and permissions. Follow the instructions in `.env.example`.
- Run the development server: `yarn dev`
Open http://localhost:3000 with your browser to see the result! Any edits you
make to the code will be reflected immediately in the browser.

You can find additional `yarn` commands in `package.json` under the `scripts`
section, such as `yarn test` to run the test suite.
We use Next.js as our framework. To learn more about Next.js, take a look at the following resources:
- Next.js Documentation - learn about Next.js features and API.
- Learn Next.js - an interactive Next.js tutorial.
To run tests, first make sure you're in the `torchci` folder, and then:

- To run all tests: `yarn test`
- To run all tests in a specific file: `yarn test <path-to-file>`
  - e.g. `yarn test test/autoLabelBot.test.ts`
- To run a specific test in a specific file: `yarn test <path-to-file> -t "<part-of-test-name>"`
  - e.g. `yarn test test/autoLabelBot.test.ts -t "triage"`
  - Note: this runs every test whose name contains the string you entered.
The easiest way to develop probot actions is to use `nock` to mock out
interactions with the GitHub API and develop completely locally. If you do
need real webhooks, follow these instructions to configure a repo to send
webhooks to a Smee proxy, which will then forward them to your local server.
We use Vercel as our deployment platform. Pushes to `main` and any other
branches will automatically be deployed to Vercel; check out the bot comments
on your PR for links to view the deployment.
Logs for the Vercel instance can be found in Sematext.
If you are familiar with the old Rockset setup: ClickHouse does not have
versioned query lambdas. Instead, queries are defined in `clickhouse_queries/`,
and HUD sends the entire query text to ClickHouse, the same way Rockset did
for queries not defined using a query lambda.
Each query should have a folder in `clickhouse_queries/` with two files: one
containing the query, and the other containing a JSON dictionary with a
dictionary `params`, mapping parameters to their types, and a list `tests` of
sample values for the query.
To edit the query, only these files need to be changed. The change will be reflected immediately in your local development and in the Vercel preview when you submit your PR.
If you want to test your query in ClickHouse Cloud's console, you need to copy the query text into the console. If you make changes, you will have to copy the query back into the file.
To get access to ClickHouse Cloud's console, please see here.
An example `params.json` file with params and tests:
```json
{
  "params": {
    "A": "DateTime64(3)"
  },
  "tests": [
    {"A": "2024-01-01 00:00:00.000"},
    {"A": "2024-01-07 00:00:00.000"},
    {"A": "2025-01-01 00:00:00.000"},
    {"A": {"from_now": 0}}
  ]
}
```
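As a sanity check before submitting a PR, it can help to verify that every entry in `tests` matches the declared `params`. The sketch below is not part of HUD; the function name and error messages are illustrative, and it operates on an already-parsed object rather than reading the file from disk.

```typescript
// Illustrative validator (not part of HUD): check that each test entry in a
// parsed params.json provides exactly the parameters declared in `params`.
interface ParamsFile {
  params: Record<string, string>; // parameter name -> ClickHouse type
  tests: Record<string, unknown>[]; // sample values for the query
}

function validateParamsFile(file: ParamsFile): string[] {
  const errors: string[] = [];
  const declared = Object.keys(file.params);
  file.tests.forEach((test, i) => {
    // Every declared parameter must have a sample value.
    for (const name of declared) {
      if (!(name in test)) {
        errors.push(`test ${i} is missing a value for parameter "${name}"`);
      }
    }
    // No test should set a parameter that was never declared.
    for (const name of Object.keys(test)) {
      if (!declared.includes(name)) {
        errors.push(`test ${i} sets undeclared parameter "${name}"`);
      }
    }
  });
  return errors;
}
```

For the example file above, `validateParamsFile` would return an empty list, since every test supplies a value for `A` and nothing else.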
A test can set a parameter to be a dictionary with the field `from_now` to get
a dynamic timestamp, where the entry is the difference from now in days. For
example, `from_now: 0` is now and `from_now: -7` would be 7 days in the past.
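The `from_now` resolution described above can be sketched as follows. This is an illustrative helper, not the actual HUD implementation; the function name and the exact output format are assumptions based on the `DateTime64(3)` literals in the example.

```typescript
// Sketch: resolve a `from_now` day offset into a concrete timestamp string
// shaped like the DateTime64(3) literals above ("YYYY-MM-DD hh:mm:ss.mmm").
function resolveFromNow(daysFromNow: number, now: Date = new Date()): string {
  const ts = new Date(now.getTime() + daysFromNow * 24 * 60 * 60 * 1000);
  // toISOString() yields "YYYY-MM-DDThh:mm:ss.mmmZ"; strip the T and Z.
  return ts.toISOString().replace("T", " ").replace("Z", "");
}
```

With this, `{"A": {"from_now": -7}}` would behave like a fixed timestamp seven days before the test run.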
Code is in `test-infra/tools/torchci/check_alerts.py`. It queries HUD, filters
out pending jobs, and then checks whether there are 2 consecutive SHAs that
have the same failing job. If there are, it will either create a new GitHub
issue or update the existing one.
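The core rule above can be sketched as a small function. Note the real logic lives in `check_alerts.py` (Python); this TypeScript sketch only illustrates the "same job failing on 2 consecutive SHAs" check, and the types and names are hypothetical.

```typescript
// Illustrative sketch of the alerting rule: given job results for a sequence
// of commits (ordered newest to oldest, pending jobs already filtered out),
// return the jobs that failed on two consecutive SHAs.
interface CommitJobs {
  sha: string;
  failedJobs: Set<string>;
}

function jobsFailingTwiceInARow(commits: CommitJobs[]): Set<string> {
  const alerting = new Set<string>();
  for (let i = 0; i + 1 < commits.length; i++) {
    for (const job of commits[i].failedJobs) {
      // The same job name failing on this SHA and the next one triggers an alert.
      if (commits[i + 1].failedJobs.has(job)) {
        alerting.add(job);
      }
    }
  }
  return alerting;
}
```

A job that fails on only one SHA (e.g. a flaky failure) would not appear in the result, which matches the 2-consecutive-SHAs rule above.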
A Meta-internal Butterfly bot rule triggers when the task is created or updated, assigning the task to the oncall so the DevX team is notified.
Butterfly bot links:
- When a new alert is created
- When pytorch/pytorch failures are edited
- When flaky test detector bot alerts are edited
If you ever need to modify deployment settings such as the OAuth callbacks or domain names, there are a few places where you need to change them. Here's a list:
- DNS Registry/Certificates (contact the OSS team)
- Environment Variables
- OAuth Project / OAuth Project Local
- Domain Management