fix: improve account analysis and reporting #15174
Conversation
Pull request overview
This pull request significantly modifies the bot detection workflow to increase detection frequency and improve accuracy. The workflow now runs hourly instead of daily, and analyzes a shorter 6-hour time window instead of 3 days. The changes include improved API pagination, better filtering logic, enhanced error handling, and issue deduplication to prevent creating multiple alert issues per day.
Changes:
- Changed execution frequency from daily to hourly with reduced time window (3 days → 6 hours)
- Added pagination support for API calls to handle large datasets
- Implemented issue deduplication to update existing daily reports instead of creating duplicates
- Added filtering for closed issues/PRs to reduce alert fatigue
- Fixed API endpoint (listByRepo → listForRepo) and added a missing let declaration
Comments suppressed due to low confidence (4)
.github/workflows/bot-detection.yml:105
- The pagination logic for review comments fetches all pages without any limit. For PRs with thousands of review comments, this could:
- Cause performance issues and long execution times
- Significantly increase API usage
- Potentially hit API rate limits when combined with hourly execution
Consider adding a reasonable limit on the number of review comments fetched per PR (e.g., first 500 comments) to balance thoroughness with performance. This is especially important given the hourly execution schedule.
const reviewComments = [];
if (github.paginate?.iterator) {
  for await (const response of github.paginate.iterator(github.rest.pulls.listReviewComments, {
    owner: context.repo.owner,
    repo: context.repo.repo,
    pull_number: pr.number,
    per_page: 100,
  })) {
    reviewComments.push(...response.data);
  }
} else {
  const { data } = await github.rest.pulls.listReviewComments({
    owner: context.repo.owner,
    repo: context.repo.repo,
    pull_number: pr.number,
    per_page: 100,
  });
  reviewComments.push(...data);
}
.github/workflows/bot-detection.yml:6
- The cron schedule has changed from daily ("17 3 * * *") to hourly ("0 * * * *"), which represents a 24x increase in workflow execution frequency. This change should be carefully considered for its impact on:
- API rate limits: Running hourly will make 24x more API calls to GitHub's API, which could hit rate limits, especially with the pagination logic fetching all comments
- Cost implications: More workflow runs means higher GitHub Actions minutes usage
- Alert fatigue: Hourly bot detection reports could create excessive noise if many new accounts are detected
Consider whether hourly detection is necessary, or if a less frequent schedule (e.g., every 6 hours) would be more appropriate. Also ensure this aligns with GitHub API rate limit constraints given the number of API calls made per run.
- cron: "0 * * * *"
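If a six-hour cadence is preferred, the schedule could look like this (an illustrative fragment, not part of the PR):

```yaml
on:
  schedule:
    # Run at minute 0 every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)
    - cron: "0 */6 * * *"
```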
.github/workflows/bot-detection.yml:274
- The issue deduplication logic searches only the first 100 open issues (per_page: 100, line 268) without pagination. In repositories with more than 100 open issues, this could fail to find an existing bot detection issue with today's title, leading to duplicate issues being created.
Consider either:
- Adding pagination to ensure all open issues are checked
- Using the Search API with a query filter for the specific title
- Adding a distinctive label (e.g., "bot-detection-daily") and searching for issues with that label to reduce the search space
try {
  const { data: existingIssues } = await github.rest.issues.listForRepo({
    owner: context.repo.owner,
    repo: context.repo.repo,
    state: 'open',
    per_page: 100,
  });
  const existing = existingIssues.find(i => i.title === title);
  if (existing?.number) {
    existingIssueNumber = existing.number;
  }
.github/workflows/bot-detection.yml:25
- The time window has been reduced from 3 days (72 hours) to 6 hours, which is a 92% reduction. Combined with the change to hourly execution, this represents a fundamental shift in detection strategy:
Previous approach: Daily scan of 3-day activity window
New approach: Hourly scan of 6-hour activity window
This change has significant implications:
- Coverage: The new approach only looks at recent 6-hour windows, potentially missing bot activity that occurs outside those windows
- Detection delay: Bots that are active but space their activity beyond 6 hours apart could be missed
- False negatives: Coordinated bot campaigns that operate in waves could slip through the narrower time window
Consider whether a 6-hour window is sufficient for detecting bot patterns, or if a longer window (e.g., 12-24 hours) would provide better detection while still supporting hourly execution.
const HOURS_BACK = 6;
const MAX_PR = 50;
const MIN_ACCOUNT_AGE_DAYS = 14;
const cutoff = new Date(Date.now() - HOURS_BACK * 60 * 60 * 1000);
- const MAX_PR = 200;
- const MIN_ACCOUNT_AGE_DAYS = 7;
+ const HOURS_BACK = 6;
+ const MAX_PR = 50;
The MAX_PR limit has been reduced from 200 to 50, which is a 75% reduction in the number of PRs scanned. Combined with the 6-hour time window, this significantly limits the scope of bot detection:
Impact: In active repositories, 50 PRs might only represent a small fraction of activity in a 6-hour window. Bots operating on PRs outside this limit won't be detected.
Consider whether 50 PRs is sufficient for your repository's activity level. If the repository receives more than 50 PR updates in 6 hours, you should increase this limit or implement a more sophisticated filtering strategy (e.g., prioritize newly created PRs over updated ones).
This issue also appears in the following locations of the same file:
- line 6
- line 21
// Skip alerting if everything found is already closed.
let hasAnyOpenItem = false;
for (const [, data] of highRiskAccounts) {
  if (data.issues?.some(i => i.state === 'open')) {
    hasAnyOpenItem = true;
    break;
  }
  if (data.prs?.some(p => p.state === 'open')) {
    hasAnyOpenItem = true;
    break;
  }
}

if (!hasAnyOpenItem) {
  console.log('No open issues or PRs from new accounts; skipping alert issue.');
  appendSummary('Bot Detection: flagged new accounts, but all related issues/PRs are closed. No alert issue created.');
  return;
}
The logic to skip alerting when all issues/PRs are closed (lines 198-215) is a good improvement that reduces alert fatigue. However, consider that accounts with closed items might still warrant investigation - a bot that quickly creates and closes spam issues could bypass detection with this logic.
Consider adding a threshold or alternative check, such as:
- Only skip if items were closed by the account owner themselves (not by moderators)
- Only skip if items have been closed for more than a certain time period
- Track the closure patterns in the report even if not creating an alert
This would help identify sophisticated bots that clean up after themselves.
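The closed-for-long-enough check could be sketched as below. The 24-hour grace period, the helper names, and the item shape (`state`, `closed_at`) are illustrative assumptions, not code from the PR.

```javascript
// Sketch: treat a closed item as "settled" only once it has been closed
// longer than a grace period; recently self-closed items still alert.
const CLOSED_GRACE_MS = 24 * 60 * 60 * 1000; // 24h, illustrative threshold

function isSettled(item, now = Date.now()) {
  if (item.state === 'open') return false; // open items always count toward alerting
  if (!item.closed_at) return false;       // closed with no timestamp: err on alerting
  return now - new Date(item.closed_at).getTime() > CLOSED_GRACE_MS;
}

// An account only stops alerting when every one of its issues and PRs is settled.
function allSettled(data, now = Date.now()) {
  const items = [...(data.issues ?? []), ...(data.prs ?? [])];
  return items.every(item => isSettled(item, now));
}
```

With this shape, `hasAnyOpenItem` would be replaced by checking `!allSettled(data)`, so a bot that closes its own spam within the grace window still triggers the alert.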
const title = `🚨 HIGH RISK: Brand New Accounts — ${today}`;
let existingIssueNumber;
Inconsistent indentation: lines 260-261 are indented with additional spaces compared to line 259. These lines should align with the same indentation level as line 259 since they are all part of the same code block. This creates inconsistent formatting and could be confusing for maintainers.
const issueComments = [];
if (github.paginate?.iterator) {
  for await (const response of github.paginate.iterator(github.rest.issues.listComments, {
    owner: context.repo.owner,
    repo: context.repo.repo,
    issue_number: pr.number,
    per_page: 100,
  })) {
    issueComments.push(...response.data);
  }
} else {
  const { data } = await github.rest.issues.listComments({
    owner: context.repo.owner,
    repo: context.repo.repo,
    issue_number: pr.number,
    per_page: 100,
  });
  issueComments.push(...data);
}
The pagination logic for issue comments and review comments fetches all pages without any limit. For PRs with thousands of comments, this could:
- Cause performance issues and long execution times
- Significantly increase API usage
- Potentially hit API rate limits when combined with hourly execution
Consider adding a reasonable limit on the number of comments fetched per PR (e.g., first 500 comments) to balance thoroughness with performance. This is especially important given the hourly execution schedule.
This issue also appears on line 87 of the same file.
@@ -135,21 +182,41 @@ jobs:
    per_page: 100,
  });
  data.prs = prList
-   .filter(p => p.user?.login === login)
+   .filter(p => p.user?.login === login && new Date(p.created_at) >= cutoff)
    .map(p => ({
      number: p.number,
      title: p.title,
      state: p.state,
      created_at: p.created_at,
      html_url: p.html_url,
    }));
The PR fetching logic fetches all PRs in the repository with only per_page: 100, then client-side filters for the specific user. This approach has two problems:
- It only fetches the first 100 PRs total (not paginated), so it will miss PRs from the user if they're not in the most recent 100 PRs
- It's inefficient - fetching all PRs when you only need one user's PRs
The API doesn't support filtering by creator for pulls.list, but you could use the Search API instead: github.rest.search.issuesAndPullRequests with query repo:owner/repo type:pr author:login which supports pagination and filtering server-side. Alternatively, if the user typically has few PRs, consider using github.rest.pulls.list with pagination and filtering client-side, but fetch more than 100 results.
This issue also appears on line 263 of the same file.
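The Search API variant could be sketched as follows. The helper name is an assumption; `search.issuesAndPullRequests` is the real endpoint, and the `repo:... type:pr author:...` qualifiers follow GitHub's search syntax.

```javascript
// Sketch: fetch one user's PRs server-side via the Search API instead of
// listing every PR in the repository and filtering client-side.
async function searchUserPrs(github, owner, repo, login) {
  const q = `repo:${owner}/${repo} type:pr author:${login}`;
  const { data } = await github.rest.search.issuesAndPullRequests({
    q,
    sort: 'created',
    order: 'desc',
    per_page: 100,
  });
  return data.items.map(p => ({
    number: p.number,
    title: p.title,
    state: p.state,
    created_at: p.created_at,
    html_url: p.html_url,
  }));
}
```

One caveat worth noting: the Search API has its own, lower rate limit, so with hourly execution the per-run number of search calls would still need watching.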
} catch (e) {
  console.log('Issue create/update with labels failed; retrying without labels...');
  if (existingIssueNumber) {
    await github.rest.issues.update({
      owner: context.repo.owner,
      repo: context.repo.repo,
      issue_number: existingIssueNumber,
      body,
    });
  } else {
    await github.rest.issues.create({
      owner: context.repo.owner,
      repo: context.repo.repo,
      title,
      body,
    });
  }
}
The error handling in the retry logic (catch block at line 296) doesn't prevent errors from propagating. If the retry attempt also fails, the error will be unhandled and could crash the workflow. Consider:
- Wrapping the retry attempts in their own try-catch blocks
- Logging the final error if all retry attempts fail
- Using appendSummary to report the failure in the workflow summary
For example, the inner attempts (lines 299-311) should have their own try-catch to ensure failures are logged properly.
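The nested try/catch could be structured roughly like this. The helper name and the callback-based shape are illustrative; in the workflow the two attempts would be the existing create/update calls with and without labels, and the summarize callback would be the workflow's own appendSummary.

```javascript
// Sketch: run the labeled attempt, and on failure retry without labels
// inside its own try/catch so a second failure is logged and summarized
// instead of crashing the workflow run.
async function createOrUpdateWithFallback(attemptWithLabels, attemptWithoutLabels, log, summarize) {
  try {
    await attemptWithLabels();
  } catch (e) {
    log(`Create/update with labels failed (${e.message}); retrying without labels...`);
    try {
      await attemptWithoutLabels();
    } catch (retryError) {
      log(`Retry without labels also failed: ${retryError.message}`);
      summarize('Bot Detection: failed to create or update the alert issue.');
    }
  }
}
```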
@@ -1,10 +1,9 @@
 name: Bot Detection
The workflow description line with DOI reference has been removed. This removes valuable academic attribution and context about the bot detection methodology. Consider keeping this reference as a comment within the workflow or in accompanying documentation, as it provides scientific credibility and helps maintainers understand the underlying detection approach.
name: Bot Detection
# Bot detection heuristics used in this workflow are based on published research.
# For the full methodology and DOI reference, see docs/bot-detection-methodology.md.