-
-
Notifications
You must be signed in to change notification settings - Fork 1
Add GitHub repositories #233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds GitHub repositories ingestion: DB migration, Postgres service, formatter/helpers, Octokit-based fetch implementation with runnable scripts, schedule/ingester wiring, and RSpec tests (includes a duplicated test helper definition). Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Runner as Script
participant Impl as FetchRepositoriesFromGithub
participant GH as Octokit Client
participant Fmt as RepositoriesFormatter
participant SS as SharedStorage::Postgres
Runner->>Impl: execute
Impl->>Impl: initialize_client
Impl->>GH: fetch org_repos (auto-pagination)
GH-->>Impl: repositories[]
loop format repos
Impl->>Fmt: format(repo)
Fmt-->>Impl: normalized_record
end
alt no content
Impl-->>Runner: success { type: github_repository, content: [] }
else content present
Impl->>Impl: paginate_and_write (page_size=100)
loop each page
Impl->>SS: write(record { type, page_idx, total_pages, total_records, content })
SS-->>Impl: ack
end
Impl-->>Runner: done
end
note right of Impl: errors formatted and returned
sequenceDiagram
autonumber
participant Ingester as WarehouseIngester
participant Map as SERVICES
participant Service as Services::Postgres::GithubRepository
Ingester->>Map: lookup('github_repository')
Map-->>Ingester: { service: Service, external_key: 'external_github_repository_id' }
Ingester->>Service: upsert/insert/update using external_key
Service-->>Ingester: result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (10)
src/implementations/warehouse_ingester.rb (1)
100-111: Minor: prefer structured logging and include context on errors.Printing to stdout loses structure and context. Consider using Logger with type context for easier ops triage.
Example:
- puts "[WarehouseIngester ERROR][#{@type}] #{e.class}: #{e}" + Logger.new($stdout).error("[WarehouseIngester][#{@type}] #{e.class}: #{e.message}")src/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rb (1)
25-31: Prefer logging backtrace for operational debugging.Capturing only e.message can hinder troubleshooting in prod.
- Logger.new($stdout).info(e.message) + Logger.new($stdout).error([e.message, *e.backtrace].join("\n"))src/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rb (1)
25-31: Include backtrace in error logs for parity and visibility.Same improvement as the Kommit Co script.
- Logger.new($stdout).info(e.message) + Logger.new($stdout).error([e.message, *e.backtrace].join("\n"))spec/services/postgres/github_repository_spec.rb (2)
89-99: Minor: assert query count pre/post delete for clarity.You already assert record gone; adding an explicit pre-count can make intent clearer, but optional.
102-116: Optional: add a test to prevent duplicates via external_github_repository_id.If you add a unique index in the migration, include a spec that attempts to insert the same external_github_repository_id twice and expects an error.
spec/services/postgres/test_db_helpers.rb (1)
238-263: Add a unique index on external_github_repository_id to prevent duplicates.This improves idempotency and mirrors typical warehouse constraints when upserting by external IDs.
If the migration also defines this index, mirror it here; if not, consider adding it there as well.
BigInt :external_github_repository_id, null: false + index :external_github_repository_id, unique: truesrc/use_cases_execution/schedules.rb (1)
154-161: Confirm concurrent 06:20 runs for both issues fetches.Two entries run at 06:20. If the orchestrator executes sequentially, fine; if not, consider staggering to reduce contention and rate-limits.
Example stagger:
- { path: "#{__dir__}/warehouse/github/fetch_kommitters_issues_from_github.rb", time: ['06:20'] }, + { path: "#{__dir__}/warehouse/github/fetch_kommitters_issues_from_github.rb", time: ['06:22'] },src/implementations/fetch_repositories_from_github.rb (1)
6-6: Simplify require_relative path.Not a blocker, but this is cleaner and less brittle.
-require_relative '../../src/utils/warehouse/github/repositories_formatter' +require_relative '../utils/warehouse/github/repositories_formatter'src/services/postgres/github_repository.rb (2)
48-52: Avoid puts for errors; use warn (or a logger) and keep raising.STDOUT noise in services is undesirable; warn is more appropriate and preserves behavior.
- puts "[GithubRepository Service ERROR] #{error.class}: #{error.message}" - puts "Backtrace: #{error.backtrace.join("\n")}" + warn "[GithubRepository Service ERROR] #{error.class}: #{error.message}" + warn "Backtrace:\n#{error.backtrace.join("\n")}" raise error
42-44: Add a convenience finder by external ID to align with ingestion flows.This reduces duplication at call sites that upsert by external_github_repository_id.
def query(conditions = {}) query_item(TABLE, conditions) end + + def find_by_external_id(external_id) + query_item(TABLE, external_github_repository_id: external_id).first + end
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
db/warehouse_migrations/20250923192219_create_github_repositories.rb(1 hunks)spec/services/postgres/github_repository_spec.rb(1 hunks)spec/services/postgres/test_db_helpers.rb(1 hunks)spec/use_cases_execution/warehouse/github/fetch_repositories_scripts_spec.rb(1 hunks)src/implementations/fetch_repositories_from_github.rb(1 hunks)src/implementations/warehouse_ingester.rb(2 hunks)src/services/postgres/github_repository.rb(1 hunks)src/use_cases_execution/schedules.rb(1 hunks)src/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rb(1 hunks)src/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rb(1 hunks)src/use_cases_execution/warehouse/warehouse_ingester.rb(1 hunks)src/utils/warehouse/github/base.rb(3 hunks)src/utils/warehouse/github/repositories_formatter.rb(1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
src/use_cases_execution/*/**/*.rb
📄 CodeRabbit inference engine (CLAUDE.md)
src/use_cases_execution/*/**/*.rb: Use case orchestration files that wire implementations together should be placed in src/use_cases_execution/[use_case]/
All use cases must use Bas::SharedStorage with PostgreSQL as the coordination layer, with each pipeline step reading from and writing to the database using specific tags (FetchX, FormatX, NotifyX, GarbageCollector)
Files:
src/use_cases_execution/warehouse/warehouse_ingester.rbsrc/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rbsrc/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rb
**/*.rb
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.rb: Use HTTParty for HTTP requests
Do not use Rails conventions; use explicit imports for modules (e.g., Date, DateTime)
Follow Ruby conventions and best practices
Add relevant documentation comments to explain how users can use and understand the code
Add an empty line at the end of files
Use clear variable naming
Implement proper error handling
Files:
src/use_cases_execution/warehouse/warehouse_ingester.rbsrc/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rbspec/use_cases_execution/warehouse/github/fetch_repositories_scripts_spec.rbsrc/services/postgres/github_repository.rbsrc/utils/warehouse/github/repositories_formatter.rbsrc/utils/warehouse/github/base.rbspec/services/postgres/test_db_helpers.rbsrc/implementations/warehouse_ingester.rbsrc/implementations/fetch_repositories_from_github.rbsrc/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rbdb/warehouse_migrations/20250923192219_create_github_repositories.rbspec/services/postgres/github_repository_spec.rbsrc/use_cases_execution/schedules.rb
src/implementations/**/*.rb
📄 CodeRabbit inference engine (CLAUDE.md)
Reusable bot implementations should be placed in src/implementations/ and inherit from Bas::Bot::Base
Files:
src/implementations/warehouse_ingester.rbsrc/implementations/fetch_repositories_from_github.rb
src/use_cases_execution/schedules.rb
📄 CodeRabbit inference engine (CLAUDE.md)
src/use_cases_execution/schedules.rb: Centralized scheduling configuration should be placed in src/use_cases_execution/schedules.rb
Schedules must be defined using time, day, or interval keys in src/use_cases_execution/schedules.rb
Files:
src/use_cases_execution/schedules.rb
🧠 Learnings (5)
📚 Learning: 2025-07-21T22:59:03.954Z
Learnt from: CR
PR: kommitters/bas_use_cases#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-21T22:59:03.954Z
Learning: Applies to src/use_cases_execution/*/**/*.rb : All use cases must use Bas::SharedStorage with PostgreSQL as the coordination layer, with each pipeline step reading from and writing to the database using specific tags (FetchX, FormatX, NotifyX, GarbageCollector)
Applied to files:
src/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rbspec/use_cases_execution/warehouse/github/fetch_repositories_scripts_spec.rbsrc/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rb
📚 Learning: 2025-07-15T03:46:00.750Z
Learnt from: juanhiginio
PR: kommitters/bas_use_cases#158
File: spec/implementations/deploy_process_in_operaton/deploy_process_in_operaton_spec.rb:33-42
Timestamp: 2025-07-15T03:46:00.750Z
Learning: In the bas_use_cases repository, implementation test files in `spec/implementations/` follow a minimal testing pattern with a single test that verifies `execute` doesn't return nil, using mocked shared storage dependencies. This is the established convention across all implementation files rather than comprehensive edge case testing.
Applied to files:
spec/use_cases_execution/warehouse/github/fetch_repositories_scripts_spec.rb
📚 Learning: 2025-07-21T22:59:03.954Z
Learnt from: CR
PR: kommitters/bas_use_cases#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-21T22:59:03.954Z
Learning: Applies to src/implementations/**/*.rb : Reusable bot implementations should be placed in src/implementations/ and inherit from Bas::Bot::Base
Applied to files:
src/implementations/fetch_repositories_from_github.rb
📚 Learning: 2025-07-21T22:59:03.954Z
Learnt from: CR
PR: kommitters/bas_use_cases#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-21T22:59:03.954Z
Learning: Applies to src/use_cases_execution/schedules.rb : Centralized scheduling configuration should be placed in src/use_cases_execution/schedules.rb
Applied to files:
src/use_cases_execution/schedules.rb
📚 Learning: 2025-07-21T22:59:03.954Z
Learnt from: CR
PR: kommitters/bas_use_cases#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-21T22:59:03.954Z
Learning: Applies to src/use_cases_execution/schedules.rb : Schedules must be defined using time, day, or interval keys in src/use_cases_execution/schedules.rb
Applied to files:
src/use_cases_execution/schedules.rb
🧬 Code graph analysis (6)
src/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rb (2)
src/use_cases_execution/warehouse/config.rb (1)
kommit_co(69-75)src/implementations/fetch_records_from_work_logs.rb (1)
new(43-129)
src/services/postgres/github_repository.rb (1)
src/services/postgres/base.rb (6)
transaction(89-91)insert_item(62-72)update_item(74-83)delete_item(85-87)find_item(58-60)query_item(52-56)
src/utils/warehouse/github/repositories_formatter.rb (1)
src/utils/warehouse/github/base.rb (7)
extract_id(29-31)extract_name(37-39)extract_string(85-87)extract_boolean(77-79)extract_number(81-83)format_repository_owner_as_json(114-123)extract_created_at(45-47)
src/implementations/fetch_repositories_from_github.rb (2)
src/implementations/fetch_records_from_work_logs.rb (1)
new(43-129)src/utils/warehouse/github/repositories_formatter.rb (2)
format(12-37)format(16-36)
src/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rb (2)
src/use_cases_execution/warehouse/config.rb (1)
kommiters(61-67)src/implementations/fetch_records_from_work_logs.rb (1)
new(43-129)
spec/services/postgres/github_repository_spec.rb (2)
spec/services/postgres/test_db_helpers.rb (1)
create_github_repositories_table(238-262)src/services/postgres/github_repository.rb (5)
insert(18-22)find(38-40)update(24-30)delete(32-36)query(42-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Tests
🔇 Additional comments (11)
db/warehouse_migrations/20250923192219_create_github_repositories.rb (2)
6-7: Verify Sequel column types compatibility (BigInt/jsonb DSL).Sequel typically uses Bignum (or Integer size: 8) rather than BigInt, and column :owner, :jsonb rather than jsonb :owner, unless extensions adapt these shortcuts. Confirm your Sequel setup supports these type helpers, or switch to portable forms.
Optionally adjust:
- BigInt :external_github_repository_id, null: false + Bignum :external_github_repository_id, null: falseand
- jsonb :owner, null: true + column :owner, :jsonb, null: true
6-6: Ensure pgcrypto is enabled for gen_random_uuid().gen_random_uuid() requires the pgcrypto extension. Make sure it’s enabled in an earlier migration or enable it here before create_table.
Example (if needed at the top of up):
run 'CREATE EXTENSION IF NOT EXISTS pgcrypto;'src/use_cases_execution/warehouse/warehouse_ingester.rb (1)
24-27: LGTM: repositories fetch step included in the ingester scope.Adding FetchRepositoriesFromGithub to the allowed tags aligns the pipeline with the new ingestion flow.
src/implementations/warehouse_ingester.rb (1)
15-18: LGTM: service mapping for github_repository is correct.Service and external_key align with the migration and Postgres service.
src/utils/warehouse/github/repositories_formatter.rb (1)
16-35: LGTM: field mapping aligns with schema and helpers.Keys and defaults map cleanly to the migration; owner formatting delegated appropriately.
src/use_cases_execution/warehouse/github/fetch_kommit_co_repositories_from_github.rb (1)
9-15: Tag and filter setup looks correct for shared storage.Using tag 'FetchRepositoriesFromGithub' and archived=false matches the orchestration guidelines.
Consider removing require 'bas/shared_storage/default' if unused to keep dependencies minimal.
src/use_cases_execution/warehouse/github/fetch_kommitters_repositories_from_github.rb (1)
9-15: LGTM: shared storage orchestration and tag usage are consistent.Matches the Kommit Co script; correct tag and filter.
Remove require 'bas/shared_storage/default' if not needed.
spec/services/postgres/github_repository_spec.rb (2)
26-63: LGTM: solid CRUD coverage including JSON owner, flags, and counters.Happy path insert assertions are thorough and align with schema.
82-86: Good negative test on update preconditions.Explicitly asserting the ArgumentError message tightens contract.
src/services/postgres/github_repository.rb (1)
24-36: Base.save_history already guards against missing history tables
Base#save_history returns early unless history_enabled? (src/services/postgres/base.rb), so it will not write to github_repositories_history unless the service defines HISTORY_TABLE and HISTORY_FOREIGN_KEY.Likely an incorrect or invalid review comment.
spec/use_cases_execution/warehouse/github/fetch_repositories_scripts_spec.rb (1)
53-56: Don't change — :kommiters is correctConfig::Github defines self.kommiters (src/use_cases_execution/warehouse/config.rb:61) and multiple scripts/specs call it; replacing with :kommitters would be incorrect.
Likely an incorrect or invalid review comment.
cammv21
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, i just leave you a comment. For the rest LGTM 🚀
cammv21
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
Description
Add GitHub repositories ingestion scaffolding for the warehouse pipeline: database schema, and runner script tests.
What’s Changed
db/warehouse_migrations/20250923192219_create_github_repositories.rbdefininggithub_repositorieswith:external_github_repository_id,name,language,description,html_url,owner(jsonb)is_private,is_fork,is_archived,is_disabledwatchers_count,stargazers_count,forks_countcreation_timestamp, and DBcreated_at/updated_atcreate_github_repositories_tableinspec/services/postgres/test_db_helpers.rbmirroring the migration (for in-memory SQLite specs).Fixes #232
Type of change
Please delete options that are not relevant.
Checklist
Summary by CodeRabbit
New Features
Chores
Tests