A robust and extensible SDK for ingesting data from various external sources such as Git repositories, Google Drive, and HTTP endpoints.
Designed to be stateless, the crawlers integrate seamlessly with an external Scheduler SDK for orchestration and task management.
- Introduction
- Features
- Architecture Overview
- Installation
- Usage
- Crawler Types
- Webhook Processing & Utilities
- Testing
- License
The Crawler SDK is a collection of specialized data source components built with TypeScript and Node.js. Its primary function is to efficiently extract and standardize data from diverse external systems. Designed with statelessness in mind, these crawlers are intended to be invoked by an external Scheduler SDK, which manages task orchestration, scheduling, and persistent state.
This SDK provides modular, testable, and extensible components for integrating various data sources into a larger data ingestion pipeline.
- Multi-Source Data Ingestion: Supports Git repositories, Google Drive, and HTTP/HTTPS endpoints.
- Stateless Design: Crawlers are stateless, making them highly scalable and resilient.
- Full Data Scans: Capability to perform complete initial data extractions.
- Incremental Data Syncs (Delta Sync): For supported sources (e.g., Google Drive), fetches only changes since the last successful ingestion using continuation tokens.
- Webhook-Driven Triggers: Processes incoming webhooks (e.g., GitHub push events, Google Drive change notifications) to trigger targeted ingestion.
- Flexible Configuration: Configurable via YAML files (for DataSource initialization) and dynamic payloads (for task execution).
- Standardized Output: Transforms extracted data into a consistent `IngestionData` format.
- Temporary Local Storage: Git operations utilize temporary local paths, ensuring no persistent local data is left behind.
- Robust Error Handling & Logging: Detailed logging and graceful error management during all stages.
The Crawler SDK operates as a set of stateless workers within a larger, loosely coupled architecture. It is designed to be invoked directly by an external Scheduler SDK.
- Statelessness: Each crawler instance created for a task execution is ephemeral. It does not maintain any long-term internal state. All necessary information (configuration, authentication, continuation tokens, webhook payloads) is passed to it at the point of invocation.
- Single Responsibility: Each crawler (`git-crawler.ts`, `gdrive-crawler.ts`, `http-crawler.ts`) focuses solely on data extraction from its specific source.
- Separation of Concerns: Webhook management (registration/deregistration) and payload processing are handled by dedicated utilities within the SDK, often invoked by the Scheduler SDK.
- The Scheduler SDK (the orchestrator) determines when a task needs to run (e.g., cron schedule, webhook event, manual trigger).
- The Scheduler SDK retrieves the full `taskDefinition` (including source configuration, authentication details, and the latest continuation tokens) from its persistent database.
- The Scheduler SDK directly instantiates a new instance of the relevant Crawler SDK component (e.g., `new GitCrawlerDataSource(...)`) and invokes its `execute()` method with a comprehensive `initialPayload`.
- The invoked Crawler SDK instance performs the data extraction from the external source based only on the provided payload.
- Upon completion, the Crawler SDK returns the `IngestionData` and a `GSStatus` (including any updated continuation tokens) back to the Scheduler SDK.
- The Scheduler SDK then persists these updated tokens and the task status in its database for future runs.
This architecture ensures that the Crawler SDK remains lightweight, highly scalable, and resilient, as any instance can process any valid task request without needing prior history or complex state synchronization.
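To make this boundary concrete, the sketch below approximates what a Scheduler-side invocation passes in and gets back. The payload field names here are illustrative assumptions; only `GSStatus` (from `@godspeedsystems/core`) and the `IngestionData` interface shown later in this README are actual SDK types.

```typescript
import { GSStatus } from '@godspeedsystems/core';

// Illustrative shapes only; the SDK's real type definitions may differ.
interface CrawlerInitialPayload {
  operation?: 'full_scan' | 'webhook'; // assumed discriminator for the kind of run
  webhookPayload?: unknown;            // raw webhook body when webhook-triggered
  startPageToken?: string;             // continuation token persisted from the previous run
  auth?: Record<string, unknown>;      // credentials supplied by the Scheduler SDK
}

// execute() resolves to a GSStatus; as shown in the Usage section below, its `data`
// field carries the extracted IngestionData plus any updated continuation tokens.
type CrawlerRunResult = GSStatus;
```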
To use the Crawler SDK, you need to set up your Node.js environment and install the required dependencies.
- Node.js (LTS version recommended)
- npm (Node Package Manager) or pnpm
Navigate to your project's root directory (where package.json is located) in your terminal and install the dependencies:
```bash
# Install runtime dependencies
npm install @godspeedsystems/core simple-git axios uuid googleapis google-auth-library cheerio

# Install development dependencies (for testing and TypeScript compilation)
npm install --save-dev jest ts-jest @types/jest ts-node @types/axios @types/cheerio
```

The Crawler SDK components are designed to be invoked by an external orchestrator (like the Scheduler SDK).
Your Scheduler SDK will typically:
- Load a `taskDefinition` from its database.
- Based on `taskDefinition.source.pluginType`, dynamically determine which DataSource class to use (e.g., `git-crawler.ts`'s `DataSource`).
- Create an instance: `const crawlerInstance = new DataSource({ config: taskDefinition.source.config });`
- Initialize the crawler's client: `await crawlerInstance.initClient();`
- Execute the crawl: `const result = await crawlerInstance.execute(context, initialPayload);`
- Process `result.data.data` (the `IngestionData`) and `result.data.startPageToken` / `nextPageToken` for persistence (see the sketch below).
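Putting those steps together, a Scheduler-side call might look roughly like the sketch below. The import path, the `any`-typed parameters, and the result field access are assumptions based on the steps above, not the SDK's exact API surface.

```typescript
// Sketch of an orchestrator-side invocation (assumed import path and taskDefinition shape).
import { DataSource } from './datasources/types/git-crawler';

async function runCrawlTask(taskDefinition: any, context: any, initialPayload: any) {
  // 1. Instantiate the crawler with the per-task source configuration.
  const crawlerInstance = new DataSource({ config: taskDefinition.source.config });

  // 2. Initialize the underlying client (git, Google Drive, or HTTP).
  await crawlerInstance.initClient();

  // 3. Execute the crawl; everything the crawler needs is passed in the payload.
  const result = await crawlerInstance.execute(context, initialPayload);

  // 4. Hand IngestionData to the pipeline and persist any continuation tokens.
  const ingestionData = result.data?.data;
  const nextToken = result.data?.startPageToken ?? result.data?.nextPageToken;
  return { ingestionData, nextToken };
}
```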
Each crawler type is configured via a `GitCrawlerConfig`, `GoogleDriveCrawlerConfig`, or `HttpCrawlerConfig` object. These configurations are typically defined in YAML files (e.g., `config/datasources/git-crawler.yaml`) and are used primarily to initialize the DataSource type when the system starts up. They also serve as a reference for developers to understand the available configuration parameters for each crawler.
Crucially, users do not have to explicitly create a separate YAML file for each new data source instance or task they register. Instead:
- When an Administrator/User defines a new ingestion task (e.g., via an API call to the Scheduler SDK), the specific configuration for that task's source is provided as a JSON object within the `taskDefinition`.
- This `taskDefinition` is then persisted in the database (an example follows below).
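For illustration only, a persisted `taskDefinition` might carry its source configuration like this; the outer fields (`id`, `schedule`) are hypothetical, while the nested `config` mirrors the YAML examples below.

```typescript
// Hypothetical taskDefinition as stored by the Scheduler SDK; only source.config
// mirrors the documented crawler configuration, the rest is illustrative.
const taskDefinition = {
  id: 'task-git-docs',                         // assumed identifier
  schedule: '0 * * * *',                       // assumed cron expression
  source: {
    pluginType: 'git-crawler',                 // selects which DataSource class to use
    config: {
      repoUrl: 'https://github.com/org/repo.git',
      branch: 'main',
      depth: 1,
    },
  },
};
```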
Example: Git Crawler (`datasources/git-crawler.yaml`)

```yaml
type: git-crawler
branch: main
depth: 1
```

Example: Google Drive Crawler (`datasources/gdrive-crawler.yaml`)

```yaml
type: googledrive-crawler
authType: service_account
serviceAccountKeyPath: ./path/to/service-account-key.json
pageSize: 100
```

Example: HTTP Crawler (`datasources/http-crawler.yaml`)

```yaml
type: http-crawler
maxDepth: 2
recursiveCrawling: true
allowedDomains:
  - example.com
```

All crawlers transform their extracted data into a consistent `IngestionData` interface, making it easy for downstream processing.
```typescript
export interface IngestionData {
  id: string;
  content: string | Buffer | object;
  metadata?: {
    filename?: string;
    relativePath?: string;
    changeType?: 'UPSERT' | 'DELETE' | 'UNKNOWN' | 'full_scan';
  };
  fetchedAt?: Date;
  url?: string;
  statusCode?: number;
}
```

Purpose: Ingests code and metadata from Git repositories.
Configuration (`GitCrawlerConfig`):

- `repoUrl` (mandatory)
- `branch`
- `depth`
Authentication:
Relies on the Git repository's accessibility: public repositories work out of the box, while private repositories rely on SSH/HTTPS credentials handled through `simple-git`'s environment setup (often supplied by the Scheduler SDK as a pre-configured `simple-git` instance or via environment variables). Webhook registration uses a GitHub Personal Access Token (PAT) via `git-api-utils.ts`.
Functionality:
- Full Scan: Clones the entire repository to a temporary directory (`os.tmpdir()` + `uuidv4()`), reads all files, and then deletes the temporary directory (see the sketch below).
- Webhook Trigger: On a GitHub push webhook, it clones/pulls to a temporary directory, performs a `git reset --hard` to the specific commit, reads the added and modified files from the webhook's `head_commit` payload, and marks removed files. The temporary directory is then deleted.
- Graceful Handling: Logs warnings for a missing `head_commit` in webhooks or file read errors during full scans/webhooks, but does not crash.
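A minimal sketch of the full-scan pattern described above, using `simple-git`, `os.tmpdir()`, and `uuid`. The function name and the flat file walk are illustrative, not the crawler's actual implementation.

```typescript
import { simpleGit } from 'simple-git';
import { promises as fs } from 'fs';
import * as os from 'os';
import * as path from 'path';
import { v4 as uuidv4 } from 'uuid';

// Illustrative full scan: clone into a unique temp directory, read every file, always clean up.
// Requires Node 18.17+ for fs.readdir's `recursive` option.
async function gitFullScan(repoUrl: string, branch = 'main', depth = 1) {
  const tmpDir = path.join(os.tmpdir(), uuidv4());
  const git = simpleGit();
  try {
    await git.clone(repoUrl, tmpDir, ['--depth', String(depth), '--branch', branch]);
    const relPaths = await fs.readdir(tmpDir, { recursive: true });
    const records = [];
    for (const rel of relPaths) {
      if (rel.startsWith('.git')) continue;            // skip git internals
      const abs = path.join(tmpDir, rel);
      if (!(await fs.stat(abs)).isFile()) continue;    // skip directories
      records.push({
        id: rel,
        content: await fs.readFile(abs, 'utf-8'),
        metadata: { relativePath: rel, changeType: 'full_scan' as const },
      });
    }
    return records;
  } finally {
    // Temporary Local Storage guarantee: nothing persists after the scan.
    await fs.rm(tmpDir, { recursive: true, force: true });
  }
}
```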
Purpose:
Ingests files and changes from Google Drive folders.
Configuration (`GoogleDriveCrawlerConfig`):

- `folderId` (mandatory)
- `authType` (`service_account`)
- `serviceAccountKeyPath` or `serviceAccountKey` (JSON key content)
- `pageSize`
Authentication:
Uses Google Service Account (JWT) for secure server-to-server authentication.
Functionality:
- Full Scan:
  - Lists all files in the specified `folderId`
  - Fetches their metadata and content
  - Supports resumable scans using `nextPageToken`
- Webhook Trigger (Delta Sync):
  - Triggered by a Google Drive webhook
  - Utilizes a `startPageToken` (provided in the invocation payload) to call Google Drive's `changes.list()` API
  - Fetches only incremental changes
  - Fetches content for added or modified files
  - For removed or trashed files, records `DELETE` `changeType` metadata without downloading content
- Token Management:
  - Returns the `newStartPageToken` from `changes.list` to the Scheduler for persistence
  - Ensures the next sync starts from the correct point (see the sketch below)
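A rough sketch of the delta-sync call described above, using the `googleapis` client with a service-account JWT. The function name, scope, and return shape are assumptions; error handling is omitted for brevity.

```typescript
import { google } from 'googleapis';

// Illustrative delta sync: list only the changes since the persisted startPageToken.
async function fetchDriveChanges(serviceAccountKeyPath: string, startPageToken: string) {
  const auth = new google.auth.JWT({
    keyFile: serviceAccountKeyPath,
    scopes: ['https://www.googleapis.com/auth/drive.readonly'],
  });
  const drive = google.drive({ version: 'v3', auth });

  const changes: any[] = [];
  let pageToken: string | undefined = startPageToken;
  let newStartPageToken: string | undefined;

  while (pageToken) {
    const res = await drive.changes.list({
      pageToken,
      fields: 'newStartPageToken, nextPageToken, changes',
    });
    changes.push(...(res.data.changes ?? []));
    newStartPageToken = res.data.newStartPageToken ?? newStartPageToken;
    pageToken = res.data.nextPageToken ?? undefined;
  }

  // The caller (Scheduler SDK) persists newStartPageToken for the next sync.
  return { changes, newStartPageToken };
}
```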
Purpose:
Crawls data from publicly accessible web pages and resources.
Configuration (`HttpCrawlerConfig`):

- `startUrl` (mandatory)
- `maxDepth`
- `recursiveCrawling`
- `allowedDomains`
- `excludePaths`
- `includePaths`
- `requestTimeoutMs`
- `userAgent`
- `headers`
Authentication:
Typically targets public resources; for authenticated resources, credentials would be supplied via the `headers` option in the config.
Functionality:
- Uses `axios` for HTTP/HTTPS requests.
- If `recursiveCrawling` is enabled, it parses HTML content using `cheerio` to extract links (`a[href]`) and follows them up to `maxDepth`, respecting domain and path filters (see the sketch below).
Output:
Provides fetched content and metadata, including `url`, `statusCode`, and `contentType`.
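A condensed sketch of the recursive fetch-and-extract loop described above, using `axios` and `cheerio`. The function signature and filtering logic are simplified assumptions rather than the crawler's real code.

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

// Illustrative recursive crawl: fetch a page, collect links, and follow them up to maxDepth.
async function crawl(startUrl: string, maxDepth: number, allowedDomains: string[]) {
  const visited = new Set<string>();
  const results: { url: string; statusCode: number; content: string }[] = [];

  async function visit(url: string, depth: number) {
    if (depth > maxDepth || visited.has(url)) return;
    visited.add(url);

    const res = await axios.get(url, { timeout: 10_000, responseType: 'text' });
    results.push({ url, statusCode: res.status, content: res.data });

    // Only HTML pages are parsed for further links.
    if (!String(res.headers['content-type'] ?? '').includes('text/html')) return;

    const $ = cheerio.load(res.data);
    for (const el of $('a[href]').toArray()) {
      const next = new URL($(el).attr('href')!, url);
      if (allowedDomains.includes(next.hostname)) {
        await visit(next.toString(), depth + 1);
      }
    }
  }

  await visit(startUrl, 0);
  return results;
}
```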
These components facilitate the interaction between the Scheduler SDK and external services for webhook management.
- Role: A central utility invoked by the Scheduler SDK's webhook listener. It is responsible for the initial parsing and then full validation of incoming webhook payloads.
- Validation (see the sketch below):
  - For GitHub, it verifies `X-Hub-Signature` headers against a stored secret.
  - For Google Drive, it validates `x-goog-channel-id` against the stored `externalWebhookId`.
- Payload Extraction: Extracts the relevant `externalResourceId` (e.g., Git repo URL, GDrive folder ID) and `changeType` (e.g., UPSERT, DELETE). It also gracefully handles GitHub "ping" events.
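For illustration, the GitHub signature check typically boils down to an HMAC comparison over the raw request body, along the lines of the sketch below (function names are assumed; the actual webhook-processor may differ in detail).

```typescript
import * as crypto from 'crypto';

// Illustrative GitHub webhook validation: HMAC-SHA1 of the raw body, compared in constant time.
// (GitHub also sends X-Hub-Signature-256 for an HMAC-SHA256 variant.)
function isValidGithubSignature(rawBody: Buffer, signatureHeader: string, secret: string): boolean {
  const expected = 'sha1=' + crypto.createHmac('sha1', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Illustrative Google Drive check: the channel id header must match the stored registration.
function isValidDriveNotification(
  headers: Record<string, string | undefined>,
  storedExternalWebhookId: string,
): boolean {
  return headers['x-goog-channel-id'] === storedExternalWebhookId;
}
```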
- Role: Acts as a generic dispatcher. The Scheduler SDK calls its `registerWebhook()` or `deregisterWebhook()` methods.
- Dispatching: Based on the `pluginType` (e.g., `git-crawler`, `googledrive-crawler`), it routes the request to the appropriate service-specific utility (e.g., `git-api-utils.ts`, `gdrive-api-utils.ts`), as sketched below.
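A minimal sketch of the dispatching idea; the utility import paths and function signatures are assumptions for illustration.

```typescript
// Illustrative dispatcher: route webhook registration by pluginType.
import * as gitApiUtils from './functions/git-api-utils';       // assumed path
import * as gdriveApiUtils from './functions/gdrive-api-utils'; // assumed path

export async function registerWebhook(pluginType: string, taskDefinition: any) {
  switch (pluginType) {
    case 'git-crawler':
      return gitApiUtils.registerWebhook(taskDefinition);
    case 'googledrive-crawler':
      return gdriveApiUtils.registerWebhook(taskDefinition);
    default:
      throw new Error(`No webhook utility registered for pluginType "${pluginType}"`);
  }
}
```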
- Role: Contains the specific API calls to Google Drive for managing webhook subscriptions.
- Functionality (see the sketch below):
  - `registerWebhook()` – calls `drive.files.watch()` and returns Google's `externalId` (channel ID), `channelResourceId`, and `startPageToken`.
  - `deregisterWebhook()` – calls `drive.channels.stop()`.
  - Uses a Service Account (JWT) for authentication.
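Roughly, Drive webhook registration combines `files.watch()` with `changes.getStartPageToken()`, as in the sketch below. The channel ID generation, callback address handling, and return shape are illustrative assumptions.

```typescript
import { drive_v3 } from 'googleapis';
import { v4 as uuidv4 } from 'uuid';

// Illustrative Drive webhook registration: watch a folder and capture the initial page token.
async function registerDriveWebhook(drive: drive_v3.Drive, folderId: string, callbackUrl: string) {
  const channelId = uuidv4();

  const watchRes = await drive.files.watch({
    fileId: folderId,
    requestBody: { id: channelId, type: 'web_hook', address: callbackUrl },
  });

  const tokenRes = await drive.changes.getStartPageToken({});

  return {
    externalId: channelId,                        // Google channel ID
    channelResourceId: watchRes.data.resourceId,  // needed later for channels.stop()
    startPageToken: tokenRes.data.startPageToken, // persisted by the Scheduler SDK
  };
}

// Illustrative deregistration: stop the previously registered channel.
async function deregisterDriveWebhook(drive: drive_v3.Drive, externalId: string, channelResourceId: string) {
  await drive.channels.stop({ requestBody: { id: externalId, resourceId: channelResourceId } });
}
```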
- Role: Contains the specific API calls to GitHub for managing webhook subscriptions.
- Functionality (see the sketch below):
  - `registerWebhook()` – calls GitHub's hooks API and returns GitHub's webhook `id` as `externalId`.
  - `deregisterWebhook()` – calls GitHub's delete hook API.
  - Uses a Personal Access Token (PAT) for authentication.
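For reference, GitHub's REST hooks endpoints used here look roughly like the following `axios` sketch with a PAT; the owner/repo parameters and return shape are simplified assumptions.

```typescript
import axios from 'axios';

// Illustrative GitHub webhook registration against the REST hooks API using a PAT.
async function registerGithubWebhook(owner: string, repo: string, pat: string, callbackUrl: string, secret: string) {
  const res = await axios.post(
    `https://api.github.com/repos/${owner}/${repo}/hooks`,
    {
      name: 'web',
      active: true,
      events: ['push'],
      config: { url: callbackUrl, content_type: 'json', secret },
    },
    { headers: { Authorization: `Bearer ${pat}`, Accept: 'application/vnd.github+json' } },
  );
  return { externalId: String(res.data.id) }; // GitHub's webhook id, stored as externalId
}

// Illustrative deregistration: delete the previously created hook by id.
async function deregisterGithubWebhook(owner: string, repo: string, pat: string, hookId: string) {
  await axios.delete(`https://api.github.com/repos/${owner}/${repo}/hooks/${hookId}`, {
    headers: { Authorization: `Bearer ${pat}`, Accept: 'application/vnd.github+json' },
  });
}
```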
The Crawler SDK emphasizes unit testing to ensure the reliability and correctness of individual components.
- Framework: Jest is used as the primary testing framework.
- TypeScript Integration: `ts-jest` enables running tests written in TypeScript.
- Mocking: External dependencies (like `simple-git`, `fs/promises`, `axios`, `googleapis`, `os`, `uuid`, and even the internal logger) are extensively mocked using `jest.mock()` and `jest.spyOn()`. This ensures tests are isolated, deterministic, and fast, without making real network calls or file system operations (see the example below).
- Test Structure: Tests are organized in a `test/` directory, mirroring the `src/` structure (e.g., `test/datasources/types/git-crawler.test.ts`).
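A small example of this mocking style; the test's assertions are illustrative and do not exercise the real DataSource classes.

```typescript
// Illustrative test showing the mocking style.
import { simpleGit } from 'simple-git';

jest.mock('simple-git');   // no real git binary is invoked
jest.mock('fs/promises');  // no real file system access

describe('git-crawler mocking example', () => {
  it('lets tests stub out clone() entirely', async () => {
    const cloneMock = jest.fn().mockResolvedValue(undefined);
    (simpleGit as unknown as jest.Mock).mockReturnValue({ clone: cloneMock });

    // Any code under test that calls simpleGit().clone(...) now hits the stub.
    await simpleGit().clone('https://example.com/repo.git', '/tmp/somewhere');
    expect(cloneMock).toHaveBeenCalledTimes(1);
  });
});
```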
To run the tests:
```bash
# Ensure all dependencies and devDependencies are installed
npm install

# Configure jest.config.js in your project root

# Run the test command
npm test
# or
npm run test
```
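A minimal Jest configuration for this setup might look like the following, shown as `jest.config.ts` (which Jest supports because `ts-node` is installed above); a plain `jest.config.js` with the same options works just as well.

```typescript
// jest.config.ts – minimal ts-jest setup (illustrative)
import type { Config } from 'jest';

const config: Config = {
  preset: 'ts-jest',          // compile TypeScript tests on the fly
  testEnvironment: 'node',    // crawlers are server-side code
  roots: ['<rootDir>/test'],  // assumes tests live under test/, mirroring src/
};

export default config;
```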
This project is licensed under the ISC License – see LICENSE for details.