Feature Description
Add a describe_dataset tool that helps agents understand what data is available and how files relate to each other before querying.
Problem: Agents currently need to call multiple tools (get_schema, find_relationships, sample_data) to understand a dataset before they can formulate queries. This adds cognitive load and leads to errors.
API Design
describe_dataset({
  directory: "LiveTest"
})
Expected Output
{
  "directory": "LiveTest",
  "files": [
    {
      "name": "chars.json",
      "sizeMB": 54.5,
      "entityCount": 210,
      "entityType": "PlayerCharacter",
      "keyFields": ["entityId", "payload.playerId", "payload.character.characterClassId"],
      "sampleValues": {
        "characterClassId": ["ChacChel_Class", "Thor_Class"]
      }
    },
    {
      "name": "live.json",
      "sizeMB": 4.8,
      "entityCount": 127,
      "entityType": "Player",
      "keyFields": ["entityId", "payload.playerLevel", "payload.totalIapSpend"],
      "metricsAvailable": ["loginHistory", "deviceHistory", "stats.createdAt"]
    }
  ],
  "relationships": [
    {
      "description": "chars.playerId -> live.entityId",
      "leftFile": "chars.json",
      "leftKey": "payload.playerId",
      "rightFile": "live.json",
      "rightKey": "entityId",
      "coverage": "100%",
      "type": "many-to-one"
    }
  ],
  "suggestedQueries": [
    "Group by characterClassId to compare player segments",
    "Use loginHistory for retention analysis",
    "Join on playerId <-> entityId for cross-file queries"
  ]
}
Implementation Details
Existing Code to Reuse
- Schema extraction - json_genius/src/analyzer/schema-extractor.ts
  - extractSchema(filePath, options) - Returns SchemaNode with type info, patterns, examples
  - Already detects $type fields which indicate entity type
- Relationship detection - json_genius/src/analyzer/relationship-finder.ts
  - findRelationships(leftFile, rightFile, options) - Returns RelationshipResult
  - Scans for ID fields using patterns: /Id$/, /entityId/i, etc.
  - Already computes coverage percentage and relationship type
- Entity counting - json_genius/src/query/aggregate.ts
  - count(filePath, options) - Returns { total, scanned }
- Tool pattern - json_genius/src/mcp/tools.ts
  - Follow existing pattern: add tool to tools array, create handleDescribeDataset function
  - Use validateFile() for path resolution (handles relative paths via dataDir)
  - Return via successResult() / errorResult() helpers
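The tool pattern above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the issue does not show the real signatures of successResult() and errorResult(), so the versions below are local stand-ins that return an MCP-style text result, and the `describe` parameter is a hypothetical injection point standing in for describeDataset(). Path resolution via validateFile() is omitted because its signature is not given here.

```typescript
// MCP-style tool result: a list of text content items plus an optional error flag.
type ToolResult = {
  content: { type: "text"; text: string }[];
  isError?: boolean;
};

// Stand-ins for the successResult()/errorResult() helpers named in the issue.
// ASSUMPTION: the real helpers in tools.ts may have different signatures.
function successResult(data: unknown): ToolResult {
  return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
}

function errorResult(message: string): ToolResult {
  return { content: [{ type: "text", text: message }], isError: true };
}

// Hypothetical handler: validates the arguments, delegates to an injected
// describe function, and converts thrown errors into error results.
export async function handleDescribeDataset(
  args: { directory?: unknown },
  describe: (directory: string) => Promise<unknown>
): Promise<ToolResult> {
  if (typeof args.directory !== "string" || args.directory.length === 0) {
    return errorResult("describe_dataset requires a non-empty 'directory' string");
  }
  try {
    return successResult(await describe(args.directory));
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return errorResult(`describe_dataset failed: ${message}`);
  }
}
```

Injecting `describe` keeps the handler testable without touching the file system; in the real tools.ts the handler would presumably call describeDataset() directly.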
New File: json_genius/src/analyzer/dataset-discovery.ts
export interface FileInfo {
  name: string;
  sizeMB: number;
  entityCount: number;
  entityType?: string;
  keyFields: string[];
  sampleValues: Record<string, string[]>;
  metricsAvailable?: string[];
}

export interface DatasetDescription {
  directory: string;
  files: FileInfo[];
  relationships: RelationshipSummary[];
  suggestedQueries: string[];
}

export async function describeDataset(
  directory: string
): Promise<DatasetDescription>
Implementation Steps
- Scan directory for JSON files - Use fs.readdir + filter for .json
- For each file:
  - Get file size via fs.stat
  - Count entities using existing count() from aggregate.ts
  - Extract schema using extractSchema() - look for $type to get entity type
  - Identify key fields from schema (fields ending in Id, Ids, or containing entity ID patterns)
  - Sample unique values for categorical fields (use schema examples or light sampling)
- Find relationships between all file pairs - Use findRelationships() between each pair
- Generate suggested queries based on:
  - Groupable fields (enums, categorical strings with few unique values)
  - Numeric fields (for stats)
  - Detected relationships (for joins)
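The parts of these steps that do not depend on existing project code could be sketched as below. Everything here is illustrative: the helper names (listJsonFiles, isKeyField, filePairs, suggestQueries) and the SuggestionInput shape are hypothetical, MB is assumed to mean 1024 * 1024 bytes, and the key-field heuristic only checks the last path segment. The calls to count(), extractSchema(), and findRelationships() are left out because their option types are not shown in this issue.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Hypothetical helper: list .json files in a directory with their size in MB.
export async function listJsonFiles(
  directory: string
): Promise<{ name: string; sizeMB: number }[]> {
  const entries = await fs.readdir(directory, { withFileTypes: true });
  const jsonNames = entries
    .filter((e) => e.isFile() && e.name.toLowerCase().endsWith(".json"))
    .map((e) => e.name)
    .sort();
  return Promise.all(
    jsonNames.map(async (name) => {
      const { size } = await fs.stat(path.join(directory, name));
      // ASSUMPTION: 1 MB = 1024 * 1024 bytes, rounded to one decimal place.
      return { name, sizeMB: Math.round((size / (1024 * 1024)) * 10) / 10 };
    })
  );
}

// Hypothetical heuristic: a dotted field path is a key field when its last
// segment ends in "Id" or "Ids", or equals "entityId" ignoring case.
export function isKeyField(fieldPath: string): boolean {
  const leaf = fieldPath.split(".").pop() ?? "";
  return /Ids?$/.test(leaf) || /^entityId$/i.test(leaf);
}

// Every unordered pair of file names, so each pair is compared only once.
export function filePairs(names: string[]): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      pairs.push([names[i], names[j]]);
    }
  }
  return pairs;
}

// Hypothetical input for suggestion generation; field names are illustrative.
export interface SuggestionInput {
  groupableFields: string[];
  numericFields: string[];
  relationships: { leftKey: string; rightKey: string }[];
}

// One suggestion per groupable field, numeric field, and detected relationship.
export function suggestQueries(input: SuggestionInput): string[] {
  return [
    ...input.groupableFields.map((f) => `Group by ${f} to compare segments`),
    ...input.numericFields.map((f) => `Compute stats (min, max, avg) on ${f}`),
    ...input.relationships.map(
      (r) => `Join on ${r.leftKey} <-> ${r.rightKey} for cross-file queries`
    ),
  ];
}
```

Note that the number of pairs grows quadratically with the number of files, so a directory with many JSON files may need a cap or a cheaper pre-filter before calling findRelationships() on each pair.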
Files to Modify
- json_genius/src/mcp/tools.ts - Add describe_dataset tool definition and handler
- New: json_genius/src/analyzer/dataset-discovery.ts - Core discovery logic
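For the tools.ts change, the new entry in the tools array might look something like the sketch below. It assumes the array holds MCP-style definitions (name, description, and a JSON Schema under inputSchema); the ToolDefinition interface and the description wording are illustrative and should be matched to whatever type tools.ts already uses.

```typescript
// ASSUMPTION: tools.ts stores MCP-style definitions with a JSON Schema for inputs.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// Hypothetical definition for the new tool; only "directory" is taken from the
// API Design section, the description text is illustrative.
export const describeDatasetTool: ToolDefinition = {
  name: "describe_dataset",
  description:
    "Summarize the JSON files in a directory: entity counts, key fields, " +
    "sample values, cross-file relationships, and suggested queries.",
  inputSchema: {
    type: "object",
    properties: {
      directory: {
        type: "string",
        description: 'Directory containing JSON files, e.g. "LiveTest"',
      },
    },
    required: ["directory"],
  },
};
```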
Success Criteria
- Single tool call provides complete dataset overview
- Relationships auto-detected between files
- Key fields identified for grouping/joining
- Sample values shown for categorical fields
- Suggested queries help agents get started
- Works with LiveTest directory (if available) or any directory with JSON files
- Follows existing code patterns (streaming where appropriate, proper error handling)
- TypeScript compiles without errors (pnpm typecheck)
Created from TD analysis of issue #21 - Priority 3: Reduce agent cognitive load