Add dataset discovery tool #24

@MarkSpectarium

Feature Description

Add a describe_dataset tool that helps agents understand what data is available and how files relate to each other before querying.

Problem: Agents currently need to call multiple tools (get_schema, find_relationships, sample_data) to understand a dataset before they can formulate queries. Each extra call adds cognitive load and another chance for error.

API Design

describe_dataset({
  directory: "LiveTest"
})

Expected Output

{
  "directory": "LiveTest",
  "files": [
    {
      "name": "chars.json",
      "sizeMB": 54.5,
      "entityCount": 210,
      "entityType": "PlayerCharacter",
      "keyFields": ["entityId", "payload.playerId", "payload.character.characterClassId"],
      "sampleValues": {
        "characterClassId": ["ChacChel_Class", "Thor_Class"]
      }
    },
    {
      "name": "live.json",
      "sizeMB": 4.8,
      "entityCount": 127,
      "entityType": "Player",
      "keyFields": ["entityId", "payload.playerLevel", "payload.totalIapSpend"],
      "metricsAvailable": ["loginHistory", "deviceHistory", "stats.createdAt"]
    }
  ],
  "relationships": [
    {
      "description": "chars.playerId -> live.entityId",
      "leftFile": "chars.json",
      "leftKey": "payload.playerId",
      "rightFile": "live.json",
      "rightKey": "entityId",
      "coverage": "100%",
      "type": "many-to-one"
    }
  ],
  "suggestedQueries": [
    "Group by characterClassId to compare player segments",
    "Use loginHistory for retention analysis",
    "Join on playerId <-> entityId for cross-file queries"
  ]
}
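
To illustrate how a caller might consume this output, here is a minimal sketch that looks up a join between two files from the relationships array. The planJoin helper and the JoinPlan shape are hypothetical names introduced for this example; only the relationship fields come from the output above.

```typescript
// Field names mirror one entry of "relationships" in the expected output.
interface RelationshipSummary {
  description: string;
  leftFile: string;
  leftKey: string;
  rightFile: string;
  rightKey: string;
  coverage: string; // e.g. "100%"
  type: string; // e.g. "many-to-one"
}

// Hypothetical result shape for this sketch.
interface JoinPlan {
  leftKey: string; // key in fileA
  rightKey: string; // key in fileB
  coveragePct: number;
}

// Hypothetical helper: find a relationship linking two files, in either direction.
function planJoin(
  relationships: RelationshipSummary[],
  fileA: string,
  fileB: string,
): JoinPlan | undefined {
  for (const r of relationships) {
    const coveragePct = Number.parseFloat(r.coverage); // "100%" -> 100
    if (r.leftFile === fileA && r.rightFile === fileB) {
      return { leftKey: r.leftKey, rightKey: r.rightKey, coveragePct };
    }
    if (r.leftFile === fileB && r.rightFile === fileA) {
      // Swap so that leftKey always belongs to fileA.
      return { leftKey: r.rightKey, rightKey: r.leftKey, coveragePct };
    }
  }
  return undefined;
}

const relationships: RelationshipSummary[] = [
  {
    description: "chars.playerId -> live.entityId",
    leftFile: "chars.json",
    leftKey: "payload.playerId",
    rightFile: "live.json",
    rightKey: "entityId",
    coverage: "100%",
    type: "many-to-one",
  },
];

const plan = planJoin(relationships, "live.json", "chars.json");
// plan is { leftKey: "entityId", rightKey: "payload.playerId", coveragePct: 100 }
```

Keeping coverage as a string matches the output above; a numeric field would save callers the parse, but that is a design choice for the implementer.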

Implementation Details

Existing Code to Reuse

  1. Schema extraction - json_genius/src/analyzer/schema-extractor.ts

    • extractSchema(filePath, options) - Returns SchemaNode with type info, patterns, examples
    • Already detects $type fields which indicate entity type
  2. Relationship detection - json_genius/src/analyzer/relationship-finder.ts

    • findRelationships(leftFile, rightFile, options) - Returns RelationshipResult
    • Scans for ID fields using patterns: /Id$/, /entityId/i, etc.
    • Already computes coverage percentage and relationship type
  3. Entity counting - json_genius/src/query/aggregate.ts

    • count(filePath, options) - Returns { total, scanned }
  4. Tool pattern - json_genius/src/mcp/tools.ts

    • Follow existing pattern: add tool to tools array, create handleDescribeDataset function
    • Use validateFile() for path resolution (handles relative paths via dataDir)
    • Return via successResult() / errorResult() helpers
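
As a rough sketch of the tool pattern in item 4, the definition and handler could look like the following. The real signatures of successResult() and errorResult() are not shown in this issue, so the versions below are local stand-ins that return the standard MCP text-content result shape. parseDescribeArgs is a hypothetical helper, and the describe callback stands in for describeDataset(). Path resolution via validateFile() is omitted because its signature is not given here.

```typescript
// Tool definition in the shape MCP uses for tool listings:
// a name, a description, and a JSON Schema for the input.
const describeDatasetTool = {
  name: "describe_dataset",
  description: "Summarize the JSON files in a directory and how they relate",
  inputSchema: {
    type: "object",
    properties: {
      directory: { type: "string", description: "Directory to scan, e.g. LiveTest" },
    },
    required: ["directory"],
  },
} as const;

// ASSUMPTION: stand-ins for the project's successResult()/errorResult() helpers.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function successResult(data: unknown): ToolResult {
  return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
}

function errorResult(message: string): ToolResult {
  return { content: [{ type: "text", text: message }], isError: true };
}

// Hypothetical helper: validate raw tool arguments before doing any work.
function parseDescribeArgs(args: unknown): { directory: string } | { error: string } {
  if (typeof args !== "object" || args === null) {
    return { error: "arguments must be an object" };
  }
  const directory = (args as Record<string, unknown>).directory;
  if (typeof directory !== "string" || directory.trim() === "") {
    return { error: "directory must be a non-empty string" };
  }
  return { directory };
}

// Handler sketch. `describe` is injected to keep the example self-contained;
// in the project it would be describeDataset() from dataset-discovery.ts.
async function handleDescribeDataset(
  args: unknown,
  describe: (directory: string) => Promise<unknown>,
): Promise<ToolResult> {
  const parsed = parseDescribeArgs(args);
  if ("error" in parsed) return errorResult(parsed.error);
  try {
    return successResult(await describe(parsed.directory));
  } catch (err) {
    return errorResult(err instanceof Error ? err.message : String(err));
  }
}
```

Validating arguments before calling into the analyzer lets malformed calls fail fast with a readable message instead of a file-system error.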

New File: json_genius/src/analyzer/dataset-discovery.ts

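export interface FileInfo {
  name: string;
  sizeMB: number;
  entityCount: number;
  entityType?: string;
  keyFields: string[];
  // Optional: the live.json entry in Expected Output has no sampleValues
  sampleValues?: Record<string, string[]>;
  metricsAvailable?: string[];
}

// Fields mirror the "relationships" entries in Expected Output
export interface RelationshipSummary {
  description: string;
  leftFile: string;
  leftKey: string;
  rightFile: string;
  rightKey: string;
  coverage: string; // e.g. "100%"
  type: string;     // e.g. "many-to-one"
}

export interface DatasetDescription {
  directory: string;
  files: FileInfo[];
  relationships: RelationshipSummary[];
  suggestedQueries: string[];
}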

export async function describeDataset(
  directory: string
): Promise<DatasetDescription>

Implementation Steps

  1. Scan directory for JSON files - Use fs.readdir + filter for .json
  2. For each file:
    • Get file size via fs.stat
    • Count entities using existing count() from aggregate.ts
    • Extract schema using extractSchema() - look for $type to get entity type
    • Identify key fields from schema (fields ending in Id, Ids, or containing entity ID patterns)
    • Sample unique values for categorical fields (use schema examples or light sampling)
  3. Find relationships between files - Call findRelationships() on each unordered pair of files
  4. Generate suggested queries based on:
    • Groupable fields (enums, categorical strings with few unique values)
    • Numeric fields (for stats)
    • Detected relationships (for joins)
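
The pure parts of the steps above can be sketched as small helpers. All function names here are hypothetical, and the thresholds (a 20-value cap for "few unique values", 1024 * 1024 bytes per MB) are assumptions, not project decisions. File-system access and the existing count(), extractSchema() and findRelationships() calls are deliberately left out.

```typescript
// Step 1: keep only .json entries from a directory listing (e.g. from fs.readdir).
function filterJsonFiles(names: string[]): string[] {
  return names.filter((n) => n.toLowerCase().endsWith(".json")).sort();
}

// Step 2: convert a byte size (e.g. from fs.stat) to MB with one decimal place.
// ASSUMPTION: 1 MB = 1024 * 1024 bytes.
function bytesToMB(bytes: number): number {
  return Math.round((bytes / (1024 * 1024)) * 10) / 10;
}

// Step 2: key-field heuristic on the last path segment:
// ends in "Id" or "Ids", or contains "entityId" in any letter case.
function isKeyField(path: string): boolean {
  const leaf = path.split(".").pop() ?? path;
  return /Ids?$/.test(leaf) || /entityId/i.test(leaf);
}

// Step 3: every unordered pair of files, each pair listed once.
function filePairs(files: string[]): [string, string][] {
  const pairs: [string, string][] = [];
  files.forEach((a, i) => {
    files.slice(i + 1).forEach((b) => pairs.push([a, b]));
  });
  return pairs;
}

// Step 4: turn field summaries and detected joins into query hints.
interface FieldSummary {
  path: string;
  kind: "categorical" | "numeric";
  uniqueValues?: number; // unknown cardinality is treated as "too many"
}

function suggestQueries(
  fields: FieldSummary[],
  joins: { leftKey: string; rightKey: string }[],
  maxCategories = 20, // ASSUMPTION: cutoff for "few unique values"
): string[] {
  const out: string[] = [];
  for (const f of fields) {
    if (f.kind === "categorical" && (f.uniqueValues ?? Infinity) <= maxCategories) {
      out.push(`Group by ${f.path} to compare segments`);
    }
    if (f.kind === "numeric") {
      out.push(`Compute stats on ${f.path}`);
    }
  }
  for (const j of joins) {
    out.push(`Join on ${j.leftKey} <-> ${j.rightKey} for cross-file queries`);
  }
  return out;
}
```

Note that pairing every file with every other grows quadratically with the number of files, so a directory with many JSON files may need a cap or a cheaper pre-filter before calling findRelationships().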

Files to Modify

  1. json_genius/src/mcp/tools.ts - Add describe_dataset tool definition and handler
  2. New: json_genius/src/analyzer/dataset-discovery.ts - Core discovery logic

Success Criteria

  • Single tool call provides complete dataset overview
  • Relationships auto-detected between files
  • Key fields identified for grouping/joining
  • Sample values shown for categorical fields
  • Suggested queries help agents get started
  • Works with LiveTest directory (if available) or any directory with JSON files
  • Follows existing code patterns (streaming where appropriate, proper error handling)
  • TypeScript compiles without errors (pnpm typecheck)

Created from TD analysis of issue #21 - Priority 3: Reduce agent cognitive load
