Skip to content

Add aggregate/group_by support to join_files #23

@MarkSpectarium

Description

@MarkSpectarium

Feature Description

Add group_by and aggregate parameters to join_files to enable cross-dataset analytics without returning raw records.

Use Cases:

  • "D1 retention by character class"
  • "Average player level by character class"
  • "Total IAP spend by country"
  • "Session count by device type"
  • Any cross-dataset grouping + aggregation

Currently requires 4-5 tool calls + manual calculations. With this feature: 1 tool call.

API Design

join_files({
  file1: "chars.json",
  file2: "live.json",
  leftKey: "payload.playerId",
  rightKey: "entityId",
  group_by: "a.payload.character.characterClassId",
  aggregate: {
    "player_count": "COUNT(*)",
    "avg_level": "AVG(b.payload.playerLevel)",
    "total_spend": "SUM(b.payload.totalIapSpend)",
    "d1_retained": "COUNT_IF(D1_RETENTION(b.payload.loginHistory))"
  }
})

Implementation Constraints

  1. Explicit keys required - No auto-inference for group_by or aggregate
  2. Backwards compatible - Without group_by/aggregate, behaves as before (returns raw joined records)
  3. Side prefixes required - a. for left file, b. for right file (existing pattern)
  4. Standard aggregate functions - COUNT, SUM, AVG, MIN, MAX, COUNT_IF
  5. Retention helper - D1_RETENTION(), D7_RETENTION() functions for login history analysis

Expected Output

{
  "groups": {
    "ChacChel_Class": {
      "player_count": 45,
      "avg_level": 12.3,
      "total_spend": 156.50,
      "d1_retained": 22
    },
    "Thor_Class": {
      "player_count": 82,
      "avg_level": 14.1,
      "total_spend": 289.00,
      "d1_retained": 44
    }
  },
  "totals": {
    "player_count": 127,
    "avg_level": 13.4,
    "total_spend": 445.50,
    "d1_retained": 66
  },
  "joinKeys": { "leftKey": "payload.playerId", "rightKey": "entityId" }
}

Files to Modify

  1. json_genius/src/query/join.ts - Add group_by and aggregate logic
  2. json_genius/src/mcp/tools.ts - Update join_files input schema
  3. New file: json_genius/src/query/aggregators.ts - Aggregate function implementations

Algorithm

  1. Build right-side index (existing)
  2. Stream left file and join (existing)
  3. NEW: If group_by specified:
    • Extract group key from joined record
    • Accumulate aggregate values per group
  4. NEW: Calculate final aggregate values (AVG = sum/count, etc.)
  5. Return grouped results + totals

Aggregate Functions

Function Description
COUNT(*) Count of records in group
COUNT(path) Count of non-null values at path
COUNT_IF(condition) Count where condition is true
SUM(path) Sum of numeric values
AVG(path) Average of numeric values
MIN(path) Minimum value
MAX(path) Maximum value
D1_RETENTION(loginHistoryPath) True if player returned after 1+ days
D7_RETENTION(loginHistoryPath) True if player returned after 7+ days

Success Criteria

  • group_by parameter groups joined results by specified field
  • aggregate parameter calculates metrics per group
  • totals always included alongside groups
  • Backwards compatible (no group_by = raw records as before)
  • Retention helpers work with loginHistory arrays
  • Works with LiveTest data (chars.json + live.json)

Created from TD analysis of issue #21 - Priority 2: General cross-dataset aggregation

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions