Aggregation: GROUP BY and Aggregate Functions #55

@anitnilay20

Description

Aggregation: GROUP BY and Aggregate Functions

Problem Statement

Users analyzing large JSON datasets need to compute statistics and summaries without exporting to external tools. Common analysis tasks like counting records, calculating averages, finding min/max, or grouping by categories are currently impossible within Thoth.

Use Cases:

  • Analytics: Count orders by status, average revenue by category
  • Monitoring: Find max response time per endpoint, count errors by type
  • Data Quality: Count nulls, find duplicates, identify outliers
  • Business Intelligence: Sum totals by region, average ratings by product

Proposed Solution

Add SQL-style aggregation with GROUP BY and aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.).

Aggregation Syntax

COUNT(*)                              # Count all records
COUNT(DISTINCT status)                # Count unique values
SUM(price)                            # Sum numeric field
AVG(rating)                           # Average value
MIN(created_at), MAX(created_at)      # Min/max values
GROUP BY category                     # Group by field
GROUP BY status, category             # Multi-field grouping

Combined Query Example

status = "completed" 
  GROUP BY user.country 
  SELECT COUNT(*) as total, AVG(price) as avg_price, SUM(quantity) as total_qty
  SORT BY total DESC

Implementation Design

Aggregate Functions

pub enum AggregateFunc {
    Count,              // COUNT(*) or COUNT(field)
    CountDistinct,      // COUNT(DISTINCT field)
    Sum,                // SUM(field)
    Avg,                // AVG(field)
    Min,                // MIN(field)
    Max,                // MAX(field)
    First,              // FIRST(field) - first non-null value
    Last,               // LAST(field) - last non-null value
    StdDev,             // STDDEV(field) - standard deviation
    Median,             // MEDIAN(field) - median value
}

pub struct AggregateExpr {
    func: AggregateFunc,
    field: Option<JsonPath>,    // None for COUNT(*)
    alias: String,               // Result column name
}
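
For the combined query shown earlier, the parser would emit one AggregateExpr per SELECT item. The sketch below shows the expected parse result; JsonPath::parse is a hypothetical constructor, not an API this issue defines.

// Hypothetical parse result for:
//   SELECT COUNT(*) as total, AVG(price) as avg_price, SUM(quantity) as total_qty
let aggregates = vec![
    AggregateExpr {
        func: AggregateFunc::Count,
        field: None, // COUNT(*) takes no field
        alias: "total".to_string(),
    },
    AggregateExpr {
        func: AggregateFunc::Avg,
        field: Some(JsonPath::parse("price")), // assumed constructor
        alias: "avg_price".to_string(),
    },
    AggregateExpr {
        func: AggregateFunc::Sum,
        field: Some(JsonPath::parse("quantity")),
        alias: "total_qty".to_string(),
    },
];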

Group By Configuration

pub struct GroupByConfig {
    fields: Vec<JsonPath>,              // Fields to group by
    aggregates: Vec<AggregateExpr>,     // Aggregate functions
}

pub struct AggregateResult {
    groups: HashMap<GroupKey, AggregateValues>,
    total_count: usize,
}

// Example result:
// {
//   "US": { "total": 150, "avg_price": 49.99, "total_qty": 523 },
//   "UK": { "total": 83, "avg_price": 52.10, "total_qty": 291 },
// }
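
GroupKey and AggregateValues are referenced above but not defined. A minimal sketch (an assumption, not part of the proposal) keys each group by the rendered GROUP BY values and stores every computed aggregate as an optional number, so a missing value serializes to JSON null:

use std::collections::HashMap;

/// Values of the GROUP BY fields, in declaration order, rendered to
/// strings so the key is Eq + Hash even when a field holds a float.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct GroupKey(pub Vec<String>);

/// Maps an aggregate's alias to its computed value; None becomes JSON null.
pub type AggregateValues = HashMap<String, Option<f64>>;

Rendering key values to strings sidesteps the fact that serde_json::Value is not hashable; a production version might intern keys instead.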

Aggregation Algorithm

Phase 1: Scan and Group

// Parallel grouping with Rayon
let groups: HashMap<GroupKey, Vec<Value>> = records
    .par_iter()
    .fold(
        || HashMap::new(),
        |mut acc, record| {
            let key = extract_group_key(record, &config.fields);
            acc.entry(key).or_insert_with(Vec::new).push(record.clone());
            acc
        }
    )
    .reduce(
        || HashMap::new(),
        |mut a, b| {
            for (key, mut values) in b {
                a.entry(key).or_insert_with(Vec::new).append(&mut values);
            }
            a
        }
    );
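
extract_group_key is likewise unspecified. Below is a sketch consistent with the GroupKey definition above; JsonPath::lookup is an assumed accessor that resolves a path like user.country against a record.

use serde_json::Value;

// Hypothetical helper: render each GROUP BY field of a record into the
// composite key. Records missing a field all land in a shared "null" group.
fn extract_group_key(record: &Value, fields: &[JsonPath]) -> GroupKey {
    GroupKey(
        fields.iter()
            .map(|path| match path.lookup(record) {
                Some(v) => v.to_string(),   // compact JSON rendering, e.g. "\"US\"" or "42"
                None => "null".to_string(), // missing field groups as null
            })
            .collect(),
    )
}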

Phase 2: Compute Aggregates

fn compute_aggregates(group: &[Value], exprs: &[AggregateExpr]) -> AggregateValues {
    let mut results = HashMap::new();

    for expr in exprs {
        // Every arm yields Option<f64>: None means "no computable value"
        // (empty group, all nulls) and serializes to JSON null.
        let value: Option<f64> = match expr.func {
            AggregateFunc::Count => Some(group.len() as f64), // COUNT(*); COUNT(field) would skip nulls
            AggregateFunc::Sum => Some(
                group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .sum(),
            ),
            AggregateFunc::Avg => {
                let values: Vec<f64> = group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .collect();
                if values.is_empty() {
                    None // avoid dividing by zero on empty or all-null groups
                } else {
                    Some(values.iter().sum::<f64>() / values.len() as f64)
                }
            }
            AggregateFunc::Min => {
                // f64 is not Ord, so fold with f64::min instead of Iterator::min;
                // MIN/MAX over strings or dates would need a comparable extraction.
                group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .fold(None, |min, x| Some(min.map_or(x, |m: f64| m.min(x))))
            }
            // ... other functions
        };
        results.insert(expr.alias.clone(), value);
    }

    results
}
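
extract_number is where the type-aware aggregation and null-handling acceptance criteria land. A sketch follows; whether numeric strings like "100" should coerce is a design choice the issue leaves open, so that branch is marked optional.

use serde_json::Value;

// Resolve the field (None means the record itself) and coerce it to f64.
// Nulls and non-numeric values return None and are skipped by the
// filter_map calls in compute_aggregates above.
fn extract_number(record: &Value, field: &Option<JsonPath>) -> Option<f64> {
    let target = match field {
        Some(path) => path.lookup(record)?, // assumed accessor, as above
        None => record,
    };
    match target {
        Value::Number(n) => n.as_f64(),
        // Optional leniency: accept numeric strings such as "100".
        Value::String(s) => s.trim().parse::<f64>().ok(),
        _ => None, // null, bool, array, object: skipped
    }
}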

Performance Optimizations

  1. Parallel Grouping: Use Rayon's fold/reduce pattern
  2. Streaming Aggregation: For simple cases (no GROUP BY), compute in one pass (see the sketch after this list)
  3. Approximate Aggregates: For huge datasets, sample first N records
  4. Index-Assisted: Use value indices if available
  5. Incremental Updates: Recompute only changed groups
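
For the streaming path (optimization 2), no groups ever need to be materialized; running totals are enough. A single-pass sketch covering COUNT/SUM/AVG/MIN/MAX for one field, reusing the assumed extract_number helper:

#[derive(Default)]
struct RunningStats {
    count: u64,       // COUNT(field): non-null numeric values seen
    sum: f64,         // SUM(field)
    min: Option<f64>, // MIN(field)
    max: Option<f64>, // MAX(field)
}

impl RunningStats {
    fn update(&mut self, x: f64) {
        self.count += 1;
        self.sum += x;
        self.min = Some(self.min.map_or(x, |m| m.min(x)));
        self.max = Some(self.max.map_or(x, |m| m.max(x)));
    }

    fn avg(&self) -> Option<f64> {
        (self.count > 0).then(|| self.sum / self.count as f64)
    }
}

// One pass over the records, O(1) memory regardless of dataset size.
fn stream_aggregate(records: &[Value], field: &Option<JsonPath>) -> RunningStats {
    let mut stats = RunningStats::default();
    for record in records {
        if let Some(x) = extract_number(record, field) {
            stats.update(x);
        }
    }
    stats
}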

UI Integration

Results Display

Option 1: Table View

┌──────────────────────────────────────────────────────┐
│ GROUP BY: category                                   │
├──────────────┬─────────┬────────────┬────────────────┤
│ Category     │ Count   │ Avg Price  │ Total Revenue  │
├──────────────┼─────────┼────────────┼────────────────┤
│ Electronics  │ 1,523   │ $249.99    │ $380,734.77    │
│ Clothing     │ 4,891   │ $45.20     │ $221,073.20    │
│ Books        │ 892     │ $18.50     │ $16,502.00     │
└──────────────┴─────────┴────────────┴────────────────┘

Option 2: Card View (for few groups)

┌────────────────────────────────┐
│ Electronics                    │
│ ────────────────────────────── │
│ Count: 1,523                   │
│ Avg Price: $249.99             │
│ Total Revenue: $380,734.77     │
│ Min: $9.99  Max: $1,999.99     │
└────────────────────────────────┘

Option 3: Visualization (future)

  • Bar chart for counts
  • Pie chart for distribution
  • Line chart for trends over time

Query Builder UI

┌─────────────────────────────────────────────────────┐
│ 📊 Aggregation                                      │
│ ─────────────────────────────────────────────────── │
│ Group By: [category     ▼] [+ Add Field]           │
│                                                     │
│ Aggregates:                                         │
│  1. [COUNT      ▼] (*) as [total         ]  [✕]    │
│  2. [AVG        ▼] [price ▼] as [avg_price  ]  [✕] │
│  3. [SUM        ▼] [revenue▼] as [total_rev ]  [✕] │
│                                                     │
│ [+ Add Aggregate]                                   │
│                                                     │
│ [Apply]  [Export CSV]  [Clear]                     │
└─────────────────────────────────────────────────────┘

Example Queries

// Count all records
"COUNT(*)"

// Count by status
"GROUP BY status SELECT COUNT(*) as total"

// Average price per category
"GROUP BY category SELECT AVG(price) as avg_price"

// Multiple aggregates
"GROUP BY user.country 
 SELECT COUNT(*) as orders, 
        SUM(total) as revenue, 
        AVG(items.length) as avg_items"

// With filtering
"status = \"shipped\" 
 GROUP BY carrier 
 SELECT COUNT(*) as shipments, AVG(delivery_days) as avg_days
 SORT BY shipments DESC"

// Count distinct values
"SELECT COUNT(DISTINCT user_id) as unique_users"

// Min/Max with grouping
"GROUP BY category 
 SELECT MIN(price) as min_price, MAX(price) as max_price"

// Time-based grouping (future)
"GROUP BY DATE_TRUNC('day', created_at) 
 SELECT COUNT(*) as daily_count"

Acceptance Criteria

  • Implement aggregate functions: COUNT, SUM, AVG, MIN, MAX
  • Support COUNT(*) and COUNT(DISTINCT field)
  • Parse GROUP BY with single and multiple fields
  • Parse SELECT with aggregate expressions and aliases
  • Handle nested field paths in aggregates
  • Type-aware aggregation (skip non-numeric for SUM/AVG)
  • Null handling (skip nulls in calculations)
  • Combine with filtering (WHERE clause)
  • Display results in table format
  • Export aggregated results to CSV/JSON
  • Parallel grouping with Rayon
  • Maintain performance (<5s for 100K records)
  • Add comprehensive tests
  • Document aggregation syntax

Edge Cases to Handle

  • Empty groups: No records match filter
    • Return empty result set
  • All nulls: AVG([null, null, null])
    • Return null or 0? (configurable)
  • Division by zero: AVG of empty group
    • Return null
  • Mixed types: SUM(["100", 100, "abc"])
    • Skip non-numeric, warn user
  • Nested arrays: GROUP BY items[*].category
    • Flatten or error? (start with error, flatten in future)
  • Huge groups: 1M unique values
    • Warn user, suggest sampling
  • Memory exhaustion: Grouping creates too many groups
    • Streaming aggregation or external grouping
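
Several of these edge cases pin down exact expected values, so they translate directly into unit tests against the Phase 2 sketch (type and helper names as assumed earlier; the all-nulls case follows the sketch's choice of null, which the issue leaves configurable):

#[cfg(test)]
mod tests {
    use super::*;
    use serde_json::json;

    #[test]
    fn avg_of_all_nulls_is_null() {
        let group = vec![json!({"x": null}), json!({"x": null})];
        let exprs = vec![AggregateExpr {
            func: AggregateFunc::Avg,
            field: Some(JsonPath::parse("x")), // assumed constructor
            alias: "avg_x".to_string(),
        }];
        // Nulls are skipped, so there is nothing to average: expect null, not 0.
        assert_eq!(compute_aggregates(&group, &exprs)["avg_x"], None);
    }

    #[test]
    fn sum_skips_non_numeric_values() {
        let group = vec![json!({"x": 100}), json!({"x": "abc"}), json!({"x": 50})];
        let exprs = vec![AggregateExpr {
            func: AggregateFunc::Sum,
            field: Some(JsonPath::parse("x")),
            alias: "sum_x".to_string(),
        }];
        // "abc" is not numeric and must not poison the sum.
        assert_eq!(compute_aggregates(&group, &exprs)["sum_x"], Some(150.0));
    }
}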

Performance Targets

  • Simple aggregates (no GROUP BY): <100ms for 100K records
  • Single-field GROUP BY (<100 groups): <500ms for 100K records
  • Multi-field GROUP BY (<1K groups): <2s for 100K records
  • High cardinality (10K+ groups): <10s for 100K records

Future Enhancements

  • HAVING clause: Filter groups after aggregation

    GROUP BY category 
      SELECT COUNT(*) as total 
      HAVING total > 100
    
  • Percentiles: PERCENTILE(field, 90) - 90th percentile

  • Mode: MODE(field) - most common value

  • String aggregation: STRING_AGG(name, ', ') - concatenate values

  • Window functions: ROW_NUMBER(), RANK(), LAG(), LEAD()

  • Time-based grouping: DATE_TRUNC('hour', timestamp)

  • Approximate aggregates: HyperLogLog for DISTINCT, T-Digest for percentiles

  • Pivot tables: Transform groups into columns

  • Visualizations: Built-in charts for aggregate results

Non-Goals (Out of Scope)

  • Nested aggregations (aggregates of aggregates)
  • Recursive aggregation
  • User-defined aggregate functions
  • Distributed aggregation
  • Real-time/incremental aggregation

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), feature (New feature requests), performance (Performance and scalability improvements), priority:medium (Medium priority items), size:large (Large effort, 1+ months)
