Aggregation: GROUP BY and Aggregate Functions #55

@anitnilay20

Description

Aggregation: GROUP BY and Aggregate Functions

Problem Statement

Users analyzing large JSON datasets need to compute statistics and summaries without exporting to external tools. Common analysis tasks like counting records, calculating averages, finding min/max, or grouping by categories are currently impossible within Thoth.

Use Cases:

  • Analytics: Count orders by status, average revenue by category
  • Monitoring: Find max response time per endpoint, count errors by type
  • Data Quality: Count nulls, find duplicates, identify outliers
  • Business Intelligence: Sum totals by region, average ratings by product

Proposed Solution

Add SQL-style aggregation with GROUP BY and aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.).

Aggregation Syntax

COUNT(*)                              # Count all records
COUNT(DISTINCT status)                # Count unique values
SUM(price)                            # Sum numeric field
AVG(rating)                           # Average value
MIN(created_at), MAX(created_at)      # Min/max values
GROUP BY category                     # Group by field
GROUP BY status, category             # Multi-field grouping

Combined Query Example

status = "completed" 
  GROUP BY user.country 
  SELECT COUNT(*) as total, AVG(price) as avg_price, SUM(quantity) as total_qty
  SORT BY total DESC

Implementation Design

Aggregate Functions

pub enum AggregateFunc {
    Count,              // COUNT(*) or COUNT(field)
    CountDistinct,      // COUNT(DISTINCT field)
    Sum,                // SUM(field)
    Avg,                // AVG(field)
    Min,                // MIN(field)
    Max,                // MAX(field)
    First,              // FIRST(field) - first non-null value
    Last,               // LAST(field) - last non-null value
    StdDev,             // STDDEV(field) - standard deviation
    Median,             // MEDIAN(field) - median value
}

pub struct AggregateExpr {
    func: AggregateFunc,
    field: Option<JsonPath>,    // None for COUNT(*)
    alias: String,               // Result column name
}
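
For the combined query shown earlier, the parser would emit one AggregateExpr per SELECT item. The sketch below shows the expected parse result; JsonPath::parse is a hypothetical constructor, not an API this issue defines.

// Hypothetical parse result for:
//   SELECT COUNT(*) as total, AVG(price) as avg_price, SUM(quantity) as total_qty
let aggregates = vec![
    AggregateExpr {
        func: AggregateFunc::Count,
        field: None, // COUNT(*) takes no field
        alias: "total".to_string(),
    },
    AggregateExpr {
        func: AggregateFunc::Avg,
        field: Some(JsonPath::parse("price")), // assumed constructor
        alias: "avg_price".to_string(),
    },
    AggregateExpr {
        func: AggregateFunc::Sum,
        field: Some(JsonPath::parse("quantity")),
        alias: "total_qty".to_string(),
    },
];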

Group By Configuration

pub struct GroupByConfig {
    fields: Vec<JsonPath>,              // Fields to group by
    aggregates: Vec<AggregateExpr>,     // Aggregate functions
}

pub struct AggregateResult {
    groups: HashMap<GroupKey, AggregateValues>,
    total_count: usize,
}

// Example result:
// {
//   "US": { "total": 150, "avg_price": 49.99, "total_qty": 523 },
//   "UK": { "total": 83, "avg_price": 52.10, "total_qty": 291 },
// }
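
GroupKey and AggregateValues are referenced above but not defined. A minimal sketch (an assumption, not part of the proposal) keys each group by the rendered GROUP BY values and stores every computed aggregate as an optional number, so a missing value serializes to JSON null:

use std::collections::HashMap;

/// Values of the GROUP BY fields, in declaration order, rendered to
/// strings so the key is Eq + Hash even when a field holds a float.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct GroupKey(pub Vec<String>);

/// Maps an aggregate's alias to its computed value; None becomes JSON null.
pub type AggregateValues = HashMap<String, Option<f64>>;

Rendering key values to strings sidesteps the fact that serde_json::Value is not hashable; a production version might intern keys instead.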

Aggregation Algorithm

Phase 1: Scan and Group

// Parallel grouping with Rayon
let groups: HashMap<GroupKey, Vec<Value>> = records
    .par_iter()
    .fold(
        || HashMap::new(),
        |mut acc, record| {
            let key = extract_group_key(record, &config.fields);
            acc.entry(key).or_insert_with(Vec::new).push(record.clone());
            acc
        }
    )
    .reduce(
        || HashMap::new(),
        |mut a, b| {
            for (key, mut values) in b {
                a.entry(key).or_insert_with(Vec::new).append(&mut values);
            }
            a
        }
    );
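
extract_group_key is likewise unspecified. Below is a sketch consistent with the GroupKey definition above; JsonPath::lookup is an assumed accessor that resolves a path like user.country against a record.

use serde_json::Value;

// Hypothetical helper: render each GROUP BY field of a record into the
// composite key. Records missing a field all land in a shared "null" group.
fn extract_group_key(record: &Value, fields: &[JsonPath]) -> GroupKey {
    GroupKey(
        fields.iter()
            .map(|path| match path.lookup(record) {
                Some(v) => v.to_string(),   // compact JSON rendering, e.g. "\"US\"" or "42"
                None => "null".to_string(), // missing field groups as null
            })
            .collect(),
    )
}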

Phase 2: Compute Aggregates

fn compute_aggregates(group: &[Value], exprs: &[AggregateExpr]) -> AggregateValues {
    let mut results = HashMap::new();

    for expr in exprs {
        // Every arm yields Option<f64>: None means "no computable value"
        // (empty group, all nulls) and serializes to JSON null.
        let value: Option<f64> = match expr.func {
            AggregateFunc::Count => Some(group.len() as f64), // COUNT(*); COUNT(field) would skip nulls
            AggregateFunc::Sum => Some(
                group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .sum(),
            ),
            AggregateFunc::Avg => {
                let values: Vec<f64> = group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .collect();
                if values.is_empty() {
                    None // avoid dividing by zero on empty or all-null groups
                } else {
                    Some(values.iter().sum::<f64>() / values.len() as f64)
                }
            }
            AggregateFunc::Min => {
                // f64 is not Ord, so fold with f64::min instead of Iterator::min;
                // MIN/MAX over strings or dates would need a comparable extraction.
                group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .fold(None, |min, x| Some(min.map_or(x, |m: f64| m.min(x))))
            }
            // ... other functions
        };
        results.insert(expr.alias.clone(), value);
    }

    results
}
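
extract_number is where the type-aware aggregation and null-handling acceptance criteria land. A sketch follows; whether numeric strings like "100" should coerce is a design choice the issue leaves open, so that branch is marked optional.

use serde_json::Value;

// Resolve the field (None means the record itself) and coerce it to f64.
// Nulls and non-numeric values return None and are skipped by the
// filter_map calls in compute_aggregates above.
fn extract_number(record: &Value, field: &Option<JsonPath>) -> Option<f64> {
    let target = match field {
        Some(path) => path.lookup(record)?, // assumed accessor, as above
        None => record,
    };
    match target {
        Value::Number(n) => n.as_f64(),
        // Optional leniency: accept numeric strings such as "100".
        Value::String(s) => s.trim().parse::<f64>().ok(),
        _ => None, // null, bool, array, object: skipped
    }
}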

Performance Optimizations

  1. Parallel Grouping: Use Rayon's fold/reduce pattern
  2. Streaming Aggregation: For simple cases (no GROUP BY), compute in one pass (see the sketch after this list)
  3. Approximate Aggregates: For huge datasets, sample first N records
  4. Index-Assisted: Use value indices if available
  5. Incremental Updates: Recompute only changed groups
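
For the streaming path (optimization 2), no groups ever need to be materialized; running totals are enough. A single-pass sketch covering COUNT/SUM/AVG/MIN/MAX for one field, reusing the assumed extract_number helper:

#[derive(Default)]
struct RunningStats {
    count: u64,       // COUNT(field): non-null numeric values seen
    sum: f64,         // SUM(field)
    min: Option<f64>, // MIN(field)
    max: Option<f64>, // MAX(field)
}

impl RunningStats {
    fn update(&mut self, x: f64) {
        self.count += 1;
        self.sum += x;
        self.min = Some(self.min.map_or(x, |m| m.min(x)));
        self.max = Some(self.max.map_or(x, |m| m.max(x)));
    }

    fn avg(&self) -> Option<f64> {
        (self.count > 0).then(|| self.sum / self.count as f64)
    }
}

// One pass over the records, O(1) memory regardless of dataset size.
fn stream_aggregate(records: &[Value], field: &Option<JsonPath>) -> RunningStats {
    let mut stats = RunningStats::default();
    for record in records {
        if let Some(x) = extract_number(record, field) {
            stats.update(x);
        }
    }
    stats
}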

UI Integration

Results Display

Option 1: Table View

┌──────────────────────────────────────────────────────┐
│ GROUP BY: category                                   │
├──────────────┬─────────┬────────────┬────────────────┤
│ Category     │ Count   │ Avg Price  │ Total Revenue  │
├──────────────┼─────────┼────────────┼────────────────┤
│ Electronics  │ 1,523   │ $249.99    │ $380,734.77    │
│ Clothing     │ 4,891   │ $45.20     │ $221,073.20    │
│ Books        │ 892     │ $18.50     │ $16,502.00     │
└──────────────┴─────────┴────────────┴────────────────┘

Option 2: Card View (for few groups)

┌────────────────────────────────┐
│ Electronics                    │
│ ────────────────────────────── │
│ Count: 1,523                   │
│ Avg Price: $249.99             │
│ Total Revenue: $380,734.77     │
│ Min: $9.99  Max: $1,999.99     │
└────────────────────────────────┘

Option 3: Visualization (future)

  • Bar chart for counts
  • Pie chart for distribution
  • Line chart for trends over time

Query Builder UI

┌─────────────────────────────────────────────────────┐
│ 📊 Aggregation                                      │
│ ─────────────────────────────────────────────────── │
│ Group By: [category     ▼] [+ Add Field]           │
│                                                     │
│ Aggregates:                                         │
│  1. [COUNT      ▼] (*) as [total         ]  [✕]    │
│  2. [AVG        ▼] [price ▼] as [avg_price  ]  [✕] │
│  3. [SUM        ▼] [revenue▼] as [total_rev ]  [✕] │
│                                                     │
│ [+ Add Aggregate]                                   │
│                                                     │
│ [Apply]  [Export CSV]  [Clear]                     │
└─────────────────────────────────────────────────────┘

Example Queries

// Count all records
"COUNT(*)"

// Count by status
"GROUP BY status SELECT COUNT(*) as total"

// Average price per category
"GROUP BY category SELECT AVG(price) as avg_price"

// Multiple aggregates
"GROUP BY user.country 
 SELECT COUNT(*) as orders, 
        SUM(total) as revenue, 
        AVG(items.length) as avg_items"

// With filtering
"status = \"shipped\" 
 GROUP BY carrier 
 SELECT COUNT(*) as shipments, AVG(delivery_days) as avg_days
 SORT BY shipments DESC"

// Count distinct values
"SELECT COUNT(DISTINCT user_id) as unique_users"

// Min/Max with grouping
"GROUP BY category 
 SELECT MIN(price) as min_price, MAX(price) as max_price"

// Time-based grouping (future)
"GROUP BY DATE_TRUNC('day', created_at) 
 SELECT COUNT(*) as daily_count"

Acceptance Criteria

  • Implement aggregate functions: COUNT, SUM, AVG, MIN, MAX
  • Support COUNT(*) and COUNT(DISTINCT field)
  • Parse GROUP BY with single and multiple fields
  • Parse SELECT with aggregate expressions and aliases
  • Handle nested field paths in aggregates
  • Type-aware aggregation (skip non-numeric for SUM/AVG)
  • Null handling (skip nulls in calculations)
  • Combine with filtering (WHERE clause)
  • Display results in table format
  • Export aggregated results to CSV/JSON
  • Parallel grouping with Rayon
  • Maintain performance (<5s for 100K records)
  • Add comprehensive tests
  • Document aggregation syntax

Edge Cases to Handle

  • Empty groups: No records match filter
    • Return empty result set
  • All nulls: AVG([null, null, null])
    • Return null or 0? (configurable)
  • Division by zero: AVG of empty group
    • Return null
  • Mixed types: SUM(["100", 100, "abc"])
    • Skip non-numeric, warn user
  • Nested arrays: GROUP BY items[*].category
    • Flatten or error? (start with error, flatten in future)
  • Huge groups: 1M unique values
    • Warn user, suggest sampling
  • Memory exhaustion: Grouping creates too many groups
    • Streaming aggregation or external grouping
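
Several of these edge cases pin down exact expected values, so they translate directly into unit tests against the Phase 2 sketch (type and helper names as assumed earlier; the all-nulls case follows the sketch's choice of null, which the issue leaves configurable):

#[cfg(test)]
mod tests {
    use super::*;
    use serde_json::json;

    #[test]
    fn avg_of_all_nulls_is_null() {
        let group = vec![json!({"x": null}), json!({"x": null})];
        let exprs = vec![AggregateExpr {
            func: AggregateFunc::Avg,
            field: Some(JsonPath::parse("x")), // assumed constructor
            alias: "avg_x".to_string(),
        }];
        // Nulls are skipped, so there is nothing to average: expect null, not 0.
        assert_eq!(compute_aggregates(&group, &exprs)["avg_x"], None);
    }

    #[test]
    fn sum_skips_non_numeric_values() {
        let group = vec![json!({"x": 100}), json!({"x": "abc"}), json!({"x": 50})];
        let exprs = vec![AggregateExpr {
            func: AggregateFunc::Sum,
            field: Some(JsonPath::parse("x")),
            alias: "sum_x".to_string(),
        }];
        // "abc" is not numeric and must not poison the sum.
        assert_eq!(compute_aggregates(&group, &exprs)["sum_x"], Some(150.0));
    }
}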

Performance Targets

  • Simple aggregates (no GROUP BY): <100ms for 100K records
  • Single-field GROUP BY (<100 groups): <500ms for 100K records
  • Multi-field GROUP BY (<1K groups): <2s for 100K records
  • High cardinality (10K+ groups): <10s for 100K records

Future Enhancements

  • HAVING clause: Filter groups after aggregation

    GROUP BY category 
      SELECT COUNT(*) as total 
      HAVING total > 100
    
  • Percentiles: PERCENTILE(field, 90) - 90th percentile

  • Mode: MODE(field) - most common value

  • String aggregation: STRING_AGG(name, ', ') - concatenate values

  • Window functions: ROW_NUMBER(), RANK(), LAG(), LEAD()

  • Time-based grouping: DATE_TRUNC('hour', timestamp)

  • Approximate aggregates: HyperLogLog for DISTINCT, T-Digest for percentiles

  • Pivot tables: Transform groups into columns

  • Visualizations: Built-in charts for aggregate results

Non-Goals (Out of Scope)

  • Nested aggregations (aggregates of aggregates)
  • Recursive aggregation
  • User-defined aggregate functions
  • Distributed aggregation
  • Real-time/incremental aggregation

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), feature (New feature requests), performance (Performance and scalability improvements), priority:medium (Medium priority items), size:large (Large effort, 1+ months)
