Aggregation: GROUP BY and Aggregate Functions
Problem Statement
Users analyzing large JSON datasets need to compute statistics and summaries without exporting to external tools. Common analysis tasks like counting records, calculating averages, finding min/max, or grouping by categories are currently impossible within Thoth.
Use Cases:
- Analytics: Count orders by status, average revenue by category
- Monitoring: Find max response time per endpoint, count errors by type
- Data Quality: Count nulls, find duplicates, identify outliers
- Business Intelligence: Sum totals by region, average ratings by product
Proposed Solution
Add SQL-style aggregation with GROUP BY and aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.).
Aggregation Syntax
COUNT(*) # Count all records
COUNT(DISTINCT status) # Count unique values
SUM(price) # Sum numeric field
AVG(rating) # Average value
MIN(created_at), MAX(created_at) # Min/max values
GROUP BY category # Group by field
GROUP BY status, category # Multi-field grouping
Combined Query Example
status = "completed"
GROUP BY user.country
SELECT COUNT(*) as total, AVG(price) as avg_price, SUM(quantity) as total_qty
SORT BY total DESC
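As a rough illustration (not a committed grammar), a single SELECT item could be split into function, field, and alias. The `parse_aggregate` name and its return shape are hypothetical:

```rust
/// Hypothetical parse of one SELECT item such as "AVG(price) as avg_price"
/// into (function, optional field, alias). A real parser would also handle
/// DISTINCT, nested paths, and error reporting.
fn parse_aggregate(expr: &str) -> Option<(String, Option<String>, String)> {
    // Split off the alias; default it to the raw call text if "as" is absent.
    let (call, alias) = match expr.split_once(" as ") {
        Some((c, a)) => (c.trim(), a.trim().to_string()),
        None => (expr.trim(), expr.trim().to_string()),
    };
    let open = call.find('(')?;
    let func = call[..open].trim().to_uppercase();
    let inner = call[open + 1..call.rfind(')')?].trim();
    // COUNT(*) carries no field; everything else names one.
    let field = if inner == "*" { None } else { Some(inner.to_string()) };
    Some((func, field, alias))
}

fn main() {
    assert_eq!(
        parse_aggregate("AVG(price) as avg_price"),
        Some(("AVG".to_string(), Some("price".to_string()), "avg_price".to_string()))
    );
    assert_eq!(
        parse_aggregate("COUNT(*) as total"),
        Some(("COUNT".to_string(), None, "total".to_string()))
    );
}
```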
Implementation Design
Aggregate Functions
pub enum AggregateFunc {
    Count,         // COUNT(*) or COUNT(field)
    CountDistinct, // COUNT(DISTINCT field)
    Sum,           // SUM(field)
    Avg,           // AVG(field)
    Min,           // MIN(field)
    Max,           // MAX(field)
    First,         // FIRST(field) - first non-null value
    Last,          // LAST(field) - last non-null value
    StdDev,        // STDDEV(field) - standard deviation
    Median,        // MEDIAN(field) - median value
}
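Unlike the other functions, CountDistinct needs per-group state (a set of seen values). A minimal sketch, with values pre-extracted to strings rather than pulled from JSON records:

```rust
use std::collections::HashSet;

/// COUNT(DISTINCT field) via a HashSet of (already extracted) values.
/// Sketch only: real code would extract the field from each record and
/// normalize it before insertion.
fn count_distinct<'a>(values: impl IntoIterator<Item = &'a str>) -> usize {
    values.into_iter().collect::<HashSet<_>>().len()
}

fn main() {
    let user_ids = ["u1", "u2", "u1", "u3", "u2"];
    assert_eq!(count_distinct(user_ids), 3); // u1, u2, u3
}
```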
pub struct AggregateExpr {
    func: AggregateFunc,
    field: Option<JsonPath>, // None for COUNT(*)
    alias: String,           // Result column name
}
Group By Configuration
pub struct GroupByConfig {
    fields: Vec<JsonPath>,          // Fields to group by
    aggregates: Vec<AggregateExpr>, // Aggregate functions
}
pub struct AggregateResult {
    groups: HashMap<GroupKey, AggregateValues>,
    total_count: usize,
}
// Example result:
// {
// "US": { "total": 150, "avg_price": 49.99, "total_qty": 523 },
// "UK": { "total": 83, "avg_price": 52.10, "total_qty": 291 },
// }
Aggregation Algorithm
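Phase 1 below keys a HashMap by GroupKey, which the design leaves open. Since raw JSON values (floats in particular) do not implement Hash/Eq, one hedged option is a normalized string key; the shape below is a sketch, not a committed representation:

```rust
use std::collections::HashMap;

/// Hypothetical GroupKey: the canonical string form of each grouped
/// field's value, in field order. Normalizing to strings sidesteps the
/// fact that JSON numbers (f64) are not Hash/Eq.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct GroupKey(pub Vec<String>);

fn main() {
    // GROUP BY country, status over three toy records
    let mut groups: HashMap<GroupKey, usize> = HashMap::new();
    for (country, status) in [("US", "completed"), ("UK", "completed"), ("US", "completed")] {
        let key = GroupKey(vec![country.to_string(), status.to_string()]);
        *groups.entry(key).or_insert(0) += 1;
    }
    assert_eq!(groups.len(), 2); // ("US","completed") and ("UK","completed")
}
```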
Phase 1: Scan and Group
// Parallel grouping with Rayon
let groups: HashMap<GroupKey, Vec<Value>> = records
    .par_iter()
    .fold(
        || HashMap::new(),
        |mut acc, record| {
            let key = extract_group_key(record, &config.fields);
            acc.entry(key).or_insert_with(Vec::new).push(record.clone());
            acc
        },
    )
    .reduce(
        || HashMap::new(),
        |mut a, b| {
            for (key, mut values) in b {
                a.entry(key).or_insert_with(Vec::new).append(&mut values);
            }
            a
        },
    );
Phase 2: Compute Aggregates
fn compute_aggregates(group: &[Value], exprs: &[AggregateExpr]) -> AggregateValues {
    let mut results = HashMap::new();
    for expr in exprs {
        // Every arm yields Option<Value>; None maps to null in the output.
        let value = match expr.func {
            AggregateFunc::Count => Some(Value::from(group.len() as u64)),
            AggregateFunc::Sum => {
                let sum: f64 = group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .sum();
                Some(Value::from(sum))
            }
            AggregateFunc::Avg => {
                let values: Vec<f64> = group.iter()
                    .filter_map(|v| extract_number(v, &expr.field))
                    .collect();
                if values.is_empty() {
                    None // AVG of an empty/all-null group is null, not a division by zero
                } else {
                    Some(Value::from(values.iter().sum::<f64>() / values.len() as f64))
                }
            }
            AggregateFunc::Min => {
                group.iter()
                    .filter_map(|v| extract_comparable(v, &expr.field))
                    .min()
                    .map(Value::from) // assumes the comparable wrapper converts back to Value
            }
            // ... other functions
        };
        results.insert(expr.alias.clone(), value);
    }
    results
}
Performance Optimizations
- Parallel Grouping: Use Rayon's fold/reduce pattern
- Streaming Aggregation: For simple cases (no GROUP BY), compute in one pass
- Approximate Aggregates: For huge datasets, sample first N records
- Index-Assisted: Use value indices if available
- Incremental Updates: Recompute only changed groups
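The streaming case above (no GROUP BY) can be sketched as a single pass that keeps O(1) state per aggregate. Field extraction is elided here; plain f64s stand in for JSON values:

```rust
/// Single-pass streaming aggregates for the no-GROUP-BY case: count, sum,
/// min, and max in one scan with constant memory; avg falls out as sum/count.
struct Streaming { count: u64, sum: f64, min: f64, max: f64 }

/// Returns None for an empty input, matching the "empty group -> null" rule.
fn aggregate(values: &[f64]) -> Option<Streaming> {
    let mut it = values.iter();
    let first = *it.next()?;
    let mut s = Streaming { count: 1, sum: first, min: first, max: first };
    for &v in it {
        s.count += 1;
        s.sum += v;
        s.min = s.min.min(v);
        s.max = s.max.max(v);
    }
    Some(s)
}

fn main() {
    let s = aggregate(&[3.0, 1.0, 4.0]).unwrap();
    assert_eq!(s.count, 3);
    assert_eq!(s.sum, 8.0);
    assert_eq!(s.min, 1.0);
    assert_eq!(s.max, 4.0);
    assert!(aggregate(&[]).is_none());
}
```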
UI Integration
Results Display
Option 1: Table View
┌──────────────────────────────────────────────────────┐
│ GROUP BY: category │
├──────────────┬─────────┬────────────┬────────────────┤
│ Category │ Count │ Avg Price │ Total Revenue │
├──────────────┼─────────┼────────────┼────────────────┤
│ Electronics │ 1,523 │ $249.99 │ $380,734.77 │
│ Clothing │ 4,891 │ $45.20 │ $221,073.20 │
│ Books │ 892 │ $18.50 │ $16,502.00 │
└──────────────┴─────────┴────────────┴────────────────┘
Option 2: Card View (for few groups)
┌────────────────────────────────┐
│ Electronics │
│ ────────────────────────────── │
│ Count: 1,523 │
│ Avg Price: $249.99 │
│ Total Revenue: $380,734.77 │
│ Min: $9.99 Max: $1,999.99 │
└────────────────────────────────┘
Option 3: Visualization (future)
- Bar chart for counts
- Pie chart for distribution
- Line chart for trends over time
Query Builder UI
┌─────────────────────────────────────────────────────┐
│ 📊 Aggregation │
│ ─────────────────────────────────────────────────── │
│ Group By: [category ▼] [+ Add Field] │
│ │
│ Aggregates: │
│ 1. [COUNT ▼] (*) as [total ] [✕] │
│ 2. [AVG ▼] [price ▼] as [avg_price ] [✕]│
│ 3. [SUM ▼] [revenue▼] as [total_rev ] [✕]│
│ │
│ [+ Add Aggregate] │
│ │
│ [Apply] [Export CSV] [Clear] │
└─────────────────────────────────────────────────────┘
Example Queries
// Count all records
"COUNT(*)"
// Count by status
"GROUP BY status SELECT COUNT(*) as total"
// Average price per category
"GROUP BY category SELECT AVG(price) as avg_price"
// Multiple aggregates
"GROUP BY user.country
SELECT COUNT(*) as orders,
SUM(total) as revenue,
AVG(items.length) as avg_items"
// With filtering
"status = \"shipped\"
GROUP BY carrier
SELECT COUNT(*) as shipments, AVG(delivery_days) as avg_days
SORT BY shipments DESC"
// Count distinct values
"SELECT COUNT(DISTINCT user_id) as unique_users"
// Min/Max with grouping
"GROUP BY category
SELECT MIN(price) as min_price, MAX(price) as max_price"
// Time-based grouping (future)
"GROUP BY DATE_TRUNC('day', created_at)
SELECT COUNT(*) as daily_count"
Acceptance Criteria
- Implement aggregate functions: COUNT, SUM, AVG, MIN, MAX
- Support COUNT(*) and COUNT(DISTINCT field)
- Parse GROUP BY with single and multiple fields
- Parse SELECT with aggregate expressions and aliases
- Handle nested field paths in aggregates
- Type-aware aggregation (skip non-numeric for SUM/AVG)
- Null handling (skip nulls in calculations)
- Combine with filtering (WHERE clause)
- Display results in table format
- Export aggregated results to CSV/JSON
- Parallel grouping with Rayon
- Maintain performance (<5s for 100K records)
- Add comprehensive tests
- Document aggregation syntax
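For the CSV-export criterion, one possible shape is a row per group with the group key first. This is a minimal sketch; `to_csv` is hypothetical and string keys stand in for GroupKey:

```rust
use std::collections::HashMap;

/// Hypothetical CSV export of aggregated results: header row, then one
/// row per group (group key first, aggregate columns in a fixed order).
fn to_csv(groups: &HashMap<String, Vec<f64>>, header: &[&str]) -> String {
    let mut out = header.join(",") + "\n";
    let mut keys: Vec<_> = groups.keys().collect();
    keys.sort(); // deterministic row order for stable output
    for key in keys {
        let vals: Vec<String> = groups[key].iter().map(|v| v.to_string()).collect();
        out.push_str(&format!("{},{}\n", key, vals.join(",")));
    }
    out
}

fn main() {
    let mut groups = HashMap::new();
    groups.insert("UK".to_string(), vec![83.0, 52.1]);
    groups.insert("US".to_string(), vec![150.0, 49.99]);
    let csv = to_csv(&groups, &["country", "total", "avg_price"]);
    assert!(csv.starts_with("country,total,avg_price\n"));
    assert!(csv.contains("UK,83,52.1\n"));
}
```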
Edge Cases to Handle
- Empty groups: no records match the filter
  - Return an empty result set
- All nulls: AVG([null, null, null])
  - Return null or 0? (configurable)
- Division by zero: AVG of an empty group
  - Return null
- Mixed types: SUM(["100", 100, "abc"])
  - Skip non-numeric values, warn the user
- Nested arrays: GROUP BY items[*].category
  - Flatten or error? (start with error, add flattening later)
- Huge groups: 1M unique values
  - Warn the user, suggest sampling
- Memory exhaustion: grouping creates too many groups
  - Fall back to streaming aggregation or external grouping
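The all-null and mixed-type cases above can both be handled by one coercion helper that yields Option. This sketch uses a stand-in Json enum and the hypothetical `extract_number`/`avg` names:

```rust
/// Stand-in for a JSON value; the real type would be richer.
#[derive(Clone)]
enum Json { Null, Num(f64), Str(String) }

/// Hypothetical extract_number: coerces values to f64, skipping nulls and
/// non-numeric strings (the "mixed types" edge case).
fn extract_number(v: &Json) -> Option<f64> {
    match v {
        Json::Num(n) => Some(*n),
        Json::Str(s) => s.parse::<f64>().ok(), // "100" counts, "abc" is skipped
        Json::Null => None,
    }
}

/// AVG over an empty or all-null group returns None (null) instead of
/// dividing by zero.
fn avg(group: &[Json]) -> Option<f64> {
    let nums: Vec<f64> = group.iter().filter_map(extract_number).collect();
    if nums.is_empty() { None } else { Some(nums.iter().sum::<f64>() / nums.len() as f64) }
}

fn main() {
    let mixed = [Json::Str("100".into()), Json::Num(100.0), Json::Str("abc".into()), Json::Null];
    assert_eq!(avg(&mixed), Some(100.0)); // "abc" and null are skipped
    assert_eq!(avg(&[Json::Null, Json::Null]), None); // all-null group -> null
}
```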
Performance Targets
- Simple aggregates (no GROUP BY): <100ms for 100K records
- Single-field GROUP BY (<100 groups): <500ms for 100K records
- Multi-field GROUP BY (<1K groups): <2s for 100K records
- High cardinality (10K+ groups): <10s for 100K records
Related Features
- Issue #53: Advanced Filtering: Database-style Query Operations (WHERE clause)
- Issue #54: Sorting: Multi-Column Sort with Custom Comparators (ORDER BY for aggregate results)
- Future: JOINs across multiple files
Future Enhancements
- HAVING clause: filter groups after aggregation, e.g. GROUP BY category SELECT COUNT(*) as total HAVING total > 100
- Percentiles: PERCENTILE(field, 90) for the 90th percentile
- Mode: MODE(field) for the most common value
- String aggregation: STRING_AGG(name, ', ') to concatenate values
- Window functions: ROW_NUMBER(), RANK(), LAG(), LEAD()
- Time-based grouping: DATE_TRUNC('hour', timestamp)
- Approximate aggregates: HyperLogLog for DISTINCT, T-Digest for percentiles
- Pivot tables: transform groups into columns
- Visualizations: built-in charts for aggregate results
Non-Goals (Out of Scope)
- Nested aggregations (aggregates of aggregates)
- Recursive aggregation
- User-defined aggregate functions
- Distributed aggregation
- Real-time/incremental aggregation
References
- SQL Aggregate Functions
- jq group_by
- PostgreSQL Aggregate Functions
- DataFrame operations (pandas inspiration)