@ajroetker
Contributor

Addresses #2243 and ports #2242 to this new aggregations style as well.

(feat) Add aggregations framework to enable analytics on search results

Enable powerful analytics and data exploration capabilities that go beyond
simple faceting. Users can now compute metrics (sum, avg, min, max, count,
sumsquares, stats) across search results and group them by field values or
ranges with nested sub-aggregations for multi-dimensional analysis.

Problems addressed:
- Computing statistics across filtered result sets (e.g., "average price of
  products matching 'laptop'")
- Multi-level grouping and metrics (e.g., "total sales per region per category")
- Complex analytics queries without requiring separate aggregation passes

Notes:
- Metric aggregations: sum, avg, min, max, count, sumsquares, stats
- Bucket aggregations: terms (group by values), range (group by ranges)
- Nested sub-aggregations for multi-dimensional analytics
- Computed efficiently during query execution using visitor pattern
- Fully backward compatible - Facets API unchanged
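
As a rough illustration of the metric list above, here is a minimal, self-contained Go sketch. It does not use bleve's types: `statsAcc` and its methods are hypothetical names chosen for this example. The point it tries to show is that count, sum, min, max, and sum of squares are enough to derive avg and variance, and that partial results can be merged by combining those raw values rather than the derived ones.

```go
package main

import (
	"fmt"
	"math"
)

// statsAcc is a hypothetical accumulator for illustration; it is not a bleve type.
type statsAcc struct {
	count      int64
	sum        float64
	sumSquares float64
	min, max   float64
}

// newStatsAcc starts min/max at +Inf/-Inf so the first value always wins.
func newStatsAcc() *statsAcc {
	return &statsAcc{min: math.Inf(1), max: math.Inf(-1)}
}

// add folds one numeric field value into the running totals.
func (s *statsAcc) add(v float64) {
	s.count++
	s.sum += v
	s.sumSquares += v * v
	s.min = math.Min(s.min, v)
	s.max = math.Max(s.max, v)
}

// merge combines another partial result (for example, from another index shard).
func (s *statsAcc) merge(o *statsAcc) {
	s.count += o.count
	s.sum += o.sum
	s.sumSquares += o.sumSquares
	s.min = math.Min(s.min, o.min)
	s.max = math.Max(s.max, o.max)
}

// avg is derived from sum and count, which is why both must survive a merge.
func (s *statsAcc) avg() float64 {
	if s.count == 0 {
		return 0
	}
	return s.sum / float64(s.count)
}

// variance is the population variance: E[x^2] - (E[x])^2.
func (s *statsAcc) variance() float64 {
	if s.count == 0 {
		return 0
	}
	m := s.avg()
	return s.sumSquares/float64(s.count) - m*m
}

func main() {
	a, b := newStatsAcc(), newStatsAcc()
	for _, v := range []float64{1, 2} {
		a.add(v)
	}
	for _, v := range []float64{3, 4} {
		b.add(v)
	}
	a.merge(b)
	fmt.Println(a.count, a.sum, a.avg(), a.min, a.max, a.variance())
}
```

Averaging two averages would give the wrong answer when the partial results have different counts, which is why this sketch merges sums and counts and only divides at the end.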
  
(feat) Add prefix and regex filtering to terms aggregations (port of https://github.com/blevesearch/bleve/pull/2242)

Enable search-as-you-type style aggregations where bucket terms dynamically
match user input. Users can now aggregate by field values that match what's
being typed in a search box, making autosuggestions cleaner and more focused
(e.g., as user types "ste", show matching authors, titles, categories all
filtered to terms starting with "ste").

Problems addressed:
- Dynamic faceted autosuggestions that update as users type
- Filtering high-cardinality fields to relevant matches only
- Consistent filtering API between facets and aggregations (ports existing
  facet filtering feature)

Notes:
- Add TermPrefix and TermPattern fields to AggregationRequest
- Pre-compile regex patterns in NewTermsAggregation (now returns error)
- Add NewTermsAggregationWithFilter helper

Example - average price per brand:
  byBrand := bleve.NewTermsAggregation("brand", 10)
  byBrand.AddSubAggregation("avg_price", bleve.NewAggregationRequest("avg", "price"))
  searchRequest.Aggregations = bleve.AggregationsRequest{"by_brand": byBrand}

Performance benefits:
- Zero-allocation filtering - only matching terms convert from []byte to string
- Filters apply before bucket creation and sub-aggregation processing
- Fast prefix checks with bytes.HasPrefix before regex evaluation


Example - autocomplete aggregation:
  agg, _ := bleve.NewTermsAggregationWithFilter("brand", 10, userInput, "")
@ajroetker
Contributor Author

I promise this is my last big change @abhinavdangeti
I've been saving these up for a while on local branches and wanted to contribute back upstream!
I've been experimenting with how to use SIMD to accelerate substring matching and regex matching in zapx code as well, and have been meaning to look into async io (https://github.com/Iceber/iouring-go) integration too (was curious if y'all had already played around with that or not).

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive aggregations framework to enable numeric analytics and data exploration on search results. The implementation adds metric aggregations (sum, avg, min, max, count, sumsquares, stats) and bucket aggregations (terms, range) with support for nested sub-aggregations. Additionally, it ports the prefix and regex filtering feature from PR #2242 to enable dynamic term filtering in aggregations.

Key changes include:

  • New aggregation API with AggregationRequest and AggregationsRequest types that integrate seamlessly with the existing SearchRequest
  • Visitor pattern-based implementation that computes aggregations during query execution with zero additional I/O overhead
  • Support for multi-level nested aggregations enabling complex analytical queries
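
To make the visitor idea in the list above more concrete, here is a small, self-contained Go sketch. It is only an approximation of the pattern, not the PR's code: `fieldVisitor`, `sumAgg`, `countAgg`, `doc`, and `collect` are hypothetical names. It assumes each hit's field values are walked once and every value is handed to whichever aggregations registered interest in that field, so no second pass over the results is needed.

```go
package main

import "fmt"

// fieldVisitor is a hypothetical interface for this sketch.
type fieldVisitor interface {
	Field() string
	Visit(value float64)
}

// sumAgg adds up every value it is shown.
type sumAgg struct {
	field string
	total float64
}

func (s *sumAgg) Field() string       { return s.field }
func (s *sumAgg) Visit(value float64) { s.total += value }

// countAgg counts how many values it is shown.
type countAgg struct {
	field string
	n     int
}

func (c *countAgg) Field() string   { return c.field }
func (c *countAgg) Visit(_ float64) { c.n++ }

// doc stands in for the numeric field values of one search hit.
type doc map[string]float64

// collect walks the hits once and dispatches each field value to the
// aggregations registered for that field.
func collect(hits []doc, aggs ...fieldVisitor) {
	byField := map[string][]fieldVisitor{}
	for _, a := range aggs {
		byField[a.Field()] = append(byField[a.Field()], a)
	}
	for _, hit := range hits {
		for field, value := range hit {
			for _, a := range byField[field] {
				a.Visit(value)
			}
		}
	}
}

func main() {
	sum := &sumAgg{field: "price"}
	count := &countAgg{field: "price"}
	hits := []doc{{"price": 10, "rating": 4}, {"price": 20}, {"rating": 5}}
	collect(hits, sum, count)
	fmt.Println(sum.total, count.n)
}
```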

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| search_no_knn.go | Adds Aggregations field to SearchRequest for non-KNN searches |
| search_knn.go | Adds Aggregations field to SearchRequest for KNN-enabled searches |
| search.go | Defines AggregationRequest, AggregationsRequest types and helper constructors; adds Aggregations field to SearchResult |
| search/collector/topn.go | Integrates aggregations into the collector with SetAggregationsBuilder, field deduplication, and visitor callbacks |
| search/aggregations_builder.go | Implements the core AggregationsBuilder that manages multiple aggregation builders and coordinates field visits |
| search/aggregations_builder_test.go | Unit tests for AggregationResults.Merge functionality covering various aggregation types |
| search/aggregation/numeric_aggregation.go | Implements metric aggregations (sum, avg, min, max, count, sumsquares, stats) with proper numeric decoding |
| search/aggregation/numeric_aggregation_test.go | Comprehensive unit tests for all metric aggregations including edge cases |
| search/aggregation/bucket_aggregation.go | Implements bucket aggregations (terms, range) with sub-aggregation support and term filtering |
| search/aggregation/optimized_numeric_aggregation.go | Provides infrastructure for segment-level optimization (currently placeholder with bugs in implementation) |
| index/scorch/segment_aggregation_stats.go | Implements segment-level statistics caching for future optimizations |
| index_impl.go | Adds buildAggregation function to convert AggregationRequest to AggregationBuilder and wires aggregations into search execution |
| aggregation_test.go | Integration tests for metric aggregations using real index |
| bucket_aggregation_test.go | Integration tests for bucket aggregations with sub-aggregations |
| docs/aggregations.md | Comprehensive documentation covering architecture, API, examples, and performance characteristics |

@ajroetker ajroetker force-pushed the ajroetker/add-bleve-aggregations branch from bad5c51 to 8a44955 Compare November 13, 2025 17:38
Fixes bug in nested bucket aggregations where metric values were
duplicated due to duplicate field registration in SubAggregationFields().
Also fixes StartDoc/EndDoc lifecycle for bucket sub-aggregations and
min/max comparison logic in optimized aggregations.

Adds Clone() method to AggregationBuilder interface for proper deep
copying of nested aggregation hierarchies. Adopts setter pattern for
aggregation filters (SetPrefixFilter, SetRegexFilter).
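
One way to picture why a `Clone()` method helps with nested hierarchies is the self-contained Go sketch below. It is an assumption-laden illustration, not the PR's interface: `builder`, `sumBuilder`, and `group` are hypothetical names, and the sketch assumes a clone copies the structure recursively while starting with fresh state, so each bucket accumulates into its own sub-aggregations.

```go
package main

import "fmt"

// builder is a hypothetical stand-in for an aggregation builder interface.
type builder interface {
	Add(v float64)
	Clone() builder
}

// sumBuilder is a leaf metric.
type sumBuilder struct{ total float64 }

func (s *sumBuilder) Add(v float64) { s.total += v }

// Clone returns a new leaf with empty state (an assumption of this sketch).
func (s *sumBuilder) Clone() builder { return &sumBuilder{} }

// group holds named sub-aggregations and forwards values to all of them.
type group struct {
	subs map[string]builder
}

func (g *group) Add(v float64) {
	for _, s := range g.subs {
		s.Add(v)
	}
}

// Clone copies the hierarchy recursively, so no child is shared between copies.
func (g *group) Clone() builder {
	c := &group{subs: make(map[string]builder, len(g.subs))}
	for name, s := range g.subs {
		c.subs[name] = s.Clone()
	}
	return c
}

func main() {
	template := &group{subs: map[string]builder{"sum": &sumBuilder{}}}

	// Each bucket gets its own deep copy of the sub-aggregation tree.
	bucketA := template.Clone().(*group)
	bucketB := template.Clone().(*group)
	bucketA.Add(5)
	bucketB.Add(7)

	fmt.Println(
		bucketA.subs["sum"].(*sumBuilder).total,
		bucketB.subs["sum"].(*sumBuilder).total,
		template.subs["sum"].(*sumBuilder).total,
	)
}
```

If the buckets shared one `sumBuilder` instead of cloning it, both buckets would report the combined total, which is the kind of duplicated metric value a shallow copy can produce.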
@ajroetker ajroetker force-pushed the ajroetker/add-bleve-aggregations branch from 8a44955 to 5723569 Compare November 13, 2025 17:45
@ajroetker ajroetker requested a review from Copilot November 13, 2025 19:11
Copilot finished reviewing on behalf of ajroetker November 13, 2025 19:15
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

@abhinavdangeti abhinavdangeti added this to the v2.6.0 milestone Nov 13, 2025
@abhinavdangeti
Member

Thanks @ajroetker, allow us to review your work here.
We want to consider this and #2242 for our next major release.

AJ Roetker added 2 commits November 13, 2025 14:23
- Fix double-counting in bucket aggregations with sawValue guard
- Remove unused count fields from Sum and SumSquares aggregations
- Move StatsResult to search package for cleaner stats merging
- Add field deduplication and validation for term filters
Also properly adds merge support for average aggregations
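
The commit notes above mention a `sawValue` guard without showing it. One plausible reading, sketched below in self-contained Go, is a per-document flag that is reset in `StartDoc`, set when a field value is visited, and checked in `EndDoc`, so a document with several values for the same field is counted once. `docCounter` and its methods are hypothetical names for this sketch and may not match the PR's code.

```go
package main

import "fmt"

// docCounter counts documents that have at least one value for a field.
// It is a hypothetical type for illustration only.
type docCounter struct {
	count    int
	sawValue bool // true once the current document has produced a value
}

// StartDoc resets the guard at the beginning of every document.
func (d *docCounter) StartDoc() { d.sawValue = false }

// Visit is called once per field value; it only records that a value exists.
func (d *docCounter) Visit(_ float64) { d.sawValue = true }

// EndDoc increments the count at most once per document.
func (d *docCounter) EndDoc() {
	if d.sawValue {
		d.count++
	}
}

func main() {
	// Three documents: one with two values, one with one value, one with none.
	docs := [][]float64{{1, 2}, {3}, {}}

	c := &docCounter{}
	for _, values := range docs {
		c.StartDoc()
		for _, v := range values {
			c.Visit(v)
		}
		c.EndDoc()
	}
	fmt.Println(c.count)
}
```

Incrementing the count inside `Visit` instead would count the first document twice, which is the double-counting this kind of guard avoids.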
@ajroetker
Contributor Author

@abhinavdangeti I've also got implementations for histograms, date histograms, geo hashing buckets, geo distance buckets, and cardinality (via HyperLogLog++ sketches). Let me know whether it would be more helpful to include them here or leave them to a later PR for consideration.

@ajroetker ajroetker requested a review from Copilot November 21, 2025 18:42
Copilot finished reviewing on behalf of ajroetker November 21, 2025 18:46
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 8 comments.

Comment on lines +133 to +140
```go
ranges := []*bleve.numericRange{
	{Name: "low", Min: nil, Max: &mid},
	{Name: "medium", Min: &mid, Max: &max},
	{Name: "high", Min: &max, Max: nil},
}

agg := bleve.NewRangeAggregation("price", ranges)
```

Copilot AI Nov 21, 2025

The type bleve.numericRange is not exported (it's lowercase). The correct type reference in the documentation should be the actual internal type name. Since this is user-facing documentation, consider providing a clearer example that doesn't reference the unexported type directly, or note that users should use the helper functions provided.
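
For readers unsure how ranges with `nil` bounds behave, here is a self-contained Go sketch that avoids the unexported bleve type altogether. `priceRange`, `contains`, and `bucketFor` are hypothetical names, and the sketch assumes a `nil` bound means "unbounded" and that ranges are min-inclusive and max-exclusive; the PR may define the boundaries differently.

```go
package main

import "fmt"

// priceRange is a hypothetical local type, not bleve's numericRange.
type priceRange struct {
	Name string
	Min  *float64 // nil means no lower bound
	Max  *float64 // nil means no upper bound
}

// contains assumes Min is inclusive and Max is exclusive.
func (r priceRange) contains(v float64) bool {
	if r.Min != nil && v < *r.Min {
		return false
	}
	if r.Max != nil && v >= *r.Max {
		return false
	}
	return true
}

// bucketFor returns the name of the first range containing v, or "" if none does.
func bucketFor(ranges []priceRange, v float64) string {
	for _, r := range ranges {
		if r.contains(v) {
			return r.Name
		}
	}
	return ""
}

func main() {
	mid, max := 100.0, 500.0
	ranges := []priceRange{
		{Name: "low", Min: nil, Max: &mid},
		{Name: "medium", Min: &mid, Max: &max},
		{Name: "high", Min: &max, Max: nil},
	}
	for _, price := range []float64{50, 100, 499.99, 500} {
		fmt.Println(price, bucketFor(ranges, price))
	}
}
```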

ajroetker and others added 2 commits November 25, 2025 10:52
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>