Skip to content

Conversation

ahkcs
Copy link
Contributor

@ahkcs ahkcs commented Jul 14, 2025

Summary

This PR implements a complete separation of the bin command's span functionality from aggregation span logic, ensuring backward compatibility while introducing robust SPL-compatible binning behavior. This includes support for parameters such as span, bins, minspan, start, end, and aligntime.


Related Issues

Resolves #3876

Test Results

Tests were run on the following real datasets:

  • accounts.json: 1000 records, age values 20–40, balance values ~1000–50000
  • time_test_data.json: 100 records, timestamps from 2025-07-28 to 2025-08-01

1. span Parameter (Numeric)

Query:

source=accounts | bin age span=10 | stats count() by age | head 5

Result:

[
  [504, "30-40"],
  [451, "20-30"],
  [45,  "40-50"]
]

2. bins Parameter

Query:

source=accounts | bin age bins=3 | stats count() by age | head 5

Result:

[
  [504, "30-40"],
  [451, "20-30"],
  [45,  "40-50"]
]

3. minspan Parameter

Query:

source=accounts | bin balance minspan=5000 | stats count() by balance | head 5

Result:

[
  [187, "30000-40000"],
  [215, "40000-50000"],
  [213, "10000-20000"],
  [168, "0-10000"],
  [217, "20000-30000"]
]

4. start and end Parameters

Query:

source=accounts | bin balance span=10000 start=5000 end=45000 | stats count() by balance | head 5

Result:

[
  [187, "30000-40000"],
  [215, "40000-50000"],
  [213, "10000-20000"],
  [168, "0-10000"],
  [217, "20000-30000"]
]

Logical Plan:


5. aligntime="@d" Parameter

Query:

source=time_test | bin @timestamp span=1hour aligntime=\"@d+3h\" | fields @timestamp | head 5

Result:

 ["2025-07-28 00:00:00"],
 ["2025-07-28 01:00:00"],
 ["2025-07-28 02:00:00"],
 ["2025-07-28 03:00:00"],
 ["2025-07-28 04:00:00"]

Bin Performance Test

=== BIN COMMAND PERFORMANCE TEST (SIMPLIFIED) ===
Testing pure bin operations on big5 index

=== NUMERIC BIN OPERATIONS (metrics.size field) ===
🧪 Testing: Bin metrics.size with span=500
📝 Query: source=big5 | bin `metrics.size` span=500 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 16.05ms
   ✅ Iteration 2: 15.63ms
   ✅ Iteration 3: 15.68ms
   ✅ Iteration 4: 15.55ms
   ✅ Iteration 5: 15.73ms
📊 Average time: 15.72ms (5/5 successful)

🧪 Testing: Bin metrics.size with span=1000
📝 Query: source=big5 | bin `metrics.size` span=1000 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 16.09ms
   ✅ Iteration 2: 16.42ms
   ✅ Iteration 3: 16.18ms
   ✅ Iteration 4: 16.02ms
   ✅ Iteration 5: 16.06ms
📊 Average time: 16.15ms (5/5 successful)

🧪 Testing: Bin metrics.size with bins=5
📝 Query: source=big5 | bin `metrics.size` bins=5 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.93ms
   ✅ Iteration 2: 15.99ms
   ✅ Iteration 3: 15.74ms
   ✅ Iteration 4: 15.73ms
   ✅ Iteration 5: 15.95ms
📊 Average time: 15.86ms (5/5 successful)

🧪 Testing: Bin metrics.size with minspan=100
📝 Query: source=big5 | bin `metrics.size` minspan=100 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 16.59ms
   ✅ Iteration 2: 16.24ms
   ✅ Iteration 3: 16.40ms
   ✅ Iteration 4: 16.02ms
   ✅ Iteration 5: 16.35ms
📊 Average time: 16.32ms (5/5 successful)

🧪 Testing: Basic bin metrics.size span=200
📝 Query: source=big5 | bin `metrics.size` span=200 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.52ms
   ✅ Iteration 2: 15.70ms
   ✅ Iteration 3: 15.50ms
   ✅ Iteration 4: 15.36ms
   ✅ Iteration 5: 15.28ms
📊 Average time: 15.47ms (5/5 successful)

🧪 Testing: Bin metrics.size with span=5000
📝 Query: source=big5 | bin `metrics.size` span=5000 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.57ms
   ✅ Iteration 2: 15.37ms
   ✅ Iteration 3: 15.25ms
   ✅ Iteration 4: 15.37ms
   ✅ Iteration 5: 15.23ms
📊 Average time: 15.35ms (5/5 successful)

🧪 Testing: Bin metrics.size field only
📝 Query: source=big5 | bin `metrics.size` span=2000 | fields `metrics.size` | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 22.18ms
   ✅ Iteration 2: 21.70ms
   ✅ Iteration 3: 21.37ms
   ✅ Iteration 4: 21.50ms
   ✅ Iteration 5: 21.46ms
📊 Average time: 21.64ms (5/5 successful)

🧪 Testing: Bin metrics.size with bins=5 start=0 end=10000
📝 Query: source=big5 | bin `metrics.size` bins=5 start=0 end=10000 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.66ms
   ✅ Iteration 2: 15.86ms
   ✅ Iteration 3: 15.73ms
   ✅ Iteration 4: 15.83ms
   ✅ Iteration 5: 15.75ms
📊 Average time: 15.76ms (5/5 successful)

🧪 Testing: Bin metrics.size with bins=10 start=0 end=20000
📝 Query: source=big5 | bin `metrics.size` bins=10 start=0 end=20000 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.85ms
   ✅ Iteration 2: 15.72ms
   ✅ Iteration 3: 15.57ms
   ✅ Iteration 4: 15.66ms
   ✅ Iteration 5: 15.95ms
📊 Average time: 15.75ms (5/5 successful)

=== TIMESTAMP BIN OPERATIONS ===
🧪 Testing: Bin @timestamp with span=1h
📝 Query: source=big5 | bin `@timestamp` span=1h | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 23.84ms
   ✅ Iteration 2: 23.19ms
   ✅ Iteration 3: 23.99ms
   ✅ Iteration 4: 23.74ms
   ✅ Iteration 5: 23.97ms
📊 Average time: 23.74ms (5/5 successful)

🧪 Testing: Bin @timestamp with span=4h
📝 Query: source=big5 | bin `@timestamp` span=4h | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 24.18ms
   ✅ Iteration 2: 23.99ms
   ✅ Iteration 3: 23.66ms
   ✅ Iteration 4: 24.13ms
   ✅ Iteration 5: 24.03ms
📊 Average time: 23.99ms (5/5 successful)

🧪 Testing: Bin @timestamp span=1d
📝 Query: source=big5 | bin `@timestamp` span=1d | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 24.99ms
   ✅ Iteration 2: 25.33ms
   ✅ Iteration 3: 24.90ms
   ✅ Iteration 4: 26.67ms
   ✅ Iteration 5: 25.53ms
📊 Average time: 25.48ms (5/5 successful)

🧪 Testing: Bin @timestamp with span=4mon
📝 Query: source=big5 | bin `@timestamp` span=4mon as cate | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 26.23ms
   ✅ Iteration 2: 25.94ms
   ✅ Iteration 3: 26.20ms
   ✅ Iteration 4: 25.89ms
   ✅ Iteration 5: 25.63ms
📊 Average time: 25.97ms (5/5 successful)

🧪 Testing: Bin metrics.tmin with span=50
📝 Query: source=big5 | bin `metrics.tmin` span=50 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 14.95ms
   ✅ Iteration 2: 15.03ms
   ✅ Iteration 3: 15.17ms
   ✅ Iteration 4: 15.21ms
   ✅ Iteration 5: 15.14ms
📊 Average time: 15.10ms (5/5 successful)

🧪 Testing: Bin metrics.size with bins=2
📝 Query: source=big5 | bin `metrics.size` bins=2 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.47ms
   ✅ Iteration 2: 15.17ms
   ✅ Iteration 3: 15.08ms
   ✅ Iteration 4: 15.06ms
   ✅ Iteration 5: 15.29ms
📊 Average time: 15.21ms (5/5 successful)

🧪 Testing: Bin metrics.size with bins=21
📝 Query: source=big5 | bin `metrics.size` bins=21 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.26ms
   ✅ Iteration 2: 14.91ms
   ✅ Iteration 3: 14.85ms
   ✅ Iteration 4: 14.88ms
   ✅ Iteration 5: 14.97ms
📊 Average time: 14.97ms (5/5 successful)

🧪 Testing: Bin metrics.size with bins=49
📝 Query: source=big5 | bin `metrics.size` bins=49 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.19ms
   ✅ Iteration 2: 14.98ms
   ✅ Iteration 3: 15.04ms
   ✅ Iteration 4: 15.19ms
   ✅ Iteration 5: 15.19ms
📊 Average time: 15.11ms (5/5 successful)

🧪 Testing: Bin metrics.tmin with minspan=1001
📝 Query: source=big5 | bin `metrics.tmin` minspan=1001 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 14.99ms
   ✅ Iteration 2: 15.23ms
   ✅ Iteration 3: 14.85ms
   ✅ Iteration 4: 14.98ms
   ✅ Iteration 5: 15.20ms
📊 Average time: 15.05ms (5/5 successful)

🧪 Testing: Bin metrics.tmin start=0 end=1001
📝 Query: source=big5 | bin `metrics.tmin` start=0 end=1001 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 16.04ms
   ✅ Iteration 2: 15.67ms
   ✅ Iteration 3: 15.64ms
   ✅ Iteration 4: 15.91ms
   ✅ Iteration 5: 15.29ms
📊 Average time: 15.71ms (5/5 successful)

=== LOGARITHMIC BIN OPERATIONS ===
🧪 Testing: Bin metrics.size with span=log10
📝 Query: source=big5 | bin `metrics.size` span=log10 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.32ms
   ✅ Iteration 2: 15.45ms
   ✅ Iteration 3: 15.53ms
   ✅ Iteration 4: 15.61ms
   ✅ Iteration 5: 15.67ms
📊 Average time: 15.51ms (5/5 successful)

🧪 Testing: Bin metrics.size with span=2log10
📝 Query: source=big5 | bin `metrics.size` span=2log10 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.78ms
   ✅ Iteration 2: 15.70ms
   ✅ Iteration 3: 15.28ms
   ✅ Iteration 4: 15.17ms
   ✅ Iteration 5: 14.89ms
📊 Average time: 15.36ms (5/5 successful)

🧪 Testing: Bin metrics.size with span=log2
📝 Query: source=big5 | bin `metrics.size` span=log2 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 14.82ms
   ✅ Iteration 2: 14.94ms
   ✅ Iteration 3: 14.95ms
   ✅ Iteration 4: 15.29ms
   ✅ Iteration 5: 15.16ms
📊 Average time: 15.03ms (5/5 successful)

🧪 Testing: Bin metrics.size with span=1.5log10
📝 Query: source=big5 | bin `metrics.size` span=1.5log10 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.79ms
   ✅ Iteration 2: 16.58ms
   ✅ Iteration 3: 15.73ms
   ✅ Iteration 4: 15.76ms
   ✅ Iteration 5: 15.45ms
📊 Average time: 15.86ms (5/5 successful)

🧪 Testing: Bin metrics.size with span=1.5log3
📝 Query: source=big5 | bin `metrics.size` span=1.5log3 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.14ms
   ✅ Iteration 2: 14.84ms
   ✅ Iteration 3: 15.77ms
   ✅ Iteration 4: 15.83ms
   ✅ Iteration 5: 15.50ms
📊 Average time: 15.41ms (5/5 successful)

=== TIME UNIT BIN OPERATIONS ===
🧪 Testing: Bin @timestamp with span=30seconds
📝 Query: source=big5 | bin `@timestamp` span=30seconds | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 24.36ms
   ✅ Iteration 2: 23.53ms
   ✅ Iteration 3: 23.82ms
   ✅ Iteration 4: 24.15ms
   ✅ Iteration 5: 23.46ms
📊 Average time: 23.86ms (5/5 successful)

🧪 Testing: Bin @timestamp with span=45minute
📝 Query: source=big5 | bin `@timestamp` span=45minute | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 23.36ms
   ✅ Iteration 2: 23.18ms
   ✅ Iteration 3: 23.69ms
   ✅ Iteration 4: 23.13ms
   ✅ Iteration 5: 22.86ms
📊 Average time: 23.24ms (5/5 successful)

🧪 Testing: Bin @timestamp with span=7day
📝 Query: source=big5 | bin `@timestamp` span=7day | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 24.87ms
   ✅ Iteration 2: 25.26ms
   ✅ Iteration 3: 25.58ms
   ✅ Iteration 4: 25.08ms
   ✅ Iteration 5: 25.05ms
📊 Average time: 25.16ms (5/5 successful)

🧪 Testing: Bin @timestamp with span=6day
📝 Query: source=big5 | bin `@timestamp` span=6day | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 25.13ms
   ✅ Iteration 2: 25.52ms
   ✅ Iteration 3: 25.56ms
   ✅ Iteration 4: 25.14ms
   ✅ Iteration 5: 24.87ms
📊 Average time: 25.24ms (5/5 successful)

=== ALIGNTIME BIN OPERATIONS ===
🧪 Testing: Bin @timestamp span=12h aligntime='@d+3h'
📝 Query: source=big5 | bin `@timestamp` span=12h aligntime=\"@d+3h\" | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 24.94ms
   ✅ Iteration 2: 24.10ms
   ✅ Iteration 3: 24.37ms
   ✅ Iteration 4: 24.85ms
   ✅ Iteration 5: 24.13ms
📊 Average time: 24.47ms (5/5 successful)

🧪 Testing: Bin @timestamp span=12h aligntime='@d-1h'
📝 Query: source=big5 | bin `@timestamp` span=12h aligntime=\"@d-1h\" | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 24.46ms
   ✅ Iteration 2: 23.95ms
   ✅ Iteration 3: 23.91ms
   ✅ Iteration 4: 23.81ms
   ✅ Iteration 5: 23.44ms
📊 Average time: 23.91ms (5/5 successful)

🧪 Testing: Bin @timestamp span=12h aligntime=1500000000
📝 Query: source=big5 | bin `@timestamp` span=12h aligntime=1500000000 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 22.98ms
   ✅ Iteration 2: 23.58ms
   ✅ Iteration 3: 23.66ms
   ✅ Iteration 4: 23.16ms
   ✅ Iteration 5: 22.82ms
📊 Average time: 23.24ms (5/5 successful)

=== LOAD TESTING WITH DIFFERENT RESULT SIZES ===
🧪 Testing: Large result set (100 rows)
📝 Query: source=big5 | bin `@timestamp` span=1h | head 100
🔁 Running 3 iterations...
   ✅ Iteration 1: 29.98ms
   ✅ Iteration 2: 30.36ms
   ✅ Iteration 3: 30.07ms
📊 Average time: 30.13ms (3/3 successful)

🧪 Testing: Larger result set (500 rows)
📝 Query: source=big5 | bin `@timestamp` span=1h | head 500
🔁 Running 3 iterations...
   ✅ Iteration 1: 61.30ms
   ✅ Iteration 2: 61.22ms
   ✅ Iteration 3: 61.36ms
📊 Average time: 61.29ms (3/3 successful)

🧪 Testing: Very large result set (1000 rows)
📝 Query: source=big5 | bin `@timestamp` span=1h | head 1000
🔁 Running 2 iterations...
   ✅ Iteration 1: 100.67ms
   ✅ Iteration 2: 100.38ms
📊 Average time: 100.52ms (2/2 successful)

=== PERFORMANCE COMPARISON ===
🧪 Testing: Small numeric span (10)
📝 Query: source=big5 | bin `metrics.size` span=10 | head 50
🔁 Running 3 iterations...
   ✅ Iteration 1: 17.92ms
   ✅ Iteration 2: 18.03ms
   ✅ Iteration 3: 17.70ms
📊 Average time: 17.88ms (3/3 successful)

🧪 Testing: Medium numeric span (100)
📝 Query: source=big5 | bin `metrics.size` span=100 | head 50
🔁 Running 3 iterations...
   ✅ Iteration 1: 17.70ms
   ✅ Iteration 2: 17.42ms
   ✅ Iteration 3: 17.44ms
📊 Average time: 17.52ms (3/3 successful)

🧪 Testing: Large numeric span (1000)
📝 Query: source=big5 | bin `metrics.size` span=1000 | head 50
🔁 Running 3 iterations...
   ✅ Iteration 1: 17.35ms
   ✅ Iteration 2: 17.35ms
   ✅ Iteration 3: 17.70ms
📊 Average time: 17.46ms (3/3 successful)

=== BASELINE COMPARISON ===
🧪 Testing: Baseline (no bin)
📝 Query: source=big5 | head 100
🔁 Running 5 iterations...
   ✅ Iteration 1: 21.36ms
   ✅ Iteration 2: 21.43ms
   ✅ Iteration 3: 21.73ms
   ✅ Iteration 4: 21.64ms
   ✅ Iteration 5: 21.57ms
📊 Average time: 21.54ms (5/5 successful)

🧪 Testing: Baseline with fields
📝 Query: source=big5 | fields `@timestamp`, `metrics.size` | head 100
🔁 Running 5 iterations...
   ✅ Iteration 1: 16.26ms
   ✅ Iteration 2: 16.03ms
   ✅ Iteration 3: 16.16ms
   ✅ Iteration 4: 16.06ms
   ✅ Iteration 5: 15.58ms
📊 Average time: 16.01ms (5/5 successful)

=== CONCURRENT LOAD TEST ===
🚀 Running concurrent load test (3 parallel requests)...
Temp directory: /tmp/tmp.w737guGOZ0
Starting concurrent request 1 (timestamp bin)...
Starting concurrent request 2 (numeric bin)...
Starting concurrent request 3 (bins parameter)...
Concurrent test results:
Request 1:
📊 Average time: 30.66ms (1/1 successful)

Request 2:
📊 Average time: 21.38ms (1/1 successful)

Request 3:
📊 Average time: 18.80ms (1/1 successful)


=== SIMPLE BIN VARIATIONS ===
🧪 Testing: Bin with alias
📝 Query: source=big5 | bin `metrics.size` span=100 as size_bucket | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 22.80ms
   ✅ Iteration 2: 22.48ms
   ✅ Iteration 3: 22.42ms
   ✅ Iteration 4: 22.22ms
   ✅ Iteration 5: 22.45ms
📊 Average time: 22.47ms (5/5 successful)

🧪 Testing: Multiple field bin (sequential)
📝 Query: source=big5 | bin `metrics.size` span=100 | bin `metrics.tmin` span=50 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 14.90ms
   ✅ Iteration 2: 15.20ms
   ✅ Iteration 3: 15.24ms
   ✅ Iteration 4: 15.03ms
   ✅ Iteration 5: 15.35ms
📊 Average time: 15.14ms (5/5 successful)

🧪 Testing: Bin with very small span
📝 Query: source=big5 | bin `metrics.size` span=1 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.20ms
   ✅ Iteration 2: 14.68ms
   ✅ Iteration 3: 14.91ms
   ✅ Iteration 4: 14.89ms
   ✅ Iteration 5: 14.66ms
📊 Average time: 14.86ms (5/5 successful)

🧪 Testing: Bin with very large span
📝 Query: source=big5 | bin `metrics.size` span=100000 | head 10
🔁 Running 5 iterations...
   ✅ Iteration 1: 15.09ms
   ✅ Iteration 2: 14.81ms
   ✅ Iteration 3: 15.05ms
   ✅ Iteration 4: 14.78ms
   ✅ Iteration 5: 14.66ms
📊 Average time: 14.87ms (5/5 successful)


=== SUMMARY ===
✅ Simplified bin command performance testing complete
💡 Key metrics to analyze:
   - Pure bin operation performance without aggregation overhead
   - Performance difference between numeric and timestamp binning
   - Impact of different span sizes and bin parameters
   - Overhead of bin operations vs baseline queries
   - Concurrent request handling capability

🎯 Performance optimization targets:
   - Focus on bin algorithm efficiency
   - Memory usage for binning operations
   - CPU usage during logarithmic span calculations
   - Timestamp parsing and binning performance

============================================
🏁 SIMPLIFIED BIN PERFORMANCE TEST COMPLETE
============================================

@yuancu
Copy link
Collaborator

yuancu commented Jul 21, 2025

I haven't read through the implementations yet, but an initial attempt with the bins option returns an confusing result:

I tested it on the accounts index, where the max age is 40, and the minimum age is 20. The query source=opensearch-sql_test_index_account | bin age bins=4 gives the following logical plan:

LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], age_bin=[+(*(FLOOR(DIVIDE(-($8, 0.0E0:DOUBLE), /(1000.0E0:DOUBLE, 4))), /(1000.0E0:DOUBLE, 4)), 0.0E0)])
  CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_account]])

From the plan, the new age_bin field is calculated as FLOOR($8 / (1000.0 / 4)) * (1000.0 / 4). I am confused where is the 1000 from. According to my understanding, it should be like:

max = max(age)  <-- 40
min = min(age)  <-- 20
span = (max - min) / bins  <-- 5
age_bin = floor((age - min) / span) * span + min
        = floor((age - 20) / 5) * 5 + 20

I guess you did not calculate the max and min, but directly using 1000 and 0?

@yuancu
Copy link
Collaborator

yuancu commented Jul 21, 2025

The PR description is very solid and in detail. It would be better if you could create an issue or link it to an existing issue for this PR.

@yuancu
Copy link
Collaborator

yuancu commented Jul 21, 2025

The start and end options does not work as expected as well. For example, in the result of source=opensearch-sql_test_index_account | bin age span=5 start=30 end=40, age 28 is binned to 25; age 23 is binned to 20.

But the start is 30, there shouldn't be 25 and 20 in the result. The logical plan it gives contains no start and end information either:

LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], age_bin=[*(FLOOR(/($8, 5)), 5)])
  CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_account]])

By the way, for values below the start, should they be converted to the start value or be kept as is?

@yuancu
Copy link
Collaborator

yuancu commented Jul 21, 2025

Could you also add tests in ExplainIT (or CalciteExplainIT if the command is only available since 3.1.0) to validate the generated logical and physical plan?

@ahkcs
Copy link
Contributor Author

ahkcs commented Jul 29, 2025

The start and end options does not work as expected as well. For example, in the result of source=opensearch-sql_test_index_account | bin age span=5 start=30 end=40, age 28 is binned to 25; age 23 is binned to 20.
But the start is 30, there shouldn't be 25 and 20 in the result. The logical plan it gives contains no start and end information either:

LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], age_bin=[*(FLOOR(/($8, 5)), 5)])
  CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_account]])

By the way, for values below the start, should they be converted to the start value or be kept as is?

Thanks for the reply, I have updated our implementation for start/end The start and end parameters don't filter data but instead expand the range used for bin width calculation - for example, bin age start=0 end=99 creates 3 bins (20-30, 30-40, 40-50) while bin age start=0 end=102 creates 1 bin (0-100) with all 1000 records, using "nice number" algorithm that chooses different bin widths based on the total range
I post more details in PR description

@ahkcs
Copy link
Contributor Author

ahkcs commented Jul 29, 2025

I haven't read through the implementations yet, but an initial attempt with the bins option returns an confusing result:

I tested it on the accounts index, where the max age is 40, and the minimum age is 20. The query source=opensearch-sql_test_index_account | bin age bins=4 gives the following logical plan:

LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], age_bin=[+(*(FLOOR(DIVIDE(-($8, 0.0E0:DOUBLE), /(1000.0E0:DOUBLE, 4))), /(1000.0E0:DOUBLE, 4)), 0.0E0)])
  CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_account]])

From the plan, the new age_bin field is calculated as FLOOR($8 / (1000.0 / 4)) * (1000.0 / 4). I am confused where is the 1000 from. According to my understanding, it should be like:

max = max(age)  <-- 40
min = min(age)  <-- 20
span = (max - min) / bins  <-- 5
age_bin = floor((age - min) / span) * span + min
        = floor((age - 20) / 5) * 5 + 20

I guess you did not calculate the max and min, but directly using 1000 and 0?

I updated the implementation. Our current implementation now correctly calculates min(age)=20 and max(age)=40 dynamically, and importantly, now bins=4 doesn't create exactly 4 bins but rather creates at most 4 bins using "nice number" widths - so bins=4 for age 20-40 actually produces 3 bins (20-30, 30-40, 40-50) with width=10 instead of 4 bins with width=5, because now it prioritizes human-readable bin boundaries over exact bin counts. Also, we now modify the original age field in-place instead of creating a separate age_bin field

@yuancu
Copy link
Collaborator

yuancu commented Jul 31, 2025

importantly, now bins=4 doesn't create exactly 4 bins but rather creates at most 4 bins using "nice number" widths - so bins=4 for age 20-40 actually produces 3 bins (20-30, 30-40, 40-50) with width=10 instead of 4 bins with width=5, because now it prioritizes human-readable bin boundaries over exact bin counts.

Could you please give a precise definition of nice number in the documentation?

source=opensearch-sql_test_index_account | bin age bins=1000 as age_cate | fields age_cate, age also results in categories of $$[20, 30), [30, 40), [40, 50)$$. The choice of a minimum interval of 10 seems lacking of explanation?

loadIndex(Index.BANK);
loadIndex(Index.DATE_FORMATS);
loadIndex(Index.WEBLOG);
loadIndex(Index.TIME_TEST_DATA);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's not necessary to load this in ExplainIT since you did not use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to CalciteExplainIT


protected final UnresolvedExpression field;

@Nullable protected final String alias;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okey but I don't suggest to use @Nullable annotation. Instead, I suggest to use Optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use Optional

Copy link
Collaborator

@yuancu yuancu Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LantaoJin Using optional as a field type will cause a warning, e.g. 'Optional<String>' used as type for field 'name' . It seems that using Optional type as field / parameter types is discouraged? Java
'Optional' used as field or parameter type - JetBrains

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yuancu. For now, there is no harmful to use Optional as type of field 'name' because the UnresolvedPlan itself is not serializable.

Comment on lines 486 to 490
if (node.getAlias() != null) {
handleBinWithAlias(context, binExpression, node.getAlias());
} else {
handleBinWithoutAlias(context, binExpression, fieldName);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem but what a Claude-style refactor! These are so straightforward codes, not need to separate to small methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 403 to 501
/**
* Handles bin operation with alias by adding the binned field as a new column.
*
* @param context the Calcite plan context
* @param binExpression the bin expression to be aliased
* @param aliasName the alias name for the new column
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's don't put comments especially for private method when the method name + parameter names explains enough. (It would just become a noise when method name is explicit enough)

+1

Comment on lines 403 to 501
/**
* Handles bin operation with alias by adding the binned field as a new column.
*
* @param context the Calcite plan context
* @param binExpression the bin expression to be aliased
* @param aliasName the alias name for the new column
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the Javadoc, at least the L498~L500, there are noise code IMO.

ahkcs added 2 commits August 24, 2025 21:44
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs requested review from LantaoJin and yuancu August 25, 2025 05:10
Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs requested a review from yuancu August 25, 2025 06:08
@LantaoJin LantaoJin requested a review from qianheng-aws August 25, 2025 07:37
@LantaoJin
Copy link
Member

@qianheng-aws please take another look.

Copy link
Collaborator

@yuancu yuancu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the size and complexity of this PR, it's challenging for me to thoroughly review all the changes and verify every logical detail. I'll provide a preliminary approval so my change request doesn't block progress, though please count this as a half an approval only.

I recommend obtaining approval from all contributors who have been heavily involved in reviewing this PR before merging. @ykmr1224 @qianheng-aws @LantaoJin @Swiddis

@Swiddis Swiddis merged commit 71076f7 into opensearch-project:main Aug 25, 2025
23 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-3878-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 71076f79cf6f7d365408766d5de77b125e57c9b5
# Push it to GitHub
git push --set-upstream origin backport/backport-3878-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-3878-to-2.19-dev.

@ahkcs
Copy link
Contributor Author

ahkcs commented Aug 25, 2025

Given the size and complexity of this PR, it's challenging for me to thoroughly review all the changes and verify every logical detail. I'll provide a preliminary approval so my change request doesn't block progress, though please count this as a half an approval only.

I recommend obtaining approval from all contributors who have been heavily involved in reviewing this PR before merging. @ykmr1224 @qianheng-aws @LantaoJin @Swiddis

Thanks for the review and the preliminary approval. After discussing with the team, we’ve decided to go ahead and merge this PR first, and then address any additional concerns in follow-up issues.

@ykmr1224
Copy link
Collaborator

@ahkcs
Seems backport failed for this PR.
And it is causing backport failure for #4100
Can you fix the backport issue?

ahkcs added a commit to ahkcs/sql that referenced this pull request Aug 29, 2025
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <105710027+ahkcs@users.noreply.github.com>
(cherry picked from commit 71076f7)
ahkcs added a commit to ahkcs/sql that referenced this pull request Aug 29, 2025
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <105710027+ahkcs@users.noreply.github.com>
(cherry picked from commit 71076f7)
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Aug 29, 2025
Swiddis pushed a commit that referenced this pull request Aug 29, 2025
…4171)

* Support ```bin``` command with Calcite (#3878)

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <105710027+ahkcs@users.noreply.github.com>
(cherry picked from commit 71076f7)

* Fix Java version incompatibility

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* remove unrelated tests

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <105710027+ahkcs@users.noreply.github.com>
@ahkcs ahkcs mentioned this pull request Oct 8, 2025
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Implement bin command in PPL

9 participants