@Maxxen
Member

@Maxxen Maxxen commented Apr 15, 2025

This PR updates the workflow and pinned DuckDB version in the v1.2.2 branch to DuckDB v1.2.2. It also adds a new physical operator, the SPATIAL_JOIN.

Spatial Joins

Executing spatial joins in DuckDB did not scale well, as the only method available for the query engine to execute them was through the "blockwise nested-loop join" operator. The blockwise-nl-join operator uses the simplest possible strategy to join two datasets: it compares each item on the left side with each item on the right side, evaluating the join condition for each pair. This means that joining a dataset of size n with another of size m requires m*n comparisons. In spatial, where the join conditions tend to be pretty costly to evaluate (e.g. polygon-polygon intersection), the effect is that a lot of spatial join queries just hit the wall as the dataset sizes grow, despite DuckDB generally being pretty good at brute-forcing through its parallelized vectorized execution.

Enter the SPATIAL_JOIN operator!
Compared to the BLOCKWISE_NL_JOIN, the spatial join operator creates a temporary spatial index on-the-fly on the "build side" of the join, which it then uses to quickly select only those rows whose bounding boxes intersect with the probe-side value and thus "might" be a match. Bounding box intersection checks are comparatively cheap, so this is a lot more efficient than having to evaluate the actual join predicate on every pairing.
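To make the filter-then-refine idea concrete, here is a minimal Python sketch, not the operator's actual implementation. It assumes a uniform grid as the temporary index (this description does not say which index structure the operator builds), uses exact circles as a stand-in for the buffered points, and counts how often the "expensive" predicate runs. All function names are hypothetical.

```python
import math
from collections import defaultdict


def bbox_of_circle(cx, cy, r):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a circle."""
    return (cx - r, cy - r, cx + r, cy + r)


def point_in_circle(px, py, cx, cy, r):
    """Exact predicate; stands in for a costly check such as ST_Intersects."""
    return (px - cx) ** 2 + (py - cy) ** 2 <= r * r


def nested_loop_join(points, circles):
    """Evaluate the exact predicate on every (point, circle) pair."""
    matches, evaluations = [], 0
    for p in points:
        for c in circles:
            evaluations += 1
            if point_in_circle(*p, *c):
                matches.append((p, c))
    return matches, evaluations


def indexed_join(points, circles, cell=10.0):
    """Build a grid index on the circles, then probe it with each point."""
    # Build side: register each circle in every grid cell its bbox overlaps.
    grid = defaultdict(list)
    for c in circles:
        xmin, ymin, xmax, ymax = bbox_of_circle(*c)
        for gx in range(math.floor(xmin / cell), math.floor(xmax / cell) + 1):
            for gy in range(math.floor(ymin / cell), math.floor(ymax / cell) + 1):
                grid[(gx, gy)].append(c)

    # Probe side: only candidates passing the cheap bbox test are refined.
    matches, evaluations = [], 0
    for p in points:
        key = (math.floor(p[0] / cell), math.floor(p[1] / cell))
        for c in grid.get(key, ()):
            xmin, ymin, xmax, ymax = bbox_of_circle(*c)
            if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax:
                evaluations += 1
                if point_in_circle(*p, *c):
                    matches.append((p, c))
    return matches, evaluations


# Small illustrative dataset (assumed, much smaller than the benchmark below).
points = [(float(x), float(y)) for x in range(0, 101, 5) for y in range(0, 101, 5)]
circles = [(float(x), float(y), 5.0) for x in range(0, 51, 10) for y in range(0, 51, 10)]

nl_matches, nl_evals = nested_loop_join(points, circles)
ix_matches, ix_evals = indexed_join(points, circles)
print("exact predicate evaluations:", nl_evals, "vs", ix_evals)
```

Both functions return the same matches; the indexed variant simply skips the exact predicate for pairs whose bounding boxes cannot intersect.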

The spatial join operator is instantiated any time there is an INNER/LEFT/RIGHT/OUTER join using one of the following predicates as the join condition:

  • ST_Equals
  • ST_Intersects
  • ST_Touches
  • ST_Crosses
  • ST_Within
  • ST_Contains
  • ST_Overlaps
  • ST_Covers
  • ST_CoveredBy
  • ST_ContainsProperly

Limitations:

There are currently some limitations, but I'm very much interested in working on resolving these in the future.

  • The spatial join operator only supports a single join condition for now
  • SEMI/ANTI joins are not yet supported either
  • The "build side" of the join needs to be able to fit in memory. However, all memory used to construct the temporary index is tracked by DuckDB, and should respect any set memory limit.
    • This might sound limiting, but as it stands, executing spatial joins before this PR tended to hit the combinatorial explosion performance wall way before the build side grew large enough to cause memory issues, so this initial spatial join operator will still allow you to execute larger joins that you probably couldn't before.

Example

Here's a quick example/benchmark:

CREATE TABLE lhs AS
SELECT
    ST_Point(x, y) as geom,
    (y * 50) + x // 10 as id
FROM
    generate_series(0, 1000, 5) r1(x),
    generate_series(0, 1000, 5) r2(y);

-- 40401 rows

CREATE TABLE rhs AS
SELECT
    ST_Buffer(ST_Point(x, y), 5) as geom,
    (y * 50) + x // 10 as id
FROM
    generate_series(0, 500, 10) r1(x),
    generate_series(0, 500, 10) r2(y);

-- 2601 rows

EXPLAIN SELECT * FROM rhs JOIN lhs ON st_intersects(lhs.geom, rhs.geom);
----
┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            geom           │
│             id            │
│            geom           │
│             id            │
│                           │
│        ~40401 Rows        │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        SPATIAL_JOIN       │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        ├──────────────┐
│ ST_Intersects(geom, geom) │              │
│                           │              │
│        ~40401 Rows        │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│         Table: lhs        ││         Table: rhs        │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           ││                           │
│        Projections:       ││        Projections:       │
│            geom           ││            geom           │
│             id            ││             id            │
│                           ││                           │
│        ~40401 Rows        ││         ~2601 Rows        │
└───────────────────────────┘└───────────────────────────┘
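If you want to check programmatically which join operator a query was planned with, one option is to search the EXPLAIN text for the operator name. The helper below is a hypothetical sketch: it assumes you have already captured the plan as a string (how you obtain it depends on your client) and only relies on the two operator names mentioned in this PR.

```python
def join_operator_in_plan(plan_text):
    """Return the first known join operator name found in the plan text, else None."""
    # Operator names as they appear in this PR's description and EXPLAIN output.
    for name in ("SPATIAL_JOIN", "BLOCKWISE_NL_JOIN"):
        if name in plan_text:
            return name
    return None


# Example with a fragment of the plan shown above.
sample_plan = """
│        SPATIAL_JOIN       │
│      Join Type: INNER     │
"""
print(join_operator_in_plan(sample_plan))
```

A plain substring match is deliberately simple; it would also match a table or column that happens to contain the same text, so treat it as a quick diagnostic rather than a robust parser.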

Executing this in the CLI with .timer on gives me the following on my M3 MacBook Pro:

Run Time (s): real 0.150 user 0.030077 sys 0.001769

If we now disable the extension-provided optimizer passes with pragma disabled_optimizers = 'extension'; and rerun, we get:

Run Time (s): real 10.470 user 20.675175 sys 0.011965

Now, this is a very contrived example because the geometries are very uniform and half of them trivially match, but it still illustrates the performance improvements.
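As a quick sanity check of the numbers in this example, the arithmetic can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: it relies on generate_series being inclusive of its upper bound, and it compares the two wall-clock ("real") times reported above from a single run each.

```python
# generate_series(0, 1000, 5) is inclusive, so it yields 201 values per axis.
lhs_rows = len(range(0, 1001, 5)) ** 2   # points table
rhs_rows = len(range(0, 501, 10)) ** 2   # buffered-points table

# Pairs a nested-loop join has to evaluate the predicate on.
pairs = lhs_rows * rhs_rows

# Ratio of the reported wall-clock times (nested loop / spatial join).
speedup = 10.470 / 0.150

print(lhs_rows, rhs_rows, pairs, round(speedup, 1))
```

The row counts agree with the 40401 and 2601 noted in the comments above, and the nested-loop plan has to consider roughly 105 million pairs.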

Other changes

  • ST_Extent_Agg is now precise instead of relying on cached bounds
  • Fixes a bug causing ST_Translate not to work properly for xy geometries.

@Maxxen Maxxen requested a review from Copilot April 15, 2025 17:58

Copilot AI left a comment

Copilot reviewed 25 out of 29 changed files in this pull request and generated 1 comment.

Files not reviewed (4)
  • Makefile: Language not supported
  • duckdb: Language not supported
  • src/spatial/CMakeLists.txt: Language not supported
  • src/spatial/operators/CMakeLists.txt: Language not supported
Comments suppressed due to low confidence (11)

src/spatial/spatial_optimizers.cpp:1

  • The removal of spatial_optimizers.cpp is significant; ensure that the optimization logic it contained is fully integrated into the new SpatialJoinOptimizer.
-#include "spatial/spatial_types.hpp"

src/spatial/spatial_extension.cpp:16

  • [nitpick] The include directive was updated to reference the new spatial join optimizer; please verify that all dependent modules have been updated accordingly.
#include "spatial/operators/spatial_join_optimizer.hpp"

src/spatial/modules/main/spatial_functions_scalar.cpp:2664

  • Changing the bounding box type from double to float in ST_Extent_Approx may reduce precision; please verify that this trade-off is acceptable.
Box2D<float> bbox;

src/spatial/modules/main/spatial_functions_scalar.cpp:5196

  • Ensure that sgl::util::hilbert_encode provides equivalent results and performance compared to the previous local HilbertEncode implementation.
const auto h = sgl::util::hilbert_encode(16, hilbert_x, hilbert_y);

src/spatial/modules/main/spatial_functions_aggregate.cpp:145

  • Switching the aggregate function input type from geometry_t to string_t should be validated to ensure compatibility with downstream spatial functions.
const auto agg = AggregateFunction::UnaryAggregate<ExtentAggState, string_t, string_t, ExtentAggFunction>(

src/spatial/modules/geos/geos_module.cpp:286

  • [nitpick] The stray semicolon was removed from ST_Buffer; this clean-up improves readability and reduces potential confusion.
return lstate.Serialize(result, buffer);

src/spatial/index/rtree/rtree_index_plan_scan.cpp:84

  • Verify that extracting the bounding box directly as a float without explicit MathUtil conversion maintains the expected precision for index scans.
static bool TryGetBoundingBox(const Value &value, Box2D<float> &bbox) {

src/spatial/index/rtree/rtree_index.cpp:170

  • Transitioning from double to float in bounding box extraction should be reviewed; ensure any precision loss does not affect the spatial index accuracy.
if (!geom_data[i].TryGetCachedBounds(bbox)) {

src/spatial/geometry/geometry_type.hpp:102

  • The conversion from double to float in TryGetCachedBounds may introduce precision loss; please confirm that this trade-off is acceptable for spatial operations.
bool TryGetCachedBounds(Box2D<float> &bbox) const {

src/sgl/sgl.hpp:1040

  • Changing the default values for 'old_vertex' from {0, 0, 0, 0} to {0, 0, 1, 1} in affine_transform may affect transformation results; please verify that these new defaults are intended.
vertex_xyzm old_vertex = {0, 0, 1, 1};

.github/workflows/StableDistributionPipeline.yml:14

  • CI workflow configuration was updated to use DuckDB v1.2.2; ensure all pipeline steps and branch triggers are compatible with the new version.
- v1.2.2

@Maxxen Maxxen merged commit 26947a0 into duckdb:v1.2.2 Apr 16, 2025
23 checks passed
@aborruso
Contributor

Hi @Maxxen I have duckdb 1.2.2 and duckdb spatial 3bb37f8.

How to use this great new feature?

Thank you

@brpy
brpy commented May 22, 2025

Great feature @Maxxen, much awaited!

This significantly improves our workloads. Some of our workloads which were not possible earlier are now getting executed in just a few minutes.

Thanks a lot!

martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Jul 7, 2025
- Remove invalid WHERE clauses with aggregates that caused SQL errors
- Simplify all spatial joins to use single ST_Intersects conditions
- Enable SPATIAL_JOIN operator for massive performance improvements
- Remove area threshold filters from spatial joins
- Update geometry validator to better detect SPATIAL_JOIN usage
- Fixes 'WHERE clause cannot contain aggregates' error in wetlands analysis

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Jul 10, 2025
…or BBR buildings

Major performance improvements:
- Add perform_spatial_join_optimized() function leveraging SPATIAL_JOIN operator
- Create temporary spatial index on-the-fly for massive performance gains
- Use bounding box intersection for fast filtering before expensive spatial ops
- Structure queries to trigger SPATIAL_JOIN operator with proper JOIN syntax
- Add smart fallback system to chunked processing if needed
- Integrate all spatial optimizations from field analysis pipeline:
  - ST_Dump() for complex geometries
  - Minimum area filtering (1m² geometry, 5m² building)
  - Proper spatial functions (ST_Area_Spheroid, ST_IsValid)
  - Configurable memory management and batch processing
- Add configuration for SPATIAL_JOIN operator settings
- Create demonstration script showing SPATIAL_JOIN usage patterns
- Update GitHub Actions workflow to use optimized processing
- Add monitoring to detect when SPATIAL_JOIN operator is used

References: duckdb/duckdb-spatial#545

This positions the pipeline for future spatial operations like:
- Buildings × Agricultural fields
- Buildings × Parcels
- Buildings × Administrative boundaries
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Jul 12, 2025
…l wetland coverage

MAJOR REFACTORING (5 stages → 4 stages):
- Stage 1: Foundation data creation (water projects × BNBO/wetlands, fields × properties)
- Stage 2: Environmental coverage analysis using foundation data (NEW optimized)
- Stage 3: Property-environmental spatial analysis (ENHANCED)
- Stage 4: Consolidation with comprehensive property-environmental relationships

STAGE 2 OPTIMIZATION (NEW):
- Rewrote Stage 2A/2B to use foundation data from Stage 1A/1B
- Implemented single spatial predicate joins (ST_Intersects only)
- DuckDB Spatial PR #545 compliance for SPATIAL_JOIN operator
- Batched processing with memory management for GitHub Actions

STAGE 3 ENHANCEMENT:
- Enhanced Stage 3B with property-level wetland water project coverage analysis
- New fields: property_wetland_covered_by_water_m2, property_wetland_not_covered_by_water_m2
- Property owner lists for covered and uncovered wetlands
- Answers: 'How much wetlands covered by water projects is owned by property X?'

STAGE 4 UPDATES:
- Updated consolidation to include new property-level metrics
- Comprehensive statistics and logging for monitoring

PERFORMANCE OPTIMIZATIONS:
- Foundation data approach avoids expensive recalculations
- Single spatial predicates trigger SPATIAL_JOIN operator
- Batched processing with configurable memory cleanup
- 5x faster GCS operations via optimized data access

CONFIGURATION:
- Updated CLI and config for new 4-stage architecture
- Fixed missing imports and artifact paths
- Proper stage dependencies and data flow

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Jul 12, 2025
… wetland IDs

MAJOR OPTIMIZATION: Use existing foundation data instead of recalculating spatial intersections

Stage 1B (Wetlands):
- Add unique wetland_id to foundation data for efficient joins
- Update intersection table schema to include wetland_id
- Enable Stage 3B to use ID-based joins instead of spatial recalculation

Stage 3A (BNBO) & Stage 3B (Wetlands):
- Replace complex 3-way spatial analysis with efficient foundation data usage
- Use existing water_projects_bnbo_intersections (Stage 1A) and water_projects_wetlands_intersections (Stage 1B)
- Match property-environmental intersections with existing water coverage by ID
- Calculate accurate property-level coverage without recalculating spatial intersections

PERFORMANCE IMPROVEMENTS:
- Eliminate duplicate spatial calculations between stages
- Use efficient ID-based joins (bnbo_id, wetland_id) instead of expensive spatial operations
- Maintain DuckDB Spatial PR #545 compliance with single spatial predicates
- Preserve accurate property-level coverage calculations within each field

TECHNICAL FIXES:
- Add proper wetland ID generation in Stage 1B batch processing
- Update Stage 3A to use existing BNBO-water intersection data
- Simplify Stage 3B to leverage existing wetland-water intersection data
- Remove complex 3-way spatial analysis that duplicated Stage 1 work

RESULT:
- Faster pipeline execution through foundation data reuse
- Accurate property-level environmental coverage analysis
- No loss of spatial accuracy - still proper field-level percentages
- Maintains all analytical capabilities while eliminating redundant calculations

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Jul 13, 2025
… operator

- Add Stage 0 soil_types pre-filter (13K → 8K polygons, 40% reduction)
- Implement ST_Dump + UNNEST for optimal multipolygon decomposition
- Use single spatial predicate (ST_Intersects only) to trigger SPATIAL_JOIN operator
- Remove redundant geometry storage and expensive duplicate calculations
- Add foundation data output with soil_id for efficient downstream joins
- Update multi-stage YAML workflow with soil_types pre-filter job and dependencies
- Ensure Stage 4 compatibility with optimized output format

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Jul 15, 2025
… SPATIAL_JOIN operator

- Fix ST_GeomFromText error by detecting geometry column types (VARCHAR vs GEOMETRY)
- Optimize spatial join for DuckDB Spatial PR #545 SPATIAL_JOIN operator compliance
- Create pre-filtered tables to remove NULL geometries before spatial join
- Use single ST_Intersects predicate in JOIN condition for optimal performance
- Add SPATIAL_JOIN operator detection and verification
- Clean up temporary filtered tables to prevent accumulation
- Update logging to track spatial join success rates

Fixes: ST_GeomFromText requires a string argument error in 2008 spatial join
References: duckdb/duckdb-spatial#545
aleksanderbl29 added a commit to aleksanderbl29/landbruget.dk that referenced this pull request Jul 24, 2025
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 3, 2025
…TIAL_JOIN optimization

- Add _enrich_marker_with_organic_data() method to FVMWFSSilver class
- Perform spatial matching between marker fields and organic areas using ST_Contains(marker.geometry, ST_Centroid(organic.geometry))
- Add 4 new columns to marker fields: is_organic, organic_conversion_date, organic_deregistration_date, organic_conversion_status
- Process all overlapping years (2012-2024) with ~92% organic field coverage based on analysis
- Optimized for DuckDB Spatial PR #545 SPATIAL_JOIN operator compliance:
  * Single spatial predicate in JOIN condition to trigger SPATIAL_JOIN operator
  * Pre-filter NULL geometries for optimal spatial indexing performance
  * Move non-spatial conditions (field_id match) to WHERE clause
  * Add SPATIAL_JOIN operator detection and verification logging
- Add organic-enrichment job to GitHub Actions workflow that runs after matrix processing
- Enriched marker data overwrites original fvm_marker_{year} datasets in GCS
- Comprehensive error handling and logging for monitoring enrichment progress

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 4, 2025
…ckDB Spatial PR #545)

MAJOR PERFORMANCE FIX: Replace field-level Cartesian products with direct spatial joins

ISSUE IDENTIFIED:
- Stage 3 was creating every property × every wetland in same field (24x explosion: 1.08M → 26.33M records)
- Fields with high wetland fragmentation (avg 16.1 fragments, max 2,844) caused massive performance issues
- One field: 2,844 wetlands × 2 properties = 5,688x explosion per field

SOLUTION IMPLEMENTED:
- Replace ID-based JOIN + spatial filtering with direct spatial JOIN
- Use ST_Intersects in JOIN ON clause to trigger SPATIAL_JOIN operator
- Only create records where geometries actually intersect spatially
- Move non-spatial conditions (NULL checks) to WHERE clause

WETLAND PROCESSING (final_wetland.py):
- Before: p JOIN fw ON field_uuid = field_uuid (Cartesian within field)
- After: p JOIN fw ON field_uuid = field_uuid AND ST_Intersects(p.geom, fw.geom)

BNBO PROCESSING (final_bnbo.py):
- Before: p JOIN fb ON field_uuid = field_uuid (Cartesian within field)
- After: p JOIN fb ON field_uuid = field_uuid AND ST_Intersects(p.geom, fb.geom)

DUCKDB SPATIAL PR #545 COMPLIANCE:
✅ Single spatial predicate (ST_Intersects only)
✅ Spatial condition in JOIN clause (triggers SPATIAL_JOIN operator)
✅ Non-spatial conditions in WHERE clause
✅ Eliminates blockwise nested-loop join performance issues

EXPECTED PERFORMANCE IMPROVEMENT:
- From 24x record explosion to only actual spatial intersections
- SPATIAL_JOIN operator with spatial indexing instead of brute-force comparisons
- Proper data volumes for downstream Stage 4 consolidation

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 7, 2025
… (PR #545)

- Replace expensive ST_Difference with efficient ST_Intersects in NOT EXISTS clause
- Use SPATIAL_JOIN-compatible operations for peat percentage overlap resolution
- Strategy: Exclude overlapping 6-12% areas entirely, keep >12% priority
- Simplify Stage 4 aggregation since dissolved wetlands already handle overlaps
- Remove complex peat percentage prioritization logic from gold layer
- Switch from ST_Union_Agg to simple SUM() for wetland area calculations
- Performance improvement: SPATIAL_JOIN operator triggers spatial indexing

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 7, 2025
…gregation pipeline

- Add proximity analysis directly to PesticideDisaggregationGold processor
- Create single comprehensive dataset with both disaggregation + proximity columns
- Add residential/educational building proximity (100m) with addresses and distances
- Add water feature proximity analysis with closest distance calculations
- Optimize for DuckDB Spatial SPATIAL_JOIN operator (PR #545) performance
- Reuse existing agricultural fields data (marker table) for efficiency
- Add 5 new proximity columns to disaggregated_pesticide_applications table
- Enable proximity analysis by default with configurable distance thresholds
- Graceful error handling when buildings/water datasets unavailable

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 8, 2025
…iance (PR #545)

PESTICIDE DISAGGREGATION:
- Revert back to single CTE query structure (more efficient than separate table approach)
- Keep ST_DWithin for proximity logic (simpler than ST_Intersects + ST_Buffer)
- Use batch_size=100 for memory management
- This provides the exact same functionality but with better batching

WATER PROJECTS WETLANDS:
- Add intelligent gold partition selection for wetland_key column compatibility
- Check schema of each partition to find one containing required wetland_key
- Ensures Stage 1 can find Stage 0 output with correct md5-based wetland_key
- Fallback mechanism for schema evolution between pipeline stages

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 9, 2025
- Replace ST_DWithin with ST_Intersects + ST_Buffer pattern for SPATIAL_JOIN optimization
- Implement memory-safe chunked processing (1000 fields per batch)
- Add comprehensive progress tracking and logging
- Performance improvement: 2,679 fields/second processing rate
- Estimated processing time: 1.4 minutes for 227k fields (vs previous crashes)
- Proper coordinate system handling: UTM Zone 32N (EPSG:25832) for accurate 100m distances
- Optimize all proximity analyses: residential buildings, educational facilities, water features

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 9, 2025
MAJOR PERFORMANCE FIX: Replace complex nested queries with clean step-by-step approach

ISSUE IDENTIFIED:
- Complex nested queries with CROSS JOIN LATERAL were extremely slow (6+ minutes per batch)
- Crazy nested subqueries prevented SPATIAL_JOIN operator from triggering
- ST_DWithin in WHERE clause caused fallback to blockwise nested-loop join

SOLUTION IMPLEMENTED:
- Break down into simple 3-step process per batch:
  1. Create small field batch table (500 fields)
  2. Create pre-filtered building table (residential/educational only, pre-transformed)
  3. Simple spatial join (ST_Intersects + ST_Buffer pattern)

SPATIAL_JOIN COMPLIANCE (PR #545):
✅ Simple table-to-table joins (no complex nesting)
✅ Single spatial predicate (ST_Intersects only)
✅ Pre-transformed geometries (avoid repeated ST_Transform)
✅ Clean query structure (like PR example)

FUNCTIONALITY PRESERVED:
✅ Full addresses with distances: 'Address:25.3m'
✅ Sorted by proximity (closest first)
✅ Same output format as before
✅ All three proximity types: residential, educational, water

Expected performance improvement: From 6+ minutes per batch to seconds

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 25, 2025
…atial SPATIAL_JOIN operator

MAJOR MEMORY FIX: Remove geometry text conversion that was consuming gigabytes
- Remove ST_AsText(f.geometry) as geometry_wkt from all queries
- Keep spatial joins for DST region matching but eliminate text conversion
- 687k complex field polygons as WKT strings were filling 5.5GB temp directory

SPATIAL_JOIN OPERATOR COMPLIANCE (PR #545):
✅ Replace ST_Within with ST_Intersects for better SPATIAL_JOIN triggering
✅ Clean table-to-table join structure (no complex nesting)
✅ Single spatial predicate in JOIN clause
✅ Use COALESCE for null handling instead of complex WHERE conditions

MEMORY OPTIMIZATIONS:
- Increase max_temp_directory_size from 6GB to 10GB
- Reduce spatial_join_batch_size from 100k to 50k records
- Add intermediate checkpoints between spatial join and production estimates
- Aggressive cleanup with VACUUM and garbage collection

Expected result: Eliminate 'Out of Memory Error: failed to offload data block'
during production estimates phase while maintaining accurate DST region matching.

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Aug 26, 2025
…atial joins

DECIMAL CONSTRAINT FIX:
- Explicitly define final_production_estimates table schema with DOUBLE for area_ha
- Prevents DuckDB from inferring restrictive DECIMAL(2,1) constraint during Parquet export
- Resolves 'Could not cast value 10.700000 to DECIMAL(2,1)' error

SPATIAL JOIN OPTIMIZATION (PR duckdb/duckdb-spatial#545 compliance):
- Remove complex subqueries from spatial joins to enable SPATIAL_JOIN operator
- Use clean table-to-table join structure: LEFT JOIN dst_zones z ON ST_Intersects(...)
- Reduce spatial_join_batch_size from 50,000 to 5,000 records for memory control
- Enable proper spatial indexing and bounding box optimization

MEMORY MANAGEMENT:
- Batch size reduction prevents memory explosion during spatial operations
- Explicit schema prevents type inference issues during export phase
- Maintains all functionality while fixing GitHub Actions memory constraints

References: duckdb/duckdb-spatial#545
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Sep 13, 2025
MEMORY FIX: Replace expensive geometry calculations with efficient coordinate swap
- Remove ST_Centroid() calculations on 610k MULTIPOLYGON fields (memory intensive)
- Use ST_FlipCoordinates() for direct X/Y coordinate swap (much more efficient)
- Increase emergency_memory_threshold to 85% (75% was too aggressive)
- Increase spatial_join_batch_size to 100k (SPATIAL_JOIN handles larger batches)

ROOT CAUSE: Converting 610k complex multipolygons to points via ST_Centroid was
causing memory bloat, not the SPATIAL_JOIN operator itself.

SPATIAL_JOIN compliance (duckdb/duckdb-spatial#545):
✅ Efficient coordinate transformation without geometry type conversion
✅ Preserves original MULTIPOLYGON geometries for accurate spatial joins
✅ Leverages spatial indexing for optimal performance
martincollignon added a commit to Klimabevaegelsen/landbruget.dk that referenced this pull request Sep 23, 2025
… compliance

MAJOR PERFORMANCE FIX: Rewrite proximity filtering to leverage SPATIAL_JOIN operator

ISSUES WITH PREVIOUS APPROACH:
- Complex nested CTE with CROSS JOIN prevented SPATIAL_JOIN optimization
- ST_DWithin in WHERE clause forced blockwise nested-loop join
- Repeated ST_Transform calls in complex aggregation queries
- Memory issues with large building datasets

SPATIAL_JOIN COMPLIANCE (PR duckdb/duckdb-spatial#545):
✅ Clean step-by-step approach (no complex nesting)
✅ Simple table-to-table JOINs with ST_Intersects
✅ Pre-transformed geometries (avoid repeated ST_Transform)
✅ Single spatial predicate per join operation
✅ Chunked processing for memory safety (10k buildings per batch)

PERFORMANCE IMPROVEMENTS:
- Pre-transform buildings to UTM Zone 32N once
- Pre-buffer agricultural fields to 100m once
- Use ST_Intersects with buffered geometries (SPATIAL_JOIN optimized)
- Process in 10k building chunks to respect memory limits
- Maintain exact same functionality with BBR code names

Expected result: Dramatic performance improvement from spatial indexing
instead of brute-force n*m comparisons on large building datasets.

References: duckdb/duckdb-spatial#545