Open
Description
Problem
GFQL WHERE planning and pruning can blow up without cheap selectivity estimates. Today we recompute domain stats per query, and we have no persistent, reusable stats tied to a graph/Plotter instance.
Proposal
Introduce a first-class stats layer on the Plotter stack (e.g., g.stats) that is serializable and lazy-loadable. Stats should be computed on demand, cached, and optionally persisted alongside a graph bundle or stored separately for reuse across sessions.
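The compute-on-demand and caching behavior described above could look roughly like the following. This is a minimal sketch, not the proposed API: `LazyStats`, `register`, `get`, and `invalidate` are hypothetical names, and the cache is in-memory only.

```python
from typing import Any, Callable, Dict, Optional


class LazyStats:
    """Hypothetical holder: each stat is computed on first access, then cached."""

    def __init__(self) -> None:
        self._computers: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[str, Any] = {}
        self.compute_calls = 0  # illustration only: counts real computations

    def register(self, name: str, fn: Callable[[], Any]) -> None:
        # Registering is free: nothing is computed until someone asks.
        self._computers[name] = fn
        self._cache.pop(name, None)

    def get(self, name: str) -> Any:
        if name not in self._cache:
            self.compute_calls += 1
            self._cache[name] = self._computers[name]()
        return self._cache[name]

    def invalidate(self, name: Optional[str] = None) -> None:
        # Drop one cached stat, or all of them if no name is given.
        if name is None:
            self._cache.clear()
        else:
            self._cache.pop(name, None)


edges = [("a", "b"), ("a", "c"), ("b", "c")]
stats = LazyStats()
stats.register("edge_count", lambda: len(edges))

first = stats.get("edge_count")   # computed here
second = stats.get("edge_count")  # served from the cache
```

An optional disk-backed mode could sit behind the same `get` call by checking a persisted store before falling back to the compute function.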
Scope / Requirements
- Serializable: JSON (or msgpack) friendly; versioned schema; safe to round-trip across Python versions.
- Lazy-loadable: compute only when requested; allow a cache-backed mode (in-memory + optional disk).
- First-class on Plotter: accessible via g.stats with explicit compute APIs; carried with Plotter/Graphistry objects and optionally included in uploads.
- DF-native: pandas + cuDF compatible; avoid .to_pandas() in hot paths.
- Optional: zero overhead unless enabled/asked for.
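To make the serialization requirement concrete, here is one possible shape for a versioned, JSON-friendly stats payload. Everything in it is an assumption for illustration: the `GraphStats` and `ColumnStats` classes, their field names, and the `schema_version` key are not part of any existing API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional

SCHEMA_VERSION = 1  # hypothetical; bump whenever the payload layout changes


@dataclass
class ColumnStats:
    ndv: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    null_fraction: Optional[float] = None


@dataclass
class GraphStats:
    node_count: Optional[int] = None
    edge_count: Optional[int] = None
    columns: Dict[str, ColumnStats] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into the nested ColumnStats values.
        payload = {"schema_version": SCHEMA_VERSION, "stats": asdict(self)}
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "GraphStats":
        payload = json.loads(text)
        version = payload.get("schema_version")
        if version != SCHEMA_VERSION:
            raise ValueError(f"unsupported stats schema version: {version!r}")
        raw = payload["stats"]
        columns = {name: ColumnStats(**c) for name, c in raw["columns"].items()}
        return cls(raw["node_count"], raw["edge_count"], columns)


original = GraphStats(
    node_count=4,
    edge_count=3,
    columns={"age": ColumnStats(ndv=2, min_value=10.0, max_value=20.0, null_fraction=0.25)},
)
restored = GraphStats.from_json(original.to_json())
```

Unset stats stay as `None`, so a payload can carry only what has actually been computed, and the version check rejects payloads written under a different layout instead of misreading them.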
Candidate Stats (common in graph engines such as Neo4j and TigerGraph)
- Table cardinalities (nodes/edges)
- Per-column NDV (approx OK; HLL-style)
- Per-column min/max + null fraction
- Degree distributions or summary stats (min/max/mean/quantiles)
- Optional: per-label or per-type stats (if labels/types exist)
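Several of the candidate stats above can be computed with ordinary dataframe operations. The sketch below uses pandas only; the helper names `column_stats` and `degree_summary` and the `src`/`dst` column names are hypothetical, NDV is exact rather than HLL-style, and whether the same calls run unchanged on cuDF is not verified here.

```python
import pandas as pd


def column_stats(df: pd.DataFrame, col: str) -> dict:
    """Per-column NDV, min/max, and null fraction (exact, not approximate)."""
    s = df[col]
    return {
        "ndv": int(s.nunique()),  # distinct non-null values
        "min": s.min(),           # nulls are skipped by default
        "max": s.max(),
        "null_fraction": float(s.isna().mean()),
    }


def degree_summary(edges: pd.DataFrame, src: str = "src", dst: str = "dst") -> dict:
    """Summary of undirected degree over nodes that appear in at least one edge."""
    endpoints = pd.concat([edges[src], edges[dst]], ignore_index=True)
    deg = endpoints.value_counts()  # occurrences per node id = degree
    return {
        "min": int(deg.min()),
        "max": int(deg.max()),
        "mean": float(deg.mean()),
        "median": float(deg.quantile(0.5)),
    }


nodes = pd.DataFrame({"id": ["a", "b", "c", "d"], "age": [10.0, 20.0, None, 20.0]})
edges = pd.DataFrame({"src": ["a", "a", "a"], "dst": ["b", "c", "d"]})

cardinalities = {"nodes": len(nodes), "edges": len(edges)}
age_stats = column_stats(nodes, "age")
deg_stats = degree_summary(edges)
```

Isolated nodes never appear in the edge table, so this degree summary omits degree-0 nodes, and a self-loop counts twice; a real implementation would need to decide how to handle both cases.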
Why (GFQL priorities)
- Clause ordering / gating based on selectivity
- Inequality bounds pruning (min/max or quantiles)
- Semijoin thresholds for domain intersections
- Query diagnostics (explain-style stats)
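As an illustration of the first two items, min/max bounds and NDV are enough for a crude selectivity estimate that can both prune impossible clauses and order the rest. This is a sketch under simplifying assumptions (uniformly distributed values, independent clauses); `estimate_selectivity`, `order_clauses`, and the `(column, op, value)` clause tuples are hypothetical and do not reflect the GFQL planner's actual interfaces.

```python
from typing import Dict, List, Tuple

Clause = Tuple[str, str, float]  # (column, operator, literal), hypothetical shape


def estimate_selectivity(op: str, value: float, stats: dict) -> float:
    """Estimated fraction of rows matching `col <op> value`, assuming uniform values."""
    lo, hi, ndv = stats["min"], stats["max"], stats["ndv"]
    non_null = 1.0 - stats.get("null_fraction", 0.0)
    if op == "==":
        if value < lo or value > hi:
            return 0.0  # outside [min, max]: cannot match
        return non_null / ndv if ndv else non_null
    if op == ">":
        if value >= hi:
            return 0.0
        if value < lo:
            return non_null
        return non_null * (hi - value) / (hi - lo)
    if op == "<":
        if value <= lo:
            return 0.0
        if value > hi:
            return non_null
        return non_null * (value - lo) / (hi - lo)
    raise ValueError(f"unsupported operator: {op}")


def order_clauses(clauses: List[Clause], stats: Dict[str, dict]) -> List[Clause]:
    """Most selective clause first, so later clauses see fewer rows."""
    return sorted(clauses, key=lambda c: estimate_selectivity(c[1], c[2], stats[c[0]]))


stats = {
    "age": {"min": 0, "max": 100, "ndv": 50, "null_fraction": 0.0},
    "score": {"min": 0.0, "max": 1.0, "ndv": 1000, "null_fraction": 0.0},
}
clauses = [("score", "<", 0.9), ("age", ">", 90), ("age", "==", 200)]
ordered = order_clauses(clauses, stats)
```

A clause whose estimate is exactly 0.0 (here `age == 200`, which lies outside the recorded bounds) can short-circuit the whole conjunction without touching the data, provided the stats are known to be current.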
Deliverables
- Stats data model + serialization format
- Plotter.stats API + lazy compute + caching hooks
- Integration points in GFQL WHERE planner/executor (behind feature flags)
- Tests for parity, persistence, and cuDF/pandas compatibility
Non-goals (initial)
- Full cost-based optimizer
- Cross-graph/global stats registry
Notes
This is intended to unblock GFQL planning and pruning work without baking in a full optimizer.