Add serializable, lazy-loadable graph stats on Plotter for GFQL optimization #900

@lmeyerov

Problem

GFQL WHERE planning and pruning can blow up in cost without cheap selectivity estimates. Today we recompute domain stats on every query, and no persistent, reusable stats are tied to a graph/Plotter instance.

Proposal

Introduce a first-class stats layer on the Plotter stack (e.g., g.stats) that is serializable and lazy-loadable. Stats should be computed on demand, cached, and optionally persisted alongside a graph bundle or stored separately for reuse across sessions.

Scope / Requirements

  • Serializable: JSON (or msgpack) friendly; versioned schema; safe to round-trip across Python versions.
  • Lazy-loadable: compute only when requested; allow a cache-backed mode (in-memory + optional disk).
  • First-class on Plotter: accessible via g.stats with explicit compute APIs; carried with Plotter/Graphistry objects and optionally included in uploads.
  • DF-native: pandas + cuDF compatible; avoid .to_pandas() in hot paths.
  • Optional: zero overhead unless enabled/asked for.
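
To make the requirements above concrete, here is a minimal sketch of a lazy, versioned, JSON-serializable container. Everything in it is hypothetical: `GraphStats`, `SCHEMA_VERSION`, `compute_fns`, and the payload layout are illustrative names, not an existing PyGraphistry API, and the final `g.stats` surface may look quite different.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

SCHEMA_VERSION = 1  # hypothetical schema version tag


@dataclass
class GraphStats:
    """Hypothetical lazy stats container (illustration only)."""

    # name -> zero-argument function that computes that stat
    compute_fns: Dict[str, Callable[[], Any]] = field(default_factory=dict)
    # name -> already-computed, JSON-friendly value
    _cache: Dict[str, Any] = field(default_factory=dict)

    def get(self, name: str) -> Any:
        # Compute on first request only, then serve from the in-memory cache.
        if name not in self._cache:
            self._cache[name] = self.compute_fns[name]()
        return self._cache[name]

    def to_json(self) -> str:
        # Serialize only what has been computed so far; nothing is forced.
        doc = {"schema_version": SCHEMA_VERSION, "stats": self._cache}
        return json.dumps(doc, sort_keys=True)

    @classmethod
    def from_json(
        cls,
        payload: str,
        compute_fns: Optional[Dict[str, Callable[[], Any]]] = None,
    ) -> "GraphStats":
        doc = json.loads(payload)
        if doc.get("schema_version") != SCHEMA_VERSION:
            raise ValueError("unsupported stats schema version")
        return cls(compute_fns=dict(compute_fns or {}), _cache=dict(doc["stats"]))
```

With this shape, a graph that never calls `get` pays nothing, and a reloaded instance answers from the persisted values without touching the dataframes.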

Candidate Stats (common in graph engines such as Neo4j and TigerGraph)

  • Table cardinalities (nodes/edges)
  • Per-column NDV (approx OK; HLL-style)
  • Per-column min/max + null fraction
  • Degree distributions or summary stats (min/max/mean/quantiles)
  • Optional: per-label or per-type stats (if labels/types exist)
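
As a rough illustration of the candidate stats above, the sketch below computes them with plain pandas. The function names and the `src`/`dst` column names are assumptions for the example. NDV here is exact via `nunique`; an HLL-style sketch would replace it for large columns. cuDF mirrors much of this pandas API, but that path is not verified here.

```python
import pandas as pd


def _py(v):
    # Convert NumPy scalars to plain Python values so the result is JSON-friendly.
    return v.item() if hasattr(v, "item") else v


def column_stats(df: pd.DataFrame) -> dict:
    """Per-column NDV, min/max, and null fraction."""
    out = {}
    for col in df.columns:
        s = df[col]
        non_null = s.dropna()
        out[col] = {
            "ndv": int(s.nunique()),  # exact count; nulls are not counted
            "null_frac": float(s.isna().mean()) if len(s) else 0.0,
            "min": _py(non_null.min()) if len(non_null) else None,
            "max": _py(non_null.max()) if len(non_null) else None,
        }
    return out


def degree_summary(edges: pd.DataFrame, src: str = "src", dst: str = "dst") -> dict:
    """Summary of total (in + out) degree over nodes that appear in any edge."""
    if len(edges) == 0:
        return {"min": 0, "max": 0, "mean": 0.0, "p50": 0.0}
    deg = pd.concat([edges[src], edges[dst]]).value_counts()
    return {
        "min": int(deg.min()),
        "max": int(deg.max()),
        "mean": float(deg.mean()),
        "p50": float(deg.quantile(0.5)),
    }


nodes = pd.DataFrame({"id": ["a", "b", "c", "d"], "age": [10, None, 30, 30]})
edges = pd.DataFrame({"src": ["a", "a", "a"], "dst": ["b", "c", "d"]})

stats = {
    "cardinality": {"nodes": len(nodes), "edges": len(edges)},
    "node_columns": column_stats(nodes),
    "degree": degree_summary(edges),
}
```

Isolated nodes never appear in the edge table, so this degree summary ignores them; joining against the node table would be needed to count zero-degree nodes.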

Why (GFQL priorities)

  • Clause ordering / gating based on selectivity
  • Inequality bounds pruning (min/max or quantiles)
  • Semijoin thresholds for domain intersections
  • Query diagnostics (explain-style stats)
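
A sketch of how a planner might consume such stats for the first two items above: min/max bounds pruning and selectivity-based clause ordering. The clause shape `(column, op, value)` and all function names are hypothetical, and the estimates use the textbook uniform-distribution assumption rather than anything GFQL currently implements.

```python
from typing import Dict, List, Tuple

Clause = Tuple[str, str, float]  # (column, op, value); hypothetical shape


def can_prune(op: str, value: float, st: dict) -> bool:
    """True only when min/max prove that no non-null row can match."""
    lo, hi = st["min"], st["max"]
    if lo is None or hi is None:  # all-null column
        return True
    return {
        "==": value < lo or value > hi,
        "<": value <= lo,
        "<=": value < lo,
        ">": value >= hi,
        ">=": value > hi,
    }[op]


def estimate_selectivity(op: str, value: float, st: dict) -> float:
    """Estimated fraction of rows that pass, assuming uniformly spread values."""
    if can_prune(op, value, st):
        return 0.0
    non_null = 1.0 - st.get("null_frac", 0.0)
    floor = 1.0 / max(st.get("ndv", 1), 1)  # at least one distinct value matches
    if op == "==":
        return non_null * floor
    lo, hi = st["min"], st["max"]
    if hi == lo:
        return non_null
    frac = (value - lo) / (hi - lo)
    if op in (">", ">="):
        frac = 1.0 - frac
    return non_null * min(max(frac, floor), 1.0)


def order_clauses(clauses: List[Clause], col_stats: Dict[str, dict]) -> List[Clause]:
    # Most selective first, so later clauses see fewer rows.
    return sorted(
        clauses,
        key=lambda c: estimate_selectivity(c[1], c[2], col_stats[c[0]]),
    )


col_stats = {
    "age": {"ndv": 50, "min": 0, "max": 100, "null_frac": 0.0},
    "score": {"ndv": 10, "min": 0.0, "max": 1.0, "null_frac": 0.5},
}
clauses = [("age", "<", 90), ("score", "==", 0.5), ("age", ">", 200)]
ordered = order_clauses(clauses, col_stats)
```

Keeping `can_prune` separate from the estimate matters: an estimate may be wrong without harm, but a prune decision has to be provably safe from the bounds alone.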

Deliverables

  1. Stats data model + serialization format
  2. Plotter.stats API + lazy compute + caching hooks
  3. Integration points in GFQL WHERE planner/executor (behind feature flags)
  4. Tests for parity, persistence, and cuDF/pandas compatibility
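
For the caching hooks and the persistence tests, one possible shape is a disk cache keyed by a content fingerprint of the dataframe, so stats are reused across sessions only when the data is unchanged. This is an assumption about the design, not a decided approach; `fingerprint`, `load_or_compute`, and the file naming are hypothetical, and the sketch is pandas-only (`pd.util.hash_pandas_object` is a pandas utility).

```python
import hashlib
import json
import os
import tempfile
from typing import Callable

import pandas as pd


def fingerprint(df: pd.DataFrame) -> str:
    """Content hash over column names and per-row hashes."""
    h = hashlib.sha256()
    h.update(",".join(map(str, df.columns)).encode("utf-8"))
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    h.update(row_hashes.to_numpy().tobytes())
    return h.hexdigest()


def load_or_compute(
    df: pd.DataFrame,
    compute: Callable[[pd.DataFrame], dict],
    cache_dir: str,
) -> dict:
    """Return cached stats for this exact data if present, else compute and persist."""
    path = os.path.join(cache_dir, f"stats-{fingerprint(df)}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    stats = compute(df)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(stats, f)
    return stats


# Usage: the second call is served from disk without recomputing.
calls = []


def row_count(df: pd.DataFrame) -> dict:
    calls.append(1)
    return {"rows": len(df)}


df = pd.DataFrame({"id": ["a", "b", "c"]})
with tempfile.TemporaryDirectory() as d:
    first = load_or_compute(df, row_count, d)
    second = load_or_compute(df, row_count, d)
```

A persistence test can then assert that the computed and reloaded stats are equal and that the compute function ran once.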

Non-goals (initial)

  • Full cost-based optimizer
  • Cross-graph/global stats registry

Notes

This is intended to unblock GFQL planning and pruning work without baking in a full optimizer.
