Skip to content

GFQL named matcher collision causes runtime KeyError #818

@lmeyerov

Description

@lmeyerov

GFQL Named Matcher Collision Causes Runtime KeyError

Summary

GFQL currently assumes matcher names are unique per entity type. When two node matchers (or two edge matchers) reuse the same name=..., execution reaches combine_steps() where pandas/cuDF suffix duplicate columns (dup_x, dup_y). The code then looks up the original name (out_df[op._name]) and raises a raw KeyError. The same happens if the graph already has a column with that name before the matcher runs.

Minimal Repros

Run with PYTHONPATH=. so the local tree is used.

import pandas as pd
from graphistry.tests.test_compute import CGFull
from graphistry import n, e_forward

g = (
    CGFull()
    .nodes(pd.DataFrame({'node': [0, 1, 2, 3]}))
    .edges(pd.DataFrame({'source': [0, 1, 2], 'target': [1, 2, 3]}))
    .bind(node='node', source='source', destination='target')
)

# 1. Node/node duplicate -> KeyError
chain_node_node = [
    n({'node': 0}, name='dup'),
    e_forward(),
    n({'node': 1}, name='dup'),
]
g.gfql(chain_node_node)

# 2. Edge/edge duplicate -> KeyError
chain_edge_edge = [
    n({'node': 0}),
    e_forward(name='dup'),
    n({'node': 1}),
    e_forward(name='dup'),
    n({'node': 2}),
]
g.gfql(chain_edge_edge)

# 3. Existing column + matcher name -> KeyError
chain_existing = [
    n({'node': 0}),
    e_forward(),
    n({'node': 1}, name='node'),  # graph already has “node” column
]
g.gfql(chain_existing)

# FYI node vs edge reuse is fine (different DataFrames)
chain_node_edge = [
    n({'node': 0}, name='dup'),
    e_forward(name='dup'),
    n({'node': 1}),
]
res = g.gfql(chain_node_edge)
print(res._nodes.columns)  # ['node', 'dup']
print(res._edges.columns)  # ['dup', 'source', 'target']

Stack trace excerpt:

File ".../graphistry/compute/chain.py", line 221, in combine_steps
    s = out_df[op._name]
KeyError: 'dup'

Expected vs Actual

  • Expected: Duplicate names should be handled deterministically—either accepted via an explicit policy (e.g. OR, suffix) or rejected during validation with a descriptive error. Pre-existing column collisions should follow the same rules.
  • Actual: pandas/cuDF renames duplicates, combine_steps() still looks up the original name, and we get an unhelpful KeyError at runtime.

Proposed Direction

Based on recent exploration:

  1. Add a name_conflicts (aka conflict_policy) parameter to gfql() / chain() (and forward it from mark()). Suggested options:
    • 'any' (default): OR together duplicate matchers of the same name.
    • 'error': fail fast before execution with a GFQLSchemaError pointing to the conflicting operations / existing columns.
    • 'suffix': allow duplicates by auto-suffixing (dup, dup_1, ...), similar to DataFrame behavior but controlled and logged.
  2. The default 'any' aligns with analyst expectations (“if any matcher fires, mark it”). 'error' keeps curated templates strict, while 'suffix' supports reporting workflows that want separate columns.
  3. Even with the policy in place, continue validating pre-existing column collisions up front so the error is actionable.

Minimum Fix

Regardless of the policy chosen, guard against duplicate names before combine_steps() tries to access out_df[op._name]. Catching the collision early (or auto-resolving according to policy) avoids the raw KeyError and makes the behavior predictable.

Handling the pre-existing column case is important—users often reuse names like node or flag. Without this policy, even a single named matcher can crash when that column already exists.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions