-
Notifications
You must be signed in to change notification settings - Fork 217
Description
GFQL Named Matcher Collision Causes Runtime KeyError
Summary
GFQL currently assumes matcher names are unique per entity type. When two node matchers (or two edge matchers) reuse the same name=...
, execution reaches combine_steps()
where pandas/cuDF suffix duplicate columns (dup_x
, dup_y
). The code then looks up the original name (out_df[op._name]
) and raises a raw KeyError
. The same happens if the graph already has a column with that name before the matcher runs.
Minimal Repros
Run with PYTHONPATH=.
so the local tree is used.
import pandas as pd
from graphistry.tests.test_compute import CGFull
from graphistry import n, e_forward
g = (
CGFull()
.nodes(pd.DataFrame({'node': [0, 1, 2, 3]}))
.edges(pd.DataFrame({'source': [0, 1, 2], 'target': [1, 2, 3]}))
.bind(node='node', source='source', destination='target')
)
# 1. Node/node duplicate -> KeyError
chain_node_node = [
n({'node': 0}, name='dup'),
e_forward(),
n({'node': 1}, name='dup'),
]
g.gfql(chain_node_node)
# 2. Edge/edge duplicate -> KeyError
chain_edge_edge = [
n({'node': 0}),
e_forward(name='dup'),
n({'node': 1}),
e_forward(name='dup'),
n({'node': 2}),
]
g.gfql(chain_edge_edge)
# 3. Existing column + matcher name -> KeyError
chain_existing = [
n({'node': 0}),
e_forward(),
n({'node': 1}, name='node'), # graph already has “node” column
]
g.gfql(chain_existing)
# FYI node vs edge reuse is fine (different DataFrames)
chain_node_edge = [
n({'node': 0}, name='dup'),
e_forward(name='dup'),
n({'node': 1}),
]
res = g.gfql(chain_node_edge)
print(res._nodes.columns) # ['node', 'dup']
print(res._edges.columns) # ['dup', 'source', 'target']
Stack trace excerpt:
File ".../graphistry/compute/chain.py", line 221, in combine_steps
s = out_df[op._name]
KeyError: 'dup'
Expected vs Actual
- Expected: Duplicate names should be handled deterministically—either accepted via an explicit policy (e.g. OR, suffix) or rejected during validation with a descriptive error. Pre-existing column collisions should follow the same rules.
- Actual: pandas/cuDF renames duplicates,
combine_steps()
still looks up the original name, and we get an unhelpfulKeyError
at runtime.
Proposed Direction
Based on recent exploration:
- Add a
name_conflicts
(akaconflict_policy
) parameter togfql()
/chain()
(and forward it frommark()
). Suggested options:'any'
(default): OR together duplicate matchers of the same name.'error'
: fail fast before execution with aGFQLSchemaError
pointing to the conflicting operations / existing columns.'suffix'
: allow duplicates by auto-suffixing (dup
,dup_1
, ...), similar to DataFrame behavior but controlled and logged.
- The default
'any'
aligns with analyst expectations (“if any matcher fires, mark it”).'error'
keeps curated templates strict, while'suffix'
supports reporting workflows that want separate columns. - Even with the policy in place, continue validating pre-existing column collisions up front so the error is actionable.
Minimum Fix
Regardless of the policy chosen, guard against duplicate names before combine_steps()
tries to access out_df[op._name]
. Catching the collision early (or auto-resolving according to policy) avoids the raw KeyError and makes the behavior predictable.
Handling the pre-existing column case is important—users often reuse names like node
or flag
. Without this policy, even a single named matcher can crash when that column already exists.