feat(gfql): Implement mark() operation for pattern annotation (#755) #815

lmeyerov · 2025-10-19T08:07:02Z

Summary

Implements mark mode for GFQL that annotates nodes/edges with boolean columns without filtering, enabling multi-stage pattern detection and visualization of match vs non-match entities.

Closes #755

Implementation

Core Features

Instance method: g.mark(gfql=[n({'type': 'person'})], name='is_person')
Call operation: g.gfql(call('mark', {'gfql': [n()], 'name': 'hit'}))
Boolean columns: True for matches, False for non-matches (all entities preserved)
Target inference: Automatically determines nodes vs edges from GFQL pattern
Multi-format support: Handles Chain, List[ASTObject], and JSON GFQL (remote execution)
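
A rough sketch of how target inference could work, assuming the pattern arrives as JSON-style dicts whose 'type' is 'Node' or 'Edge' and that the final step decides the target. The helper name `infer_mark_target` and the last-step rule are assumptions for illustration, not the confirmed logic of this PR.

```python
from typing import Any, Dict, List


def infer_mark_target(pattern: List[Dict[str, Any]]) -> str:
    """Return 'nodes' or 'edges' based on the final step of the pattern."""
    if not pattern:
        raise ValueError("gfql pattern must not be empty")
    last_type = pattern[-1].get("type")
    if last_type == "Node":
        return "nodes"
    if last_type == "Edge":
        return "edges"
    raise ValueError(f"cannot infer target from step type {last_type!r}")


# Hypothetical JSON-style patterns
node_pattern = [{"type": "Node", "filter_dict": {"type": "person"}}]
edge_pattern = [{"type": "Node"}, {"type": "Edge", "direction": "forward"}]

node_target = infer_mark_target(node_pattern)
edge_target = infer_mark_target(edge_pattern)
```

Under this assumption a single entry point can serve both cases, which matches the "single method" design choice listed under Design Decisions.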

Files Changed

graphistry/compute/mark.py: Core implementation with full validation
graphistry/compute/ComputeMixin.py: Method wrapper integration
graphistry/compute/gfql/call_safelist.py: Safelist registration for call('mark')
docs/source/gfql/builtin_calls.rst: Comprehensive documentation
docs/design/mark-mode-design.md: Full design document
graphistry/tests/compute/test_mark.py: 24 comprehensive tests

Examples

Basic Marking

# Mark VIP customers - all nodes preserved
g2 = g.mark(gfql=[n({'customer_type': 'VIP'})], name='is_vip')
# Result: g2._nodes has 'is_vip' column (True for VIPs, False for others)
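
In plain pandas terms, marking amounts to adding a boolean membership column instead of dropping rows. The sketch below is illustrative only: `mark_nodes` is a hypothetical helper, not the function in graphistry/compute/mark.py, and it assumes the matched IDs are already known.

```python
import pandas as pd


def mark_nodes(nodes: pd.DataFrame, matched_ids, id_col: str, name: str) -> pd.DataFrame:
    """Return a copy of `nodes` with a boolean column flagging matched IDs."""
    out = nodes.copy()
    # Passing a plain list keeps isin() portable across dataframe engines
    out[name] = out[id_col].isin(list(matched_ids))
    return out


nodes = pd.DataFrame({
    "id": ["a", "b", "c"],
    "customer_type": ["VIP", "basic", "VIP"],
})
vip_ids = nodes.loc[nodes["customer_type"] == "VIP", "id"]
marked = mark_nodes(nodes, vip_ids, id_col="id", name="is_vip")
```

Every row survives; only the new `is_vip` column distinguishes matches from non-matches.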

Accumulating Marks

g2 = g.mark(gfql=[n({'type': 'person'})], name='is_person')
g3 = g2.mark(gfql=[n({'region': 'EMEA'})], name='is_emea')
# Result: Both 'is_person' and 'is_emea' columns present

In let() DAG

g.gfql(let({
    'marked_people': call('mark', {
        'gfql': [n({'type': 'person'})],
        'name': 'is_person'
    }),
    'marked_vips': ref('marked_people', [
        call('mark', {
            'gfql': [n({'vip': True})],
            'name': 'is_vip'
        })
    ]),
    # Filter to VIPs
    'vips': ref('marked_vips', [n({'is_vip': True})])
}))

With Visualization

g.gfql([
    call('mark', {'gfql': [n({'risk': gt(0.8)})], 'name': 'high_risk'}),
    call('encode_point_color', {
        'column': 'high_risk',
        'categorical_mapping': {True: 'red', False: 'green'}
    })
])

Testing

24 comprehensive unit tests covering:

✅ Safelist registration and parameter validation
✅ Basic node and edge marking
✅ All-match and no-match scenarios
✅ Multiple mark accumulation
✅ call('mark') execution with exact graph verification
✅ JSON GFQL deserialization for remote execution
✅ let() DAG composition with marks
✅ Edge cases (empty graphs, no nodes/edges)

All tests passing: 24/24 ✅

Enhanced test rigor for call('mark'):

Verifies exact boolean values for specific entities
Verifies all other columns preserved
Verifies edges unchanged (pd.testing.assert_frame_equal)
Confirms JSON and list GFQL produce identical results
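
The assertions listed above might look roughly like this in a test. `mark_stub` is a hypothetical stand-in for the real mark() so the snippet runs without a graph object; only the assertion style is the point.

```python
import pandas as pd


def mark_stub(nodes: pd.DataFrame, edges: pd.DataFrame, type_value: str, name: str):
    """Hypothetical stand-in for mark(): flags nodes, returns edges untouched."""
    out = nodes.copy()
    out[name] = out["type"] == type_value
    return out, edges


nodes = pd.DataFrame({"id": ["a", "b"], "type": ["person", "org"]})
edges = pd.DataFrame({"src": ["a"], "dst": ["b"]})
marked_nodes, marked_edges = mark_stub(nodes, edges, "person", "is_person")

# Exact per-entity values, not just counts
by_id = marked_nodes.set_index("id")["is_person"]
assert bool(by_id["a"]) is True
assert bool(by_id["b"]) is False

# All original columns preserved
pd.testing.assert_frame_equal(marked_nodes[nodes.columns], nodes)

# Edges unchanged
pd.testing.assert_frame_equal(marked_edges, edges)
```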

Design Decisions

Column naming: User-provided names, error on collision (not auto-increment)
Boolean values: True/False (not True/NaN) for cleaner downstream use
Target inference: Single method, infers nodes vs edges from GFQL pattern
let() semantics: Marks accumulate across bindings naturally
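
To make the first two decisions concrete, here is a small pandas sketch. It contrasts a left join, which leaves NaN for non-matches, with a membership test that yields strict True/False, and shows a collision check. `check_collision` and the column names are hypothetical.

```python
import pandas as pd

nodes = pd.DataFrame({"id": [1, 2, 3]})
matches = pd.DataFrame({"id": [2], "hit": [True]})

# A naive left join leaves NaN for the rows that did not match
naive = nodes.merge(matches, on="id", how="left")["hit"]

# A membership test gives a strict True/False column
strict = nodes["id"].isin(matches["id"].tolist())


def check_collision(df: pd.DataFrame, name: str) -> None:
    """Reject a mark name that would overwrite an existing column."""
    if name in df.columns:
        raise ValueError(f"mark column {name!r} already exists")
```

A strict boolean column can be negated or used as a filter mask directly, whereas the True/NaN form needs a fill step first.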

Documentation

✅ Full API documentation in docstrings
✅ Design document: docs/design/mark-mode-design.md
✅ Reference documentation: docs/source/gfql/builtin_calls.rst
✅ 5 detailed examples in docs
✅ Use cases: fraud detection, security analysis, social networks

Commits

49b6617e - feat(gfql): Implement mark() operation
e99cc957 - test(gfql): Add comprehensive tests for mark()
4d2babac - test(gfql): Enhance call('mark') test rigor
987c2d05 - docs(gfql): Add mark() to builtin calls reference

Checklist

Add mark mode for GFQL that annotates nodes/edges with boolean columns without filtering, enabling multi-stage pattern detection and visualization of matches.

Implementation:
- Add mark() method to ComputeMixin with full validation
- Register mark in call_safelist.py for call('mark') support
- Handle GFQL execution (Chain/list/JSON) with proper type coercion
- Create boolean columns: True for matches, False for non-matches
- Support both node and edge marking inferred from GFQL pattern
- Comprehensive parameter validation and error messages

Design:
- User-provided column names (error on collision)
- Marks accumulate across let() bindings
- Validated against internal column name conflicts

Files:
- graphistry/compute/mark.py: Core implementation
- graphistry/compute/ComputeMixin.py: Method wrapper
- graphistry/compute/gfql/call_safelist.py: Safelist registration
- docs/design/mark-mode-design.md: Full design document

Example usage:
g.mark(gfql=[n({'type': 'person'})], name='is_person')
g.gfql(call('mark', {'gfql': [n()], 'name': 'hit'}))

Related: #755

Add 24 unit tests covering all mark() functionality:
- Safelist registration and parameter validation
- Basic node and edge marking
- Boolean column creation (True/False values)
- All-match and no-match scenarios
- Parameter validation (type checking, empty values, collisions)
- Internal column name protection
- Multiple mark accumulation
- call('mark') execution
- JSON GFQL deserialization for remote execution
- let() DAG composition with marks
- Chain integration
- Edge cases (no nodes, no edges)

Fixes for test failures:
- Added empty gfql validation before chain access
- Convert ValueError to GFQLSchemaError for internal column conflicts
- Updated test assertions to match actual error messages

All tests passing: 24/24

Related: #755

Add more comprehensive assertions to call() tests:
- Verify exact boolean values for specific nodes (not just counts)
- Verify other columns are preserved
- Verify edges are unchanged (pd.testing.assert_frame_equal)
- Compare JSON vs list GFQL forms produce identical results

This addresses a testing gap where call() tests were less rigorous than direct mark() tests. Now we have high confidence that call('mark') returns the exact expected graph structure.

All tests still passing: 24/24

Add comprehensive documentation for the new mark() call operation:
- Parameter table with types and requirements
- Five detailed examples showing various use cases
- Use cases for fraud detection, security, social network analysis
- Comparison table: mark vs filter operations
- Schema effects and implementation notes
- Integration with let(), visualization, and multi-mark workflows

Placed in 'Filtering and Transformation Methods' section after filter_edges_by_dict and before hop.

Related: #755

- Add type annotations for id_col union type
- Handle None cases for _node, _source, _destination bindings
- Add type narrowing assertions for node vs edge cases
- Use type: ignore for runtime-added gfql method

Fixes python-lint-types CI failures on PR #815.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Use noqa: E712 comments for pandas boolean comparisons where == True/False is necessary for testing exact values (numpy bool != Python bool identity). Also simplified some assertions to use truthiness where appropriate.

Fixes CI python-lint-types failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
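
The identity issue mentioned in this commit can be reproduced in a few lines. This is a standalone illustration, not code from the test suite; the series name is made up.

```python
import pandas as pd

marks = pd.Series([True, False], name="is_person")
first = marks.iloc[0]  # a NumPy boolean scalar, not the Python True singleton

identity_check = first is True   # identity fails even though the value is true
equality_check = first == True   # noqa: E712  (compares by value)
truthy_check = bool(first)       # plain truthiness also works
```

Because `is True` fails on values pulled out of a boolean column, tests that need the exact value either compare with `==` and silence E712, or convert with `bool()` first.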

- Move docs/design/mark-mode-design.md to plans/feat-755-mark-mode/design.md
- Design docs should be in plans/, not committed to master

- Keep essential intro, parameters, and schema info
- Reduce from 5 examples to 2 focused examples
- Simplify use cases to single line
- Remove verbose comparison table
- Target: ~50-70 lines (achieved: 58 lines)

- gfql: Validate list contains dicts with 'type' key (AST objects)
- gfql: Validate dict contains 'chain' key (Chain JSON)
- engine: Validate value is in valid engine set (pandas/cudf/dask/dask_cudf/auto)
- Prevents malformed parameters from reaching execution
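
A minimal sketch of what such safelist validators could look like. The function names are hypothetical, and rejecting an empty list is an assumption carried over from the earlier "empty gfql validation" fix rather than something this commit states.

```python
from typing import Any

# Engine names taken from the commit message above
VALID_ENGINES = {"pandas", "cudf", "dask", "dask_cudf", "auto"}


def is_valid_gfql_param(value: Any) -> bool:
    """Accept a non-empty list of AST-style dicts or a Chain-style dict."""
    if isinstance(value, list):
        return len(value) > 0 and all(
            isinstance(item, dict) and "type" in item for item in value
        )
    if isinstance(value, dict):
        return "chain" in value
    return False


def is_valid_engine(value: Any) -> bool:
    """Accept only the known engine strings."""
    return isinstance(value, str) and value in VALID_ENGINES
```

Checking shape up front means a malformed payload is rejected with a validation error before any graph computation starts.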

- Change name parameter from required to optional (defaults to None)
- Generate default names: 'is_matched_node' or 'is_matched_edge'
- Update safelist to accept list of AST objects or Chain JSON
- Update tests to verify default name generation
- Update documentation to reflect optional parameter

Addresses user feedback: 'let's not make 'name' required, there's probably some reasonable default name we can use'
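
The default-name rule could be expressed as below. `resolve_mark_name` and the 'nodes'/'edges' target strings are hypothetical; only the two default column names come from the commit message.

```python
from typing import Optional


def resolve_mark_name(name: Optional[str], target: str) -> str:
    """Use the caller's name if given, else a default based on the target."""
    if name is not None:
        return name
    if target == "nodes":
        return "is_matched_node"
    if target == "edges":
        return "is_matched_edge"
    raise ValueError(f"unknown target {target!r}")


default_node_name = resolve_mark_name(None, "nodes")
default_edge_name = resolve_mark_name(None, "edges")
explicit_name = resolve_mark_name("is_vip", "nodes")
```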

- Add is_engine_param() validator accepting both EngineAbstract enum and strings
- Update all operations with engine parameters to use new validator:
  * get_degrees, mark, materialize_nodes, hop, fa2_layout
  * group_in_a_box_layout, hypergraph
- Maintains backward compatibility with string literals
- Note: umap 'engine' param uses different values (umap backend), left as-is

Addresses user feedback: 'engine isn't just string, we probably have the same issue in others too? should accept EngineAbstract or str. Audit other operations for the same issue.'
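
One way such a validator might be written. `EngineStandIn` is a hypothetical enum used here so the snippet is self-contained; it is not the real EngineAbstract, and the body of `is_engine_param` is a guess at the behavior described, not the merged code.

```python
from enum import Enum
from typing import Any


class EngineStandIn(Enum):
    """Hypothetical stand-in for the library's engine enum."""
    PANDAS = "pandas"
    CUDF = "cudf"
    DASK = "dask"
    DASK_CUDF = "dask_cudf"
    AUTO = "auto"


def is_engine_param(value: Any) -> bool:
    """Accept either an enum member or one of its string values."""
    if isinstance(value, EngineStandIn):
        return True
    if isinstance(value, str):
        return value in {member.value for member in EngineStandIn}
    return False
```

Accepting both forms keeps existing string-based callers working while allowing enum values to pass the safelist.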

- Fix node matching: Convert matched IDs to list for cross-engine isin()
- Fix edge matching: Replace tuple-based with merge-based approach
- Add index preservation: __row_idx__ + sort_values + reset_index
- Add test_mark_cudf.py with 9 comprehensive GPU tests
- Tests pass: 26 pandas + 9 cuDF = 35 total
- Verified with NVIDIA GeForce RTX 3080 Ti via Docker

Related: #755, PR #815
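
A pandas-only sketch of the merge-based edge matching with row-order preservation described in this commit. `mark_edges` and the column names are hypothetical, and matching on the (source, destination) pair alone is a simplification: it flags every parallel edge between the same endpoints.

```python
import pandas as pd


def mark_edges(edges: pd.DataFrame, matched: pd.DataFrame,
               src: str, dst: str, name: str) -> pd.DataFrame:
    """Flag edges whose (src, dst) pair appears in `matched`, keeping row order."""
    # One row per matched endpoint pair, tagged True
    keys = matched[[src, dst]].drop_duplicates().assign(**{name: True})

    out = edges.copy()
    out["__row_idx__"] = range(len(out))  # remember the original order

    # A left merge keeps every edge; non-matches get NaN in the flag column
    out = out.merge(keys, on=[src, dst], how="left")
    out[name] = out[name].notna()

    # Some engines may not keep row order through a merge, so restore it
    out = out.sort_values("__row_idx__").reset_index(drop=True)
    return out.drop(columns="__row_idx__")


edges = pd.DataFrame({"s": ["a", "b", "c"], "d": ["b", "c", "a"], "w": [1, 2, 3]})
heavy = edges[edges["w"] >= 2]
marked = mark_edges(edges, heavy, src="s", dst="d", name="is_heavy")
```

The merge avoids building Python tuples per row, which is the kind of operation that tends not to carry over to GPU dataframes.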

Plans directory is gitignored and should not be tracked. File remains in local plans/ directory but will not be pushed to remote. This fixes the mistake from commit where the file was incorrectly added to git tracking despite plans/ being in .gitignore.

lmeyerov · 2025-10-20T04:25:54Z

revisit later, see #820

lmeyerov and others added 10 commits October 19, 2025 09:17

refactor: Move design doc to plans/ directory

bea794f

- Move docs/design/mark-mode-design.md to plans/feat-755-mark-mode/design.md
- Design docs should be in plans/, not committed to master

docs: Add mark() feature to CHANGELOG.md Development section

8718fcf

docs: Reduce mark() documentation from 135 to 58 lines

327e441

- Keep essential intro, parameters, and schema info
- Reduce from 5 examples to 2 focused examples
- Simplify use cases to single line
- Remove verbose comparison table
- Target: ~50-70 lines (achieved: 58 lines)

lmeyerov force-pushed the feat/755-mark-mode branch from e788de3 to 8464b81 Compare October 19, 2025 17:14

lmeyerov added 4 commits October 19, 2025 10:22

lmeyerov closed this Oct 20, 2025

lmeyerov deleted the feat/755-mark-mode branch October 20, 2025 04:26
