Skip to content

Conversation

lmeyerov
Copy link
Contributor

Summary

Implements mark mode for GFQL that annotates nodes/edges with boolean columns without filtering, enabling multi-stage pattern detection and visualization of match vs non-match entities.

Closes #755

Implementation

Core Features

  • Instance method: g.mark(gfql=[n({'type': 'person'})], name='is_person')
  • Call operation: g.gfql(call('mark', {'gfql': [n()], 'name': 'hit'}))
  • Boolean columns: True for matches, False for non-matches (all entities preserved)
  • Target inference: Automatically determines nodes vs edges from GFQL pattern
  • Multi-format support: Handles Chain, List[ASTObject], and JSON GFQL (remote execution)

Files Changed

  • graphistry/compute/mark.py: Core implementation with full validation
  • graphistry/compute/ComputeMixin.py: Method wrapper integration
  • graphistry/compute/gfql/call_safelist.py: Safelist registration for call('mark')
  • docs/source/gfql/builtin_calls.rst: Comprehensive documentation
  • docs/design/mark-mode-design.md: Full design document
  • graphistry/tests/compute/test_mark.py: 24 comprehensive tests

Examples

Basic Marking

# Mark VIP customers - all nodes preserved
g2 = g.mark(gfql=[n({'customer_type': 'VIP'})], name='is_vip')
# Result: g2._nodes has 'is_vip' column (True for VIPs, False for others)

Accumulating Marks

g2 = g.mark(gfql=[n({'type': 'person'})], name='is_person')
g3 = g2.mark(gfql=[n({'region': 'EMEA'})], name='is_emea')
# Result: Both 'is_person' and 'is_emea' columns present

In let() DAG

g.gfql(let({
    'marked_people': call('mark', {
        'gfql': [n({'type': 'person'})],
        'name': 'is_person'
    }),
    'marked_vips': ref('marked_people', [
        call('mark', {
            'gfql': [n({'vip': True})],
            'name': 'is_vip'
        })
    ]),
    # Filter to VIPs
    'vips': ref('marked_vips', [n({'is_vip': True})])
}))

With Visualization

g.gfql([
    call('mark', {'gfql': [n({'risk': gt(0.8)})], 'name': 'high_risk'}),
    call('encode_point_color', {
        'column': 'high_risk',
        'categorical_mapping': {True: 'red', False: 'green'}
    })
])

Testing

24 comprehensive unit tests covering:

  • ✅ Safelist registration and parameter validation
  • ✅ Basic node and edge marking
  • ✅ All-match and no-match scenarios
  • ✅ Multiple mark accumulation
  • call('mark') execution with exact graph verification
  • ✅ JSON GFQL deserialization for remote execution
  • let() DAG composition with marks
  • ✅ Edge cases (empty graphs, no nodes/edges)

All tests passing: 24/24

Enhanced test rigor for call('mark'):

  • Verifies exact boolean values for specific entities
  • Verifies all other columns preserved
  • Verifies edges unchanged (pd.testing.assert_frame_equal)
  • Confirms JSON and list GFQL produce identical results

Design Decisions

  1. Column naming: User-provided names, error on collision (not auto-increment)
  2. Boolean values: True/False (not True/NaN) for cleaner downstream use
  3. Target inference: Single method, infers nodes vs edges from GFQL pattern
  4. let() semantics: Marks accumulate across bindings naturally

Documentation

  • ✅ Full API documentation in docstrings
  • ✅ Design document: docs/design/mark-mode-design.md
  • ✅ Reference documentation: docs/source/gfql/builtin_calls.rst
  • ✅ 5 detailed examples in docs
  • ✅ Use cases: fraud detection, security analysis, social networks

Commits

  1. 49b6617e - feat(gfql): Implement mark() operation
  2. e99cc957 - test(gfql): Add comprehensive tests for mark()
  3. 4d2babac - test(gfql): Enhance call('mark') test rigor
  4. 987c2d05 - docs(gfql): Add mark() to builtin calls reference

Checklist

  • Implementation complete
  • All tests passing (24/24)
  • Documentation added
  • Design document created
  • Safelist registration
  • Remote execution support (JSON GFQL)
  • Error handling and validation
  • Examples provided

lmeyerov added a commit that referenced this pull request Oct 19, 2025
- Add type annotations for id_col union type
- Handle None cases for _node, _source, _destination bindings
- Add type narrowing assertions for node vs edge cases
- Use type: ignore for runtime-added gfql method

Fixes python-lint-types CI failures on PR #815.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
lmeyerov and others added 10 commits October 19, 2025 09:17
Add mark mode for GFQL that annotates nodes/edges with boolean
columns without filtering, enabling multi-stage pattern detection
and visualization of matches.

Implementation:
- Add mark() method to ComputeMixin with full validation
- Register mark in call_safelist.py for call('mark') support
- Handle GFQL execution (Chain/list/JSON) with proper type coercion
- Create boolean columns: True for matches, False for non-matches
- Support both node and edge marking inferred from GFQL pattern
- Comprehensive parameter validation and error messages

Design:
- User-provided column names (error on collision)
- Marks accumulate across let() bindings
- Validated against internal column name conflicts

Files:
- graphistry/compute/mark.py: Core implementation
- graphistry/compute/ComputeMixin.py: Method wrapper
- graphistry/compute/gfql/call_safelist.py: Safelist registration
- docs/design/mark-mode-design.md: Full design document

Example usage:
  g.mark(gfql=[n({'type': 'person'})], name='is_person')
  g.gfql(call('mark', {'gfql': [n()], 'name': 'hit'}))

Related: #755
Add 24 unit tests covering all mark() functionality:
- Safelist registration and parameter validation
- Basic node and edge marking
- Boolean column creation (True/False values)
- All-match and no-match scenarios
- Parameter validation (type checking, empty values, collisions)
- Internal column name protection
- Multiple mark accumulation
- call('mark') execution
- JSON GFQL deserialization for remote execution
- let() DAG composition with marks
- Chain integration
- Edge cases (no nodes, no edges)

Fixes for test failures:
- Added empty gfql validation before chain access
- Convert ValueError to GFQLSchemaError for internal column conflicts
- Updated test assertions to match actual error messages

All tests passing: 24/24

Related: #755
Add more comprehensive assertions to call() tests:
- Verify exact boolean values for specific nodes (not just counts)
- Verify other columns are preserved
- Verify edges are unchanged (pd.testing.assert_frame_equal)
- Compare JSON vs list GFQL forms produce identical results

This addresses testing gap where call() tests were less rigorous
than direct mark() tests. Now we have high confidence that
call('mark') returns the exact expected graph structure.

All tests still passing: 24/24
Add comprehensive documentation for the new mark() call operation:
- Parameter table with types and requirements
- Five detailed examples showing various use cases
- Use cases for fraud detection, security, social network analysis
- Comparison table: mark vs filter operations
- Schema effects and implementation notes
- Integration with let(), visualization, and multi-mark workflows

Placed in 'Filtering and Transformation Methods' section
after filter_edges_by_dict and before hop.

Related: #755
- Add type annotations for id_col union type
- Handle None cases for _node, _source, _destination bindings
- Add type narrowing assertions for node vs edge cases
- Use type: ignore for runtime-added gfql method

Fixes python-lint-types CI failures on PR #815.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Use noqa: E712 comments for pandas boolean comparisons where == True/False
is necessary for testing exact values (numpy bool != Python bool identity).

Also simplified some assertions to use truthiness where appropriate.

Fixes CI python-lint-types failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move docs/design/mark-mode-design.md to plans/feat-755-mark-mode/design.md
- Design docs should be in plans/, not committed to master
- Keep essential intro, parameters, and schema info
- Reduce from 5 examples to 2 focused examples
- Simplify use cases to single line
- Remove verbose comparison table
- Target: ~50-70 lines (achieved: 58 lines)
- gfql: Validate list contains dicts with 'type' key (AST objects)
- gfql: Validate dict contains 'chain' key (Chain JSON)
- engine: Validate value is in valid engine set (pandas/cudf/dask/dask_cudf/auto)
- Prevents malformed parameters from reaching execution
- Change name parameter from required to optional (defaults to None)
- Generate default names: 'is_matched_node' or 'is_matched_edge'
- Update safelist to accept list of AST objects or Chain JSON
- Update tests to verify default name generation
- Update documentation to reflect optional parameter

Addresses user feedback: 'let's not make 'name' required, there's
probably some reasonable default name we can use'
- Add is_engine_param() validator accepting both EngineAbstract enum and strings
- Update all operations with engine parameters to use new validator:
  * get_degrees, mark, materialize_nodes, hop, fa2_layout
  * group_in_a_box_layout, hypergraph
- Maintains backward compatibility with string literals
- Note: umap 'engine' param uses different values (umap backend), left as-is

Addresses user feedback: 'engine isn't just string, we probably have
the same issue in others too? should accept EngineAbstract or str.
Audit other operations for the same issue.'
- Fix node matching: Convert matched IDs to list for cross-engine isin()
- Fix edge matching: Replace tuple-based with merge-based approach
- Add index preservation: __row_idx__ + sort_values + reset_index
- Add test_mark_cudf.py with 9 comprehensive GPU tests
- Tests pass: 26 pandas + 9 cuDF = 35 total
- Verified with NVIDIA GeForce RTX 3080 Ti via Docker

Related: #755, PR #815
Plans directory is gitignored and should not be tracked.
File remains in local plans/ directory but will not be pushed to remote.

This fixes the mistake from commit where the file was incorrectly
added to git tracking despite plans/ being in .gitignore.
@lmeyerov
Copy link
Contributor Author

revisit later, see #820

@lmeyerov lmeyerov closed this Oct 20, 2025
@lmeyerov lmeyerov deleted the feat/755-mark-mode branch October 20, 2025 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Support marking mode in addition to filtering mode for ASTRef chain operations

1 participant