ECO: Add C++ Graph Edit Distance implementation with MCS optimization#3813
ECO: Add C++ Graph Edit Distance implementation with MCS optimization#3813alirezazd wants to merge 12 commits intogoogle:mainfrom
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
- Move graph data structures: graph.h/cpp -> graph.h/cc - Move GXL parser: gxl_parser.h/cpp -> gxl_parser.h/cc - Move LAP solver: lap.h/cc -> lap_solver.h/cc - Move GED algorithm: ged.h/cc - Move MCS algorithm: mcs.h/cpp -> mcs.h/cc - Move cost functions: cost_functions.h -> ged_cost_functions.h - Move main binary: main.cpp -> ged_main.cc - Move IR patch gen: ir_patch_gen.h/cc - Update all header guards to XLS_ECO_* format - Update all includes to use xls/eco/ prefix - Rename .cpp -> .cc for XLS convention This flattens the directory structure and makes the codebase more consistent with XLS conventions. The mcs_ged/ directory now only contains BUILD file and third-party libs/.
- Move tinyxml2 files into libs/tinyxml2/ subdirectory - Add scipy_lap in libs/scipy_lap/ subdirectory - Update BUILD paths for new lib structure - Each third-party lib now has its own directory
- graph.h/cc: XLSGraph, XLSNode, XLSEdge for representing XLS IR as graphs - gxl_parser.h/cc: Parse GXL (Graph eXchange Language) format - Foundation for cpp graph edit distance algorithm
- lap_solver.cc: Wrapper for scipy LAP solver - Used by GED algorithm for optimal node matching
- ged.h/cc: Core GED algorithm with cost matrix and LAP integration - Supports node/edge substitution, insertion, deletion costs - Handles pinned nodes (nodes that must match)
- Add cc_library targets: graph, gxl_parser, lap_solver, ged, mcs - Add cc_binary: ged_main - Add ir_patch_gen_lib target - All GED infrastructure now buildable
- mcs.h/cc: Find maximum common subgraph between two graphs - Preprocessing step that accelerates GED by identifying node correspondences - Reduces GED search space significantly
- ged_main.cc: CLI tool to compute GED between two GXL graphs - Supports MCS preprocessing option - Outputs edit operations and cost
- ir_patch_gen.h/cc: Convert GED results to IR patch protobuf - ir_patch.proto: Updated proto definitions - patch_ir.h/cc/main.cc: Apply patches to XLS IR - ir_patch_gen.py: Python utilities for patch generation
- Replace requirement() with @xls_pip_deps// for old Python tools - Update Receive node construction to match new upstream API signature - Add payload_type parameter required by new Receive constructor
This commit contains two types of changes: 1. Upstream updates to ir2nx.py merged during rebase (large diff) 2. Our deprecation TODOs added to Python GED toolchain: - ir_diff.py: NetworkX GED → ged.h/cc (10-100x faster) - ir_diff_main.py: Python tool → ged_main.cc CLI - ir_patch_gen.py: Python patcher → patch_ir.h/cc - ir2nx.py: NetworkX converter → xls_ir_to_cytoscape.cc - ir2gxl.py: GXL exporter → gxl_parser.h/cc (from upstream)
grebe
left a comment
There was a problem hiding this comment.
First of all: thank you! This is exciting to see. Looks like a lot of good work and a big improvement.
Here's some initial feedback, I'm still working my way through. I think I have enough here to get the ball rolling, though, so I'm sending it out. Feel free to reach out directly or go back and forth here if anything I said was unclear or if you disagree.
There was a problem hiding this comment.
Deps shouldn't be under xls/, they should be managed from dependency_support/ or (ideally) BCR. tinyxml2 looks like it's in BCR. scipy_lsap.cpp is a bit trickier b/c our python deps aren't super well integrated into bazel (it's essentially pip installed, which afaik largely hides the internals). We typically try to avoid holding on to third_party/ source in here. I believe OR-tools has some implementations of linear assignment stuff and we already pull it in as a dep - is a pain to move to using it?
| # 2. replace top receive nodes with the after ir ones while keeping their ids | ||
| # 3. run lec manually and it should pass. | ||
| # 4. TODO: Treat channles as nodes and automatically re-generate channels if needed. | ||
| xls_patch_ir( |
There was a problem hiding this comment.
If this fails, perhaps it should be commented out? We typically want test targets to pass unless we're testing that something fails
| patch_out = ctx.actions.declare_file(ctx.label.name + ".patch.bin") | ||
|
|
||
| # before.ir -> before.gxl | ||
| ctx.actions.run_shell( |
There was a problem hiding this comment.
Does this need to be run_shell? I believe run is generally preferred. Similar below
| ) | ||
|
|
||
| return [ | ||
| # Match XLS style: expose the generated patch and source IRs. |
There was a problem hiding this comment.
Not clear to me that this particularly matches XLS style, comment seems superfluous
| #include "xls/eco/graph.h" | ||
|
|
||
| namespace ged | ||
| { |
| "Timeout in seconds for the GED search (0 = Return the initial " | ||
| "solution immediately). Short alias: " | ||
| "-t."); | ||
| ABSL_FLAG( |
There was a problem hiding this comment.
Is this necessary / does it not conflict with the -v the logging library already has? I think this should not be needed.
|
|
||
| namespace { | ||
|
|
||
| std::vector<std::string> CanonicalizeArgs(int argc, char* argv[]) { |
There was a problem hiding this comment.
We generally leave abseil arg parsing to its own devices and we typically do not have short args
| } | ||
|
|
||
| absl::Status RealMain(const std::vector<std::string_view>& positional_args) { | ||
| std::string before_ir = absl::GetFlag(FLAGS_before_ir); |
There was a problem hiding this comment.
Best practice (that we don't always follow) is to absl::GetFlag() the flag in main() and pass them as args to RealMain().
| } | ||
|
|
||
| // TODO(DELETE): Print final stats for paper | ||
| std::cerr << "\n========== PAPER STATS (TODO: DELETE) ==========\n"; |
There was a problem hiding this comment.
Consider keeping this and having it be a flag that dumps stats to a file or logs them or something. Maybe the struct should have a ToString() that does all this.
| @@ -0,0 +1,442 @@ | |||
| #include "xls/eco/graph.h" | |||
There was a problem hiding this comment.
I'm realizing now that all these files should have license headers (see other files for examples)
This PR adds a C++ implementation of Graph Edit Distance (GED) for the ECO system, replacing the existing Python/NetworkX approach.
What's included
ged_main) and IR patch generationRegarding the new C++ GED
The Python version works but is slow on large graphs. This C++ implementation is faster, integrates directly with XLS IR APIs, and supports advanced features like pinned nodes and MCS optimization.
Migration
Python tools (ir_diff_main.py, ir2gxl.py) are marked for deprecation but still functional. They'll be removed in a future PR once the C++ version is proven stable.
Testing
Builds cleanly and includes tests with real DSLX examples (crc32, apfloat_fmac, riscv_simple, etc).
@grebe Please consider reviewing this PR. I tried to organize it into separate commits for convenience.