
Create explicit WCC endpoints #859


Draft
DarthMax wants to merge 12 commits into main from create_explicit_procedure_endpoints_for_wcc

Conversation

@DarthMax (Contributor) commented Mar 28, 2025

Fixes GDSA-47

This PR showcases the changes we would like to make in order to support running all algorithms via Arrow instead of Bolt for GDS Sessions.

It covers 3 things:

  • Introduction of explicit algorithm endpoints, e.g. a well-defined method for wcc.write that exposes all available parameters. This improves discoverability and coding UX, and makes it easier to implement different backends.
  • A Cypher-based implementation for these endpoints
  • An Arrow-based implementation for these endpoints

While this should be mostly transparent from a user perspective, this PR proposes some API changes that would be breaking; see the comments.
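For illustration, a rough sketch of how such an explicit endpoint might be used. The connection details and graph name are placeholders, and the call shape follows the WCC signatures and result dataclasses proposed in this PR, so treat it as a sketch rather than final API:

```python
# Illustrative sketch only: connection details and graph name are placeholders,
# and the endpoint shape follows the WCC signatures proposed in this PR.
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))
G = gds.graph.get("myGraph")  # an already-projected graph

# Explicit, typed endpoint: all parameters are named and discoverable,
# and the result is a typed object rather than a DataFrame row.
result = gds.wcc.write(
    G,
    write_property="componentId",
    relationship_types=["KNOWS"],
    concurrency=4,
)

print(result.component_count, result.write_millis)
```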

netlify bot commented Mar 28, 2025

Deploy Preview for neo4j-graph-data-science-client canceled.

🔨 Latest commit: 3bb24d8
🔍 Latest deploy log: https://app.netlify.com/sites/neo4j-graph-data-science-client/deploys/67e6cc10019b490008bf9a0f

@DarthMax force-pushed the create_explicit_procedure_endpoints_for_wcc branch from 3bb24d8 to 8554f63 on June 24, 2025 at 20:35
netlify bot commented Jun 24, 2025

Deploy Preview for neo4j-graph-data-science-client canceled.

🔨 Latest commit: ae2f8cb
🔍 Latest deploy log: https://app.netlify.com/projects/neo4j-graph-data-science-client/deploys/686403023670ee000890b6b1

    seed_property: Optional[str] = None,
    consecutive_ids: Optional[bool] = None,
    relationship_weight_property: Optional[str] = None,
) -> WccMutateResult:
@DarthMax (Contributor, author) commented Jun 24, 2025

This constitutes a breaking change. Previously we would have returned a DataFrame.
However, I think this API is much easier to use as it is stricter about types and should enable better IDE auto-completion.

Contributor

I also like the typed API more :)

Especially if we also offer a to_pandas utility (using the from_dict utility), mainly useful for display in Jupyter notebooks.

Contributor

I also prefer the typed API. I don't like breaking things, but I do like having a result type that we own and control, especially if it comes with static fields to help the user.

    computation_result["preProcessingMillis"],
    computation_result["computeMillis"],
    computation_result["postProcessingMillis"],
    0,
Contributor Author

TODO: measure the actual time

Contributor

Is this mutateMillis?

Contributor

Are we not measuring this already server-side?
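If the server does not already report this timing, a small client-side stopwatch could serve as a stopgap. A sketch only, assuming the Arrow response does not include the mutate timing:

```python
# Sketch of a client-side timing helper (assumption: the Arrow response does
# not include the mutate timing, so we measure wall-clock time around the call).
import time
from typing import Any, Callable, Tuple


def timed_millis(call: Callable[[], Any]) -> Tuple[Any, int]:
    """Run `call` and return its result together with the elapsed milliseconds."""
    start = time.monotonic()
    result = call()
    return result, int((time.monotonic() - start) * 1000)
```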

def write(
    self,
    G: Graph,
    write_property: str,
@DarthMax (Contributor, author) commented Jun 24, 2025

TODO: respect that property
This requires a change in the gds.arrow.write procedure in the database

@FlorentinD (Contributor) left a comment

Great start!

Left some comments.
I still tend to think we want to start a feature branch for this to work towards a 2.0 version of the client.

Comment on lines +20 to +25
from graphdatascience.arrow_client.arrow_authentication import ArrowAuthentication
from graphdatascience.arrow_client.arrow_info import ArrowInfo
from graphdatascience.retry_utils.retry_config import RetryConfig

from ..retry_utils.retry_utils import before_log
from ..version import __version__
Contributor

NIT: mixing import styles (absolute and relative)

Comment on lines +32 to +55
def dict_to_dataclass(data: Dict[str, Any], cls: Type[T], strict: bool = False) -> T:
    """
    Convert a dictionary to a dataclass instance with nested dataclass support.
    """
    if not dataclasses.is_dataclass(cls):
        raise ValueError(f"{cls} is not a dataclass")

    field_dict = {f.name: f for f in fields(cls)}
    filtered_data = {}

    for key, value in data.items():
        if key in field_dict:
            field = field_dict[key]
            field_type = field.type

            # Handle nested dataclasses
            if dataclasses.is_dataclass(field_type) and isinstance(value, dict):
                filtered_data[key] = DataMapper.dict_to_dataclass(value, field_type, strict)  # type: ignore
            else:
                filtered_data[key] = value
        elif strict:
            raise ValueError(f"Extra field '{key}' not allowed in {cls.__name__}")

    return cls(**filtered_data)  # type: ignore
Contributor

For this validation we could also consider using pydantic instead of rolling our own custom validation.
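A minimal sketch of the pydantic alternative, assuming pydantic v2; the field names mirror the WccMutateResult proposed in this PR:

```python
# Sketch only: pydantic v2 models could replace the hand-rolled
# dict_to_dataclass validation, including strict/extra-field handling.
from typing import Any, Dict

from pydantic import BaseModel, ConfigDict


class WccMutateResult(BaseModel):
    # extra="forbid" mirrors the strict=True behaviour above; frozen=True
    # mirrors the frozen dataclasses.
    model_config = ConfigDict(extra="forbid", frozen=True)

    component_count: int
    component_distribution: Dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: Dict[str, Any]


# Nested models and type coercion come for free:
# result = WccMutateResult.model_validate(raw_response_dict)
```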

from pandas import DataFrame

from ...graph.graph_object import Graph

Contributor

NIT: for the package name we could rather use endpoints

Comment on lines +245 to +254
@dataclass(frozen=True, repr=True)
class WccMutateResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: dict[str, Any]
Contributor

Would be neat to generate these :)

Contributor

Do you mean generating the result classes? Do you mean generating them statically before release, or generating them at runtime?

Or do you mean generating configuration classes, and using those instead of the dict[str, Any]?

@FlorentinD (Contributor) commented Jul 9, 2025

I was thinking statically: checked-in code even, plus a check that they are up to date.
Ideally both, but we'll see if it's worth it.

Contributor

So you mean the result class, and not configuration classes then?

@DarthMax (Contributor, author) commented Jul 1, 2025

I added a feature flag that enables the explicit APIs.

Comment on lines +18 to +32
def mutate(
    self,
    G: Graph,
    mutate_property: str,
    threshold: Optional[float] = None,
    relationship_types: Optional[List[str]] = None,
    node_labels: Optional[List[str]] = None,
    sudo: Optional[bool] = None,
    log_progress: Optional[bool] = None,
    username: Optional[str] = None,
    concurrency: Optional[Any] = None,
    job_id: Optional[Any] = None,
    seed_property: Optional[str] = None,
    consecutive_ids: Optional[bool] = None,
    relationship_weight_property: Optional[str] = None,
Contributor

This is a very different approach from the design principle we used for the 1.x client. Diametrically different. It is somewhat risky. I think we should hand out an early version of this API to our internal stakeholders in the field to gather feedback.

Personally, I like the explicit and Pythonic design. I believe it will be easier to use on its own. It will be somewhat more expensive to maintain and we will have to do more releases in general, but those costs will be worth it if our user base agrees with our perception that this is a better API.

Comment on lines +34 to +71
"""
Executes the WCC algorithm and writes the results to the in-memory graph as node properties.

Parameters
----------
G : Graph
The graph to run the algorithm on
mutate_property : str
The property name to store the component ID for each node
threshold : Optional[float], default=None
The minimum required weight to consider a relationship during traversal
relationship_types : Optional[List[str]], default=None
The relationship types to project
node_labels : Optional[List[str]], default=None
The node labels to project
sudo : Optional[bool], default=None
Run analysis with admin permission
log_progress : Optional[bool], default=None
Whether to log progress
username : Optional[str], default=None
The username to attribute the procedure run to
concurrency : Optional[Any], default=None
The number of concurrent threads
job_id : Optional[Any], default=None
An identifier for the job
seed_property : Optional[str], default=None
Defines node properties that are used as initial component identifiers
consecutive_ids : Optional[bool], default=None
Flag to decide whether component identifiers are mapped into a consecutive id space
relationship_weight_property : Optional[str], default=None
The property name that contains weight

Returns
-------
WccMutateResult
Algorithm metrics and statistics
"""
pass
Contributor

I suppose this is the "extended documentation burden" that I was highlighting in the ad-hoc discussion. It isn't nothing, but it is co-located in code and requires no additional tooling over the Sphinx setup that we already have.

While we will still need to make non-trivial changes to the asciidoc parts of the client manual to align its content with the new API design, I do agree that the majority of the change is contained to things like this.

Contributor

Of course, the co-location of docs and code does mean that a doc bug cannot be fixed without a code release. That can create some annoying situations, which will also contribute to an increased release frequency.

For example, if we have the wrong description for a parameter here, we could only correct it with a new release of the client itself.

Contributor

Well, we could regenerate and republish our online docs, but the in-code IDE help would require a Python package release.

Contributor Author

This code was actually AI-generated, so the initial burden is somewhat lessened. It is not perfect and needs to be improved. I wonder if we could move that documentation from our docs into the actual Java code that defines the parameters. That way we could pull it in more easily when generating the clients.

Comment on lines 257 to 276
@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]


@dataclass(frozen=True, repr=True)
class WccWriteResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    write_millis: int
    post_processing_millis: int
    node_properties_written: int
    configuration: dict[str, Any]
Contributor

We could consider some static code reuse scheme here, like inheritance, for shared fields.
Or we just have good tests to make sure we don't misspell or forget any field.
We used class inheritance for these classes in Java, but while it gave us type safety it does convolute the code when each level just adds one or two fields and one has to navigate to super classes.
The latest approach of using Java records that implement Java interfaces to make sure that shared field names are the same and are present is one I like quite a lot. I wonder if we can do a similar thing in Python.
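A hedged sketch of how the record-plus-interface idea could translate to Python, using a typing.Protocol so that a type checker verifies the shared field names without runtime inheritance (class and function names below are illustrative, not from this PR):

```python
# Illustrative sketch: a Protocol plays the role of the Java interface, and
# mypy verifies that each result dataclass provides the shared fields.
from dataclasses import dataclass
from typing import Any, Dict, Protocol


class TimedResult(Protocol):
    @property
    def pre_processing_millis(self) -> int: ...
    @property
    def compute_millis(self) -> int: ...
    @property
    def post_processing_millis(self) -> int: ...


@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: Dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: Dict[str, Any]


# The annotation below is checked statically; a misspelled or missing shared
# field would surface as a type error rather than a runtime bug.
def _shared_fields_check(result: WccStatsResult) -> TimedResult:
    return result
```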

Contributor

The general API design here does follow a structured scheme though:

write extends stats, writeCommon
mutate extends stats, mutateCommon

stats extends statsCommon

algoStats extends stats, algoCommon
algoWrite extends algoStats, writeCommon
algoMutate extends algoStats, mutateCommon

But this is mostly a code hygiene effort that can be maintained in multiple ways.

Contributor Author

Since these do not change very often any more, if ever, I think I prefer to keep them separate. We could consider some type of static code analysis to make sure things are consistent.
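For reference, a sketch of what such a consistency check could look like as a plain test; the import path is an assumption for illustration, not taken from this PR:

```python
# Hypothetical consistency test: verify the shared fields exist on every WCC
# result dataclass without introducing inheritance. The import path below is
# an assumption for illustration.
import dataclasses

from graphdatascience.procedure_surface.api.wcc_endpoints import (  # assumed path
    WccMutateResult,
    WccStatsResult,
    WccWriteResult,
)

SHARED_FIELDS = {"pre_processing_millis", "compute_millis", "post_processing_millis", "configuration"}


def test_wcc_results_share_common_fields() -> None:
    for cls in (WccStatsResult, WccMutateResult, WccWriteResult):
        field_names = {f.name for f in dataclasses.fields(cls)}
        missing = SHARED_FIELDS - field_names
        assert not missing, f"{cls.__name__} is missing shared fields: {missing}"
```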

@Mats-SX (Contributor) left a comment

In general I like this approach and I believe it will lead to a good product.

I'm a little concerned that we are coupling the two, potentially orthogonal, topics of

  • support calling operations on Arrow V2 endpoints
  • deliver a more Pythonic API experience

when we could deliver only one of them instead. I don't see why it is logically necessary to do both at the same time.

Perhaps the idea is pragmatic? "The cost of doing X on its own is C, the cost of doing Y on its own is D, but the cost of doing X and Y at the same time is E < C + D"?

When we couple the two, we make the 'cons' of either affect both -- for example, users cannot get Arrow V2 without accepting the Python API experience. And the Arrow V2 support is not a feature for the user, just an infrastructure change. Maybe with better stability, but otherwise an 'invisible' change.

Comment on lines +258 to +264
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]
Contributor

I understand what you mean here about being able to accept camel-cased subscripting to access these fields, similar to before.

But I think that we shouldn't do that unless we also allow camel-cased inputs. And I think we shouldn't do either of those things unless we also make sure to make the whole API non-breaking.
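For context, a minimal sketch of what camel-cased subscript access on the new result classes could look like (a hypothetical backward-compatibility shim, not something this PR adds):

```python
# Hypothetical backward-compatibility shim: allow result["componentCount"]-style
# access on a snake_case dataclass. Not part of this PR.
import re
from dataclasses import dataclass
from typing import Any, Dict


def _camel_to_snake(name: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: Dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: Dict[str, Any]

    def __getitem__(self, key: str) -> Any:
        # accepts both "componentCount" and "component_count"
        return getattr(self, _camel_to_snake(key))
```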
