
Create explicit WCC endpoints #859


Draft
DarthMax wants to merge 12 commits into main from create_explicit_procedure_endpoints_for_wcc

Conversation

@DarthMax (Contributor) commented Mar 28, 2025

Fixes GDSA-47

This PR showcases the changes we would like to make in order to support running all algorithms via Arrow instead of Bolt for GDS Sessions.

It covers 3 things:

  • Introduction of explicit algorithm endpoints, e.g. a well-defined method for wcc.write that exposes all available parameters. This improves discoverability and coding UX, and makes it easier to implement different backends.
  • A Cypher-based implementation for these endpoints
  • An Arrow-based implementation for these endpoints

While this should be mostly transparent from a user perspective, this PR proposes some API changes that would be breaking; see the comments.
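For illustration, a rough sketch of how such an explicit endpoint might be used. The connection details and graph name are placeholders, and the call shape follows the WCC signatures and result dataclasses proposed in this PR, so treat it as a sketch rather than final API:

```python
# Illustrative sketch only: connection details and graph name are placeholders,
# and the endpoint shape follows the WCC signatures proposed in this PR.
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))
G = gds.graph.get("myGraph")  # an already-projected graph

# Explicit, typed endpoint: all parameters are named and discoverable,
# and the result is a typed object rather than a DataFrame row.
result = gds.wcc.write(
    G,
    write_property="componentId",
    relationship_types=["KNOWS"],
    concurrency=4,
)

print(result.component_count, result.write_millis)
```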

netlify bot commented Mar 28, 2025

Deploy Preview for neo4j-graph-data-science-client canceled.

🔨 Latest commit: 3bb24d8
🔍 Latest deploy log: https://app.netlify.com/sites/neo4j-graph-data-science-client/deploys/67e6cc10019b490008bf9a0f

@DarthMax force-pushed the create_explicit_procedure_endpoints_for_wcc branch from 3bb24d8 to 8554f63 on June 24, 2025 at 20:35
netlify bot commented Jun 24, 2025

Deploy Preview for neo4j-graph-data-science-client canceled.

🔨 Latest commit: ae2f8cb
🔍 Latest deploy log: https://app.netlify.com/projects/neo4j-graph-data-science-client/deploys/686403023670ee000890b6b1

    seed_property: Optional[str] = None,
    consecutive_ids: Optional[bool] = None,
    relationship_weight_property: Optional[str] = None,
) -> WccMutateResult:
@DarthMax (Contributor, author) commented Jun 24, 2025

This constitutes a breaking change. Previously we would have returned a DataFrame.
However, I think this API is much easier to use as it is stricter about types and should enable better IDE auto-completion.

Contributor

I also like the typed API more :)

Especially if we also offer a to_pandas utility (using the from_dict utility), mainly useful for display in Jupyter notebooks.

Contributor

I also prefer the typed API. I don't like breaking things, but I do like having a result type that we own and control, especially if it comes with static fields to help the user.

    computation_result["preProcessingMillis"],
    computation_result["computeMillis"],
    computation_result["postProcessingMillis"],
    0,
Contributor Author

TODO: measure the actual time

Contributor

Is this mutateMillis?

Contributor

Are we not measuring this already server-side?
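If the server does not already report this timing, a small client-side stopwatch could serve as a stopgap. A sketch only, assuming the Arrow response does not include the mutate timing:

```python
# Sketch of a client-side timing helper (assumption: the Arrow response does
# not include the mutate timing, so we measure wall-clock time around the call).
import time
from typing import Any, Callable, Tuple


def timed_millis(call: Callable[[], Any]) -> Tuple[Any, int]:
    """Run `call` and return its result together with the elapsed milliseconds."""
    start = time.monotonic()
    result = call()
    return result, int((time.monotonic() - start) * 1000)
```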

def write(
    self,
    G: Graph,
    write_property: str,
@DarthMax (Contributor, author) commented Jun 24, 2025

TODO: respect that property
This requires a change in the gds.arrow.write procedure in the database

@FlorentinD (Contributor) left a comment

Great start!

Left some comments.
I still tend to think we want to start a feature branch for this to work towards a 2.0 version of the client.

Comment on lines +20 to +25
from graphdatascience.arrow_client.arrow_authentication import ArrowAuthentication
from graphdatascience.arrow_client.arrow_info import ArrowInfo
from graphdatascience.retry_utils.retry_config import RetryConfig

from ..retry_utils.retry_utils import before_log
from ..version import __version__
Contributor

NIT: mixing import styles (absolute and relative)

Comment on lines +32 to +55
def dict_to_dataclass(data: Dict[str, Any], cls: Type[T], strict: bool = False) -> T:
    """
    Convert a dictionary to a dataclass instance with nested dataclass support.
    """
    if not dataclasses.is_dataclass(cls):
        raise ValueError(f"{cls} is not a dataclass")

    field_dict = {f.name: f for f in fields(cls)}
    filtered_data = {}

    for key, value in data.items():
        if key in field_dict:
            field = field_dict[key]
            field_type = field.type

            # Handle nested dataclasses
            if dataclasses.is_dataclass(field_type) and isinstance(value, dict):
                filtered_data[key] = DataMapper.dict_to_dataclass(value, field_type, strict)  # type: ignore
            else:
                filtered_data[key] = value
        elif strict:
            raise ValueError(f"Extra field '{key}' not allowed in {cls.__name__}")

    return cls(**filtered_data)  # type: ignore
Contributor

For this validation we could also consider using pydantic instead of rolling our own custom validation.
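A minimal sketch of the pydantic alternative, assuming pydantic v2; the field names mirror the WccMutateResult proposed in this PR:

```python
# Sketch only: pydantic v2 models could replace the hand-rolled
# dict_to_dataclass validation, including strict/extra-field handling.
from typing import Any, Dict

from pydantic import BaseModel, ConfigDict


class WccMutateResult(BaseModel):
    # extra="forbid" mirrors the strict=True behaviour above; frozen=True
    # mirrors the frozen dataclasses.
    model_config = ConfigDict(extra="forbid", frozen=True)

    component_count: int
    component_distribution: Dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: Dict[str, Any]


# Nested models and type coercion come for free:
# result = WccMutateResult.model_validate(raw_response_dict)
```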

from pandas import DataFrame

from ...graph.graph_object import Graph

Contributor

NIT: for the package name we could rather use endpoints

Comment on lines +245 to +254
@dataclass(frozen=True, repr=True)
class WccMutateResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: dict[str, Any]
Contributor

Would be neat to generate these :)

Contributor

Do you mean generating the result classes? Do you mean generating them statically before release, or generating them at runtime?

Or do you mean generating configuration classes, and using those instead of the dict[str, Any]?

@FlorentinD (Contributor) commented Jul 9, 2025

I was thinking statically: checked-in code even, plus a check that they are up to date.
Ideally both, but we'll see if it's worth it.

Contributor

So you mean the result class, and not configuration classes then?

@DarthMax (Contributor, author) commented Jul 1, 2025

I added a feature flag that enables the explicit APIs.

Comment on lines +18 to +32
def mutate(
    self,
    G: Graph,
    mutate_property: str,
    threshold: Optional[float] = None,
    relationship_types: Optional[List[str]] = None,
    node_labels: Optional[List[str]] = None,
    sudo: Optional[bool] = None,
    log_progress: Optional[bool] = None,
    username: Optional[str] = None,
    concurrency: Optional[Any] = None,
    job_id: Optional[Any] = None,
    seed_property: Optional[str] = None,
    consecutive_ids: Optional[bool] = None,
    relationship_weight_property: Optional[str] = None,
Contributor

This is a very different approach from the design principle we used for the 1.x client. Diametrically different. It is somewhat risky. I think we should hand out an early version of this API to our internal stakeholders in the field to gather feedback.

Personally, I like the explicit and Pythonic design. I believe it will be easier to use on its own. It will be somewhat more expensive to maintain and we will have to do more releases in general, but those costs will be worth it if our user base agrees with our perception that this is a better API.

Comment on lines +34 to +71
"""
Executes the WCC algorithm and writes the results to the in-memory graph as node properties.

Parameters
----------
G : Graph
The graph to run the algorithm on
mutate_property : str
The property name to store the component ID for each node
threshold : Optional[float], default=None
The minimum required weight to consider a relationship during traversal
relationship_types : Optional[List[str]], default=None
The relationship types to project
node_labels : Optional[List[str]], default=None
The node labels to project
sudo : Optional[bool], default=None
Run analysis with admin permission
log_progress : Optional[bool], default=None
Whether to log progress
username : Optional[str], default=None
The username to attribute the procedure run to
concurrency : Optional[Any], default=None
The number of concurrent threads
job_id : Optional[Any], default=None
An identifier for the job
seed_property : Optional[str], default=None
Defines node properties that are used as initial component identifiers
consecutive_ids : Optional[bool], default=None
Flag to decide whether component identifiers are mapped into a consecutive id space
relationship_weight_property : Optional[str], default=None
The property name that contains weight

Returns
-------
WccMutateResult
Algorithm metrics and statistics
"""
pass
Contributor

I suppose this is the "extended documentation burden" that I was highlighting in the ad-hoc discussion. It isn't nothing, but it is co-located in code and requires no additional tooling over the Sphinx setup that we already have.

While we will still need to make non-trivial changes to the asciidoc parts of the client manual to align its content with the new API design, I do agree that the majority of the change is contained to things like this.

Contributor

Of course, the co-location of docs and code does mean that a doc bug cannot be fixed without a code release. That can create some annoying situations, which will also contribute to an increased release frequency.

For example, if we have the wrong description for a parameter here, we could only correct it with a new release of the client itself.

Contributor

Well, we could regenerate and republish our online docs, but the in-code IDE help would require a Python package release.

Contributor Author

This code was actually AI-generated, so the initial burden is somewhat lessened. It is not perfect and needs to be improved. I wonder if we could move that documentation from our docs into the actual Java code that defines the parameters. That way we could pull it in more easily when generating the clients.

Comment on lines 257 to 276
@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]


@dataclass(frozen=True, repr=True)
class WccWriteResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    write_millis: int
    post_processing_millis: int
    node_properties_written: int
    configuration: dict[str, Any]
Contributor

We could consider some static code reuse scheme here, like inheritance, for shared fields.
Or we just have good tests to make sure we don't misspell or forget any field.
We used class inheritance for these classes in Java, but while it gave us type safety it does convolute the code when each level just adds one or two fields and one has to navigate to super classes.
The latest approach of using Java records that implement Java interfaces to make sure that shared field names are the same and are present is one I like quite a lot. I wonder if we can do a similar thing in Python.
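A hedged sketch of how the record-plus-interface idea could translate to Python, using a typing.Protocol so that a type checker verifies the shared field names without runtime inheritance (class and function names below are illustrative, not from this PR):

```python
# Illustrative sketch: a Protocol plays the role of the Java interface, and
# mypy verifies that each result dataclass provides the shared fields.
from dataclasses import dataclass
from typing import Any, Dict, Protocol


class TimedResult(Protocol):
    @property
    def pre_processing_millis(self) -> int: ...
    @property
    def compute_millis(self) -> int: ...
    @property
    def post_processing_millis(self) -> int: ...


@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: Dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: Dict[str, Any]


# The annotation below is checked statically; a misspelled or missing shared
# field would surface as a type error rather than a runtime bug.
def _shared_fields_check(result: WccStatsResult) -> TimedResult:
    return result
```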

Contributor

The general API design here does follow a structured scheme though:

write extends stats, writeCommon
mutate extends stats, mutateCommon

stats extends statsCommon

algoStats extends stats, algoCommon
algoWrite extends algoStats, writeCommon
algoMutate extends algoStats, mutateCommon

But this is mostly a code hygiene effort that can be maintained in multiple ways.

Contributor Author

Since these do not change very often any more, if ever, I think I prefer to keep them separate. We could consider some type of static code analysis to make sure things are consistent.
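For reference, a sketch of what such a consistency check could look like as a plain test; the import path is an assumption for illustration, not taken from this PR:

```python
# Hypothetical consistency test: verify the shared fields exist on every WCC
# result dataclass without introducing inheritance. The import path below is
# an assumption for illustration.
import dataclasses

from graphdatascience.procedure_surface.api.wcc_endpoints import (  # assumed path
    WccMutateResult,
    WccStatsResult,
    WccWriteResult,
)

SHARED_FIELDS = {"pre_processing_millis", "compute_millis", "post_processing_millis", "configuration"}


def test_wcc_results_share_common_fields() -> None:
    for cls in (WccStatsResult, WccMutateResult, WccWriteResult):
        field_names = {f.name for f in dataclasses.fields(cls)}
        missing = SHARED_FIELDS - field_names
        assert not missing, f"{cls.__name__} is missing shared fields: {missing}"
```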

@Mats-SX (Contributor) left a comment

In general I like this approach and I believe it will lead to a good product.

I'm a little concerned that we are coupling the two, potentially orthogonal, topics of

  • support calling operations on Arrow V2 endpoints
  • deliver a more Pythonic API experience

when we could deliver only one of them instead. I don't see why it is logically necessary to do both at the same time.

Perhaps the idea is pragmatic? "The cost of doing X on its own is C, the cost of doing Y on its own is D, but the cost of doing X and Y at the same time is E < C + D"?

When we couple the two, we make the 'cons' of either affect both -- for example, users cannot get Arrow V2 without accepting the Python API experience. And the Arrow V2 support is not a feature for the user, just an infrastructure change. Maybe with better stability, but otherwise an 'invisible' change.

Comment on lines +258 to +264
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]
Contributor

I understand what you mean here about being able to accept camel-cased subscripting to access these fields, similar to before.

But I think that we shouldn't do that unless we also allow camel-cased inputs. And I think we shouldn't do either of those things unless we also make sure to make the whole API non-breaking.
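For context, a minimal sketch of what camel-cased subscript access on the new result classes could look like (a hypothetical backward-compatibility shim, not something this PR adds):

```python
# Hypothetical backward-compatibility shim: allow result["componentCount"]-style
# access on a snake_case dataclass. Not part of this PR.
import re
from dataclasses import dataclass
from typing import Any, Dict


def _camel_to_snake(name: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: Dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: Dict[str, Any]

    def __getitem__(self, key: str) -> Any:
        # accepts both "componentCount" and "component_count"
        return getattr(self, _camel_to_snake(key))
```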
