Create explicit WCC endpoints #859
Conversation
    seed_property: Optional[str] = None,
    consecutive_ids: Optional[bool] = None,
    relationship_weight_property: Optional[str] = None,
) -> WccMutateResult:
This constitutes a breaking change. Before, we would have returned a DataFrame.
However, I think this API is much easier to use, as it is stricter on the types and should enable better IDE auto-completion.
I also like the typed API more :)
Especially if we also offer a to_pandas utility (using the from_dict utility), mainly useful for display in Jupyter notebooks.
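For illustration, a minimal sketch of what such a to_pandas utility could look like (the method name and the asdict-based implementation are assumptions, not part of this PR):

```python
# Hypothetical sketch: a to_pandas utility on the result dataclass,
# mainly for display in Jupyter notebooks.
from dataclasses import asdict, dataclass
from typing import Any

from pandas import DataFrame


@dataclass(frozen=True)
class WccMutateResult:
    component_count: int
    compute_millis: int
    configuration: dict[str, Any]

    def to_pandas(self) -> DataFrame:
        # One-row frame with the result fields as columns.
        return DataFrame([asdict(self)])
```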
I also prefer the typed API. I don't like breaking things, but I do like having a result type that we own and control, especially if it comes with static fields to help the user.
    computation_result["preProcessingMillis"],
    computation_result["computeMillis"],
    computation_result["postProcessingMillis"],
    0,
TODO: measure the actual time
Is this `mutateMillis`?
Are we not measuring this already server-side?
def write(
    self,
    G: Graph,
    write_property: str,
TODO: respect that property. This requires a change in the `gds.arrow.write` procedure in the database.
Great start! Left some comments.
I still tend to think we want to start a feature branch for this, to work towards a 2.0 version of the client.
from graphdatascience.arrow_client.arrow_authentication import ArrowAuthentication
from graphdatascience.arrow_client.arrow_info import ArrowInfo
from graphdatascience.retry_utils.retry_config import RetryConfig

from ..retry_utils.retry_utils import before_log
from ..version import __version__
NIT: mixing import styles
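For illustration, the same imports in one consistent relative style; a minimal sketch, assuming the module lives inside the graphdatascience package:

```python
# Hypothetical: the imports above, written consistently relative.
from ..arrow_client.arrow_authentication import ArrowAuthentication
from ..arrow_client.arrow_info import ArrowInfo
from ..retry_utils.retry_config import RetryConfig
from ..retry_utils.retry_utils import before_log
from ..version import __version__
```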
def dict_to_dataclass(data: Dict[str, Any], cls: Type[T], strict: bool = False) -> T:
    """
    Convert a dictionary to a dataclass instance with nested dataclass support.
    """
    if not dataclasses.is_dataclass(cls):
        raise ValueError(f"{cls} is not a dataclass")

    field_dict = {f.name: f for f in fields(cls)}
    filtered_data = {}

    for key, value in data.items():
        if key in field_dict:
            field = field_dict[key]
            field_type = field.type

            # Handle nested dataclasses
            if dataclasses.is_dataclass(field_type) and isinstance(value, dict):
                filtered_data[key] = DataMapper.dict_to_dataclass(value, field_type, strict)  # type: ignore
            else:
                filtered_data[key] = value
        elif strict:
            raise ValueError(f"Extra field '{key}' not allowed in {cls.__name__}")

    return cls(**filtered_data)  # type: ignore
For this validation we could also think of using pydantic instead of rolling our own custom validation.
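A minimal sketch of what the pydantic variant could look like (assuming pydantic v2; the model mirrors the WccMutateResult dataclass from this PR):

```python
# Hypothetical sketch: the same mapping via pydantic instead of a hand-rolled mapper.
from typing import Any

from pydantic import BaseModel, ConfigDict


class WccMutateResult(BaseModel):
    # frozen + extra="forbid" matches the frozen dataclass and the strict mode above
    model_config = ConfigDict(frozen=True, extra="forbid")

    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: dict[str, Any]


# Validation, nested models, and extra-field rejection come for free:
# result = WccMutateResult(**raw_dict)
```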
from pandas import DataFrame

from ...graph.graph_object import Graph
NIT: for the package name, we could rather have `endpoints`.
@dataclass(frozen=True, repr=True)
class WccMutateResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: dict[str, Any]
Would be neat to generate these :)
Do you mean generating the result class? Do you mean generating them statically before release, or generating them at runtime?
Or do you mean generating configuration classes, and using those instead of the dict[str, Any]?
I was thinking statically; checked-in code even, plus a check that they are up-to-date.
Ideally both, but we will see if it's worth it.
So you mean the result class, and not configuration classes then?
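For reference, a hypothetical sketch of such an up-to-date check; the stand-in class, the helper, and the recorded keys are assumptions for illustration:

```python
import re
from dataclasses import dataclass, fields
from typing import Any


# Stand-in for the checked-in result class from this PR; a real test
# would import it from the client package instead.
@dataclass(frozen=True, repr=True)
class WccMutateResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    mutate_millis: int
    node_properties_written: int
    configuration: dict[str, Any]


def camel_to_snake(name: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def test_wcc_mutate_result_is_up_to_date() -> None:
    # In a real test, these keys would come from a recorded server response.
    server_keys = {
        "componentCount", "componentDistribution", "preProcessingMillis",
        "computeMillis", "postProcessingMillis", "mutateMillis",
        "nodePropertiesWritten", "configuration",
    }
    assert {camel_to_snake(k) for k in server_keys} == {f.name for f in fields(WccMutateResult)}
```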
I added a feature flag that enables the explicit APIs.
def mutate(
    self,
    G: Graph,
    mutate_property: str,
    threshold: Optional[float] = None,
    relationship_types: Optional[List[str]] = None,
    node_labels: Optional[List[str]] = None,
    sudo: Optional[bool] = None,
    log_progress: Optional[bool] = None,
    username: Optional[str] = None,
    concurrency: Optional[Any] = None,
    job_id: Optional[Any] = None,
    seed_property: Optional[str] = None,
    consecutive_ids: Optional[bool] = None,
    relationship_weight_property: Optional[str] = None,
This is a very different approach from the design principle we used for the 1.x client. Diametrically different. It is somewhat risky. I think we should hand out an early version of this API to our internal stakeholders in the field to gather feedback.
Personally, I like the explicit and pythonic design. I believe it will be easier to use on its own. It will be somewhat more expensive to maintain and we will have to do more releases in general, but those costs will be worth it if our user base agrees with our perception that this is a better API.
""" | ||
Executes the WCC algorithm and writes the results to the in-memory graph as node properties. | ||
|
||
Parameters | ||
---------- | ||
G : Graph | ||
The graph to run the algorithm on | ||
mutate_property : str | ||
The property name to store the component ID for each node | ||
threshold : Optional[float], default=None | ||
The minimum required weight to consider a relationship during traversal | ||
relationship_types : Optional[List[str]], default=None | ||
The relationship types to project | ||
node_labels : Optional[List[str]], default=None | ||
The node labels to project | ||
sudo : Optional[bool], default=None | ||
Run analysis with admin permission | ||
log_progress : Optional[bool], default=None | ||
Whether to log progress | ||
username : Optional[str], default=None | ||
The username to attribute the procedure run to | ||
concurrency : Optional[Any], default=None | ||
The number of concurrent threads | ||
job_id : Optional[Any], default=None | ||
An identifier for the job | ||
seed_property : Optional[str], default=None | ||
Defines node properties that are used as initial component identifiers | ||
consecutive_ids : Optional[bool], default=None | ||
Flag to decide whether component identifiers are mapped into a consecutive id space | ||
relationship_weight_property : Optional[str], default=None | ||
The property name that contains weight | ||
|
||
Returns | ||
------- | ||
WccMutateResult | ||
Algorithm metrics and statistics | ||
""" | ||
pass |
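For illustration, a hypothetical usage sketch of this typed endpoint; the `gds` handle, the graph `G`, and the attribute path `gds.wcc.mutate` are assumptions:

```python
# Hypothetical call against the explicit WCC endpoint proposed in this PR.
result = gds.wcc.mutate(
    G,
    mutate_property="componentId",
    relationship_weight_property="weight",
    threshold=0.5,
)

# The typed result replaces the old DataFrame: fields are statically known,
# so IDEs can auto-complete them.
print(result.component_count)
print(result.compute_millis)
```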
I suppose this is the "extended documentation burden" that I was highlighting in the ad-hoc discussion. It isn't nothing, but it is co-located in code and requires no additional tooling over the Sphinx setup that we already have.
While we will still need to make non-trivial changes to the AsciiDoc parts of the client manual to align its content with the new API design, I do agree that the majority of the change is contained to things like this.
Of course, the co-location of docs and code does mean that a doc bug cannot be fixed without a code release. That can create some annoying situations, which will also contribute to an increased release frequency.
For example, if we have the wrong description for a parameter here, we could only correct it with a new release of the client itself.
Well, we could regenerate and republish our online docs, but the inline IDE help would require a Python package release.
This code was actually AI generated, so the initial burden is somewhat lessened. It is not perfect and needs to be improved. I wonder if we could move that documentation from our docs into the actual Java code that defines the parameters. That way we could pull it in more easily when generating the clients.
@dataclass(frozen=True, repr=True)
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]


@dataclass(frozen=True, repr=True)
class WccWriteResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    write_millis: int
    post_processing_millis: int
    node_properties_written: int
    configuration: dict[str, Any]
We could consider some static code reuse scheme here, like inheritance, for shared fields.
Or we just have good tests to make sure we don't misspell or forget any field.
We used class inheritance for these classes in Java, but while it gave us type safety, it does convolute the code when each level just adds one or two fields and one has to navigate to super classes.
The latest approach of using Java records that implement Java interfaces, to make sure that shared field names are the same and are present, is one I like quite a lot. I wonder if we can do a similar thing in Python.
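A minimal sketch of a possible Python analogue, assuming typing.Protocol plays the role of the Java interface (checked statically by mypy, not at runtime):

```python
from dataclasses import dataclass
from typing import Any, Protocol


class TimedResult(Protocol):
    # Shared field names, declared once; concrete results satisfy this structurally.
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int


@dataclass(frozen=True)
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]


def log_timings(result: TimedResult) -> None:
    # mypy verifies that WccStatsResult provides the shared fields; no inheritance needed.
    print(result.pre_processing_millis, result.compute_millis, result.post_processing_millis)
```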
The general API design here does follow a structured scheme though:
- write extends stats, writeCommon
- mutate extends stats, mutateCommon
- stats extends statsCommon
- algoStats extends stats, algoCommon
- algoWrite extends algoStats, writeCommon
- algoMutate extends algoStats, mutateCommon

But this is mostly a code hygiene effort that can be maintained in multiple ways.
Since these do not change very often any more, if ever, I think I prefer to keep them separate. We could consider some type of static code analysis to make sure things are consistent.
In general I like this approach and I believe it will lead to a good product.
I'm a little concerned that we are coupling the two, potentially orthogonal, topics of
- support calling operations on Arrow V2 endpoints
- deliver a more Pythonic API experience
when we could deliver only one of them instead. I don't see why it is logically necessary to do both at the same time.
Perhaps the idea is pragmatic? "The cost of doing X on its own is C, the cost of doing Y on its own is D, but the cost of doing X and Y at the same time is E < C + D"?
When we couple the two, we make the 'cons' of either affect both -- for example, users cannot get Arrow V2 without accepting the Python API experience. And the Arrow V2 support is not a feature for the user, just an infrastructure change. Maybe with better stability, but otherwise an 'invisible' change.
class WccStatsResult:
    component_count: int
    component_distribution: dict[str, Any]
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int
    configuration: dict[str, Any]
I understand what you mean here about accepting camel-cased subscripting to access these fields, similar to before.
But I think that we shouldn't do that unless we also allow the camel-cased inputs. And I think we shouldn't do either of those things unless we also make sure to make the whole API non-breaking.
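For illustration, a hedged sketch of what such camel-cased subscripting could look like (the __getitem__ shim and the helper are assumptions, not part of this PR):

```python
import re
from dataclasses import dataclass


def _camel_to_snake(key: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()


@dataclass(frozen=True)
class WccStatsResult:
    component_count: int
    pre_processing_millis: int
    compute_millis: int
    post_processing_millis: int

    def __getitem__(self, key: str):
        # result["componentCount"] keeps working like the old DataFrame access.
        return getattr(self, _camel_to_snake(key))
```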
Fixes GDSA-47
This PR showcases the changes we would like to do in order to support running all algorithms via Arrow instead of Bolt for GDS Sessions.
It covers 3 things:
While this should be mostly transparent from a user perspective, this PR proposes some API changes that would be breaking; see the comments.