Modeling designs page review #451

Open: wants to merge 14 commits into base: dev

lidiazuin
Contributor

No description provided.

@lidiazuin lidiazuin marked this pull request as ready for review April 7, 2025 13:25
Contributor

@AlexicaWright AlexicaWright left a comment

Lots of comments, but one overall thing is the structure. It is common practice to start with the smallest building blocks and then build up, meaning that I would start with presenting the different structures and then put them together, rather than starting with a graph consisting of multiple structures.
This is important content, thank you for curating it @lidiazuin.

However, Neo4j allows you to effortlessly adjust detailed and broad changes across pieces or the entirety of the graph.
Whether it is small changes over time or a broad definition that includes a variety of needed information about your entities, the database is able to handle it.
It is simply up to the developers and architects to determine the structure of the data model and how to define entities for queries.
The example shows how different structures can be combined into one graph, and how different types of questions can be answered by one single graph if you use the correct design, depending on the question.
Contributor

This sounds a bit awkward: "different types of questions can be answered by a single graph if you use the correct design, depending on the question."

One of the earliest decisions you may encounter is whether to model something as a property on a node or as a relationship to a separate node.
Take, for example, the data below modeling a movie genre as a property on the `Movie` node.
A monopartite graph structure consists of a single set of nodes with a single label.
Most algorithms rely on this type of graph and they are very common when using spanning trees, link:{docs-home}/cypher-manual/current/patterns/shortest-paths/[shortest paths], and link:{docs-home}/graph-data-science/current/algorithms/community/[community detection].
Contributor

Shortest path can be run on any type of graph, including bipartite and multipartite ones. Maybe use another example, like PageRank or Eigenvector?
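
For context, a PageRank run over a monopartite projection could be sketched as follows. This assumes the Graph Data Science plugin is installed; the graph name `people`, the `Person` label, the `KNOWS` relationship type, and the `name` property are illustrative and not taken from the page under review.

```cypher
// Assumption: GDS is installed and the database holds (:Person)-[:KNOWS]->(:Person) data.
// Project a monopartite in-memory graph: one node label, one relationship type.
CALL gds.graph.project('people', 'Person', 'KNOWS');

// Stream PageRank scores and map node ids back to nodes.
CALL gds.pageRank.stream('people')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC;
```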

lidiazuin and others added 2 commits April 8, 2025 12:37
Co-authored-by: Jessica Wright <49636617+AlexicaWright@users.noreply.github.com>
@lidiazuin lidiazuin requested a review from AlexicaWright April 8, 2025 12:06
This is an improvement, but the model is still not optimal and should be iterated on.
When changes are done to your model, it is important to keep track of them by versioning your model.

// Content on versioning is WIP.
Contributor Author

Please refer to #465

@lidiazuin lidiazuin requested a review from nmervaillie April 8, 2025 13:45
Contributor

@AlexicaWright AlexicaWright left a comment

Well done! I like the new structure! Left some more comments, sorry.. ;)


If you plan to do analysis on individual items and return only details about that entity (like genres on a particular movie), then the first data model would serve perfectly well for your needs.
However, if you need to run analysis to find common ground between entities or look at a group of nodes, then the second data model would definitely improve performance of those types of queries.
Instead, you could either turn the `Role` node into a property of the `WORKED_AT` relationship or use an *intermediate node* between the `Person`, `Company`, and `Role` nodes:
Contributor

@AlexicaWright AlexicaWright May 20, 2025

Given that we talk about hyperedges and not about creating relationships from relationships, I think we should stick to using an example that shows a relationship between more than two nodes.

If so, being able to use more than one data model is a great solution!
The division and length of the time periods are set based on the context of the question you need to answer.
On the top of the timeline tree is the "all time node" that represents the entire timeline.
The timeline is then divided into relevant time periods, represented by the nodes below the all time node.
Contributor

I think we're missing an important piece of information here. This (from the KB article):
"The rest of the data nodes, which are non-time data nodes, are the nodes that contain the important pieces of data in the graph. These nodes link into the timeline tree at the appropriate leaf node."

With this xref:data-modeling/tutorial-refactoring.adoc[refactored design], you can look only at the relationships for dates you care about and then scan for the relevant airline through the `airline` property in the `Flight` node.

This is an improvement, but the model is still not optimal and should be iterated on.
When changes are done to your model, it is important to keep track of them by versioning your model.
Contributor

Why is it important to keep track of versions of your model?

The example shows how different structures can be combined into one graph, and how different types of questions can be answered by one single graph if the modeling is done efficiently.
It also shows the many ways nodes and structures can span out.

=== Monopartite
Contributor

This feels a little abrupt? Maybe it can be introduced with a little sentence about different graph structures or something?

Co-authored-by: Jessica Wright <49636617+AlexicaWright@users.noreply.github.com>
Member

@nmervaillie nmervaillie left a comment

Very useful content overall. I just see an issue with the time tree, see comments

However, Neo4j allows you to effortlessly adjust detailed and broad changes across pieces or the entirety of the graph.
Whether it is small changes over time or a broad definition that includes a variety of needed information about your entities, the database is able to handle it.
It is simply up to the developers and architects to determine the structure of the data model and how to define entities for queries.
image::hyperedge.svg[An example of a hyperedge in which a relationship is connected to two nodes, a feature not available in Neo4j,width=400,role=popup]
Member

I don't get why there is a `KNOWS` relationship here, whereas in the other diagram below there is an `IN_ROLE` relationship.

Contributor Author

Each section of the class (or offering) would then become an instance of the course.
Linked lists are commonly used in computer science and they are particularly useful whenever the sequence of objects matters.
In this data structure, a simple-linked list is where each node links to the next node only, whereas in a double-linked list, each node links both to the next and the previous node.
Neo4j *does not* support double-linked lists.
Member

Technically, nothing prevents you from doing this.
However, this is redundant information, as one relationship implies the other. They are semantically equivalent.
There is no performance difference between traversing in one direction or the other.
I would rephrase this as "not recommended".
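
To illustrate the point, a single `NEXT` relationship can be followed in either direction, so a second "previous" relationship adds no information. A minimal sketch, assuming episodes are stored with an `Episode` label and a `number` property (both illustrative names):

```cypher
// Assumption: (:Episode {number})-[:NEXT]->(:Episode) forms a single-linked list.
MATCH (current:Episode {number: 3})
// Follow NEXT forwards to find the following episode...
OPTIONAL MATCH (current)-[:NEXT]->(next:Episode)
// ...and the same relationship type backwards to find the preceding one.
OPTIONAL MATCH (current)<-[:NEXT]-(previous:Episode)
RETURN previous.number AS previous, next.number AS next;
```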

=== Interleaved linked list

There are different ways to sequence a list of items.
When nodes are connected in a nonlinear fashion, they are referred to as an interleaved linked list.
Member

Not sure I get the question.
I don't see this as linear or not. Both relationship types just represent different information, sharing pieces of content

The tradeoff is that now you will need to maintain two models.
Each time you create a new node or relationship or update pieces of the graph, you will need to make changes to accommodate both models.
This can also impact query performance, as you might have double the syntax needed to update each model.
== Timeline tree
Member

The whole time tree section needs to be removed.
It used to be useful when Neo4j did not have indexes on dates. This is no longer the case, so the pattern is obsolete.
Not sure why it ended up in the PS KB in the first place.
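
For reference, the index-based alternative this comment describes might look like the sketch below. The `date` property and the index name `flight_date` are assumptions; `Flight` and `airline` come from the quoted text, and the airline value is a placeholder.

```cypher
// Assumption: Flight.date holds a Cypher DATE value.
CREATE RANGE INDEX flight_date IF NOT EXISTS
FOR (f:Flight) ON (f.date);

// Filter on a date range that the index can serve, then on the airline property.
MATCH (f:Flight)
WHERE f.date >= date('2025-01-01')
  AND f.date < date('2025-02-01')
  AND f.airline = 'Example Air'
RETURN f;
```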


If you were, instead, working with a data model that uses time as the navigator instead of as the anchor, this same question would require a great deal of property lookups and a lot of inefficient gather-and-inspect.

== Time-bound data
Member

Same as for the time tree: this is no longer true now that date indexes are available.

The example shows how different structures can be combined into one graph, and how different types of questions can be answered by one single graph if the modeling is done efficiently.
It also shows the many ways nodes and structures can span out.

=== Monopartite
Member

I'm not sure monopartite / bipartite / multipartite brings value here.
I don't see this as a pattern, as people will choose one or the other representation according to the use case. There is no benefit of one over the others. I would see this more as a GDS topic, where a given shape must be used according to the algorithm.

Contributor

@AlexicaWright AlexicaWright left a comment

And there, Submit review.. ;)


If you plan to do analysis on individual items and return only details about that entity (like genres on a particular movie), then the first data model would serve perfectly well for your needs.
However, if you need to run analysis to find common ground between entities or look at a group of nodes, then the second data model would definitely improve performance of those types of queries.
Instead, you could either turn the `Role` node into a property of the `WORKED_AT` relationship or use an *intermediate node* between the `Person`, `Company`, and `Role` nodes:
Contributor

In order to illustrate a hyperedge, the same relationship needs to be used to connect one node to two or more nodes.
So you need to make the WORKED_AT and IN_ROLE one and the same relationship and it needs to connect the person node to both the company and the role nodes.
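
Since a Neo4j relationship always connects exactly two nodes, the intermediate-node alternative mentioned in the quoted text could be sketched roughly like this. The `Employment` label, the `HAD` and `AT` relationship types, and all property values are hypothetical; `Person`, `Company`, `Role`, and `IN_ROLE` are the names used in the discussion.

```cypher
// Hypothetical data: the Employment node stands in for the hyperedge
// that would otherwise link a person to both a company and a role.
CREATE (p:Person {name: 'Patrick'}),
       (c:Company {name: 'Acme'}),
       (r:Role {name: 'Engineer'}),
       (e:Employment {startDate: date('2001-01-01'), endDate: date('2005-12-31')}),
       (p)-[:HAD]->(e),
       (e)-[:AT]->(c),
       (e)-[:IN_ROLE]->(r);
```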

.Graph model of movies and their genres
image::modeling_genre_node-arr.svg[role="popup-link",400,400]
The use of intermediary nodes can also answer the question "Who worked at the same company at the same time?" as the added employment event contains information about when each individual worked at a certain company.
A `MATCH` clause would show that Patrick and David both worked at Acme, being colleagues from 2004 to 2005 since their employment events overlap during that time.
Contributor

Maybe we should have an example of this MATCH clause?
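
A sketch of what such a `MATCH` clause could look like is given here. It assumes a hypothetical model in which each `Person` is linked to an `Employment` node that carries `startDate` and `endDate` and points to a `Company`; the relationship types `HAD` and `AT` and the property names are illustrative, not taken from the page.

```cypher
// Hypothetical model: (:Person)-[:HAD]->(:Employment)-[:AT]->(:Company),
// with startDate and endDate stored as dates on Employment.
MATCH (p1:Person)-[:HAD]->(e1:Employment)-[:AT]->(c:Company),
      (p2:Person)-[:HAD]->(e2:Employment)-[:AT]->(c)
WHERE p1.name < p2.name                // report each pair only once
  AND e1.startDate <= e2.endDate       // the two employment periods overlap
  AND e2.startDate <= e1.endDate
RETURN p1.name AS person1, p2.name AS person2, c.name AS company,
       // The overlap starts at the later start date and ends at the earlier end date.
       CASE WHEN e1.startDate > e2.startDate THEN e1.startDate ELSE e2.startDate END AS overlapStart,
       CASE WHEN e1.endDate < e2.endDate THEN e1.endDate ELSE e2.endDate END AS overlapEnd;
```

Two date ranges overlap exactly when each one starts before the other ends, which is what the two date comparisons in the `WHERE` clause express.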

It is much simpler than our earlier version because it uses a natural, graph pattern (entity-relationship-entity) to find the information needed.
First, Cypher finds a movie and the genre it is related to, then looks for a second movie that is in that same genre.
Intermediate nodes can also add value to a model by providing a way to share data and thus reduce duplicate information.
In this example, Sarah sends an email to Lucy and copies David and Claire to it:
Contributor

Maybe we could mention that the content of the email is a property on every relationship?

(m2:Movie)-[:IN_GENRE]->(g)
RETURN m1, m2, g
----
By using a fan-out, duplication can be reduced as a property (`content`) is broken out and made into its own node (`Email` with the property `content`) instead of being repeated, in this case, on every relationship:
Contributor

Suggested change
By using a fan-out, duplication can be reduced as a property (`content`) is broken out and made into its own node (`Email` with the property `content`) instead of being repeated, in this case, on every relationship:
If you instead fan out the model, you reduce duplication by breaking out the property `content` from all relationships and turning it into the intermediary node `Email` instead.

If you plan to do analysis on individual items and return only details about that entity (like genres on a particular movie), then the first data model would serve perfectly well for your needs.
However, if you need to run analysis to find common ground between entities or look at a group of nodes, then the second data model would definitely improve performance of those types of queries.
Once the property value `content` is moved to a single node `Email`, it can be referenced via relationships with the `User` nodes that previously held that value.
Now there are no copies or duplications.
Contributor

Suggested change
Now there are no copies or duplications.
Now there are no duplications.
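
The fanned-out email model discussed in this thread could be sketched as follows. The relationship types `SENT`, `TO`, and `CC`, the `name` property, and the email text are assumptions made for illustration; `User`, `Email`, and `content` come from the quoted text.

```cypher
// Illustrative data: the content lives once on the Email node
// instead of being repeated on a relationship to every recipient.
CREATE (sarah:User {name: 'Sarah'}),
       (lucy:User {name: 'Lucy'}),
       (david:User {name: 'David'}),
       (claire:User {name: 'Claire'}),
       (email:Email {content: 'Hello'}),
       (sarah)-[:SENT]->(email),
       (email)-[:TO]->(lucy),
       (email)-[:CC]->(david),
       (email)-[:CC]->(claire);
```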

Inspecting the several properties of each `Flight` node could be expensive on resources.
* The order in which the episodes were aired using the `NEXT` relationship and through a simple-linked list.
* The order in which the episodes were produced using the `NEXT_IN_PRODUCTION` relationship, which creates an interleaved linked list.
It is not a linear list, as it goes 1, 3, 2, 5, 4.
Contributor

It is still linear since it goes from one node to the next in a linear fashion. That the sequence is not in numerical order is not the same as non-linear.
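
To make the two orderings concrete, here is a minimal sketch with five episodes that carry both chains. The `Episode` label and `number` property are assumed names; `NEXT` and `NEXT_IN_PRODUCTION` and the 1, 3, 2, 5, 4 production order come from the quoted text.

```cypher
// Illustrative data: each episode takes part in two linked lists at once.
CREATE (e1:Episode {number: 1}), (e2:Episode {number: 2}), (e3:Episode {number: 3}),
       (e4:Episode {number: 4}), (e5:Episode {number: 5}),
       // Airing order: 1, 2, 3, 4, 5
       (e1)-[:NEXT]->(e2), (e2)-[:NEXT]->(e3), (e3)-[:NEXT]->(e4), (e4)-[:NEXT]->(e5),
       // Production order: 1, 3, 2, 5, 4
       (e1)-[:NEXT_IN_PRODUCTION]->(e3), (e3)-[:NEXT_IN_PRODUCTION]->(e2),
       (e2)-[:NEXT_IN_PRODUCTION]->(e5), (e5)-[:NEXT_IN_PRODUCTION]->(e4);

// Walk the production chain from its head (no incoming link) to its tail (no outgoing link).
MATCH path = (head:Episode)-[:NEXT_IN_PRODUCTION*]->(tail:Episode)
WHERE NOT ()-[:NEXT_IN_PRODUCTION]->(head)
  AND NOT (tail)-[:NEXT_IN_PRODUCTION]->()
RETURN [n IN nodes(path) | n.number] AS productionOrder;
```

Each chain is still traversed one node after the next; only the order of the episode numbers differs between the two relationship types.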

@@ -47,7 +47,7 @@ Keep in mind that graph data modeling is an iterative process.
Your initial graph data model is only a starting point.
As you learn more about your use cases or if they change, the model needs to adapt.

-Additionally, you may find that, especially when the graph scales, you need to xref:data-modeling/graph-model-refactoring.adoc[refactor] your model to ensure it is aligned with your business needs as they evolve.
+Additionally, you may find that, especially when the graph scales, you need to xref:data-modeling/tutorial-refactoring.adoc[refactor] your model to ensure it is aligned with your business needs as they evolve.
Contributor

Why is there a section here called "How to create a graph data model" when that has a whole page of its own?

Contributor Author

I reformulated this section to avoid the redundancy

lidiazuin and others added 4 commits June 5, 2025 16:21
@neo4j-docops-agent
Collaborator

This PR includes documentation updates
View the updated docs at https://neo4j-docs-getting-started-451.surge.sh

Updated pages:

4 participants