Modeling designs page review #451
base: dev
Lots of comments, but one overall thing is the structure. It is common practice to start with the smallest building blocks and then build up, meaning that I would start with presenting the different structures and then put them together, rather than starting with a graph consisting of multiple structures.
This is important content, thank you for curating it @lidiazuin
However, Neo4j allows you to effortlessly adjust detailed and broad changes across pieces or the entirety of the graph. | ||
Whether it is small changes over time or a broad definition that includes a variety of needed information about your entities, the database is able to handle it. | ||
It is simply up to the developers and architects to determine the structure of the data model and how to define entities for queries. | ||
The example shows how different structures can be combined into one graph, and how different types of questions can be answered by one single graph if you use the correct design, depending on the question. |
This sounds a bit awkward: "different types of questions can be answered by a single graph if you use the correct design, depending on the question."
One of the earliest decisions you may encounter is whether to model something as a property on a node or as a relationship to a separate node. | ||
Take, for example, the data below modeling a movie genre as a property on the `Movie` node. | ||
A monopartite graph structure consists of a single set of nodes with a single label. | ||
Most algorithms rely on this type of graph and they are very common when using spanning trees, link:{docs-home}/cypher-manual/current/patterns/shortest-paths/[shortest paths], and link:{docs-home}/graph-data-science/current/algorithms/community/[community detection]. |
Shortest path can be run on any type of graph, including bipartite and multipartite ones. Maybe use another example, like PageRank or Eigenvector?
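If PageRank is used as the example instead, a minimal sketch could look like the following. It assumes the Graph Data Science library is installed and that the monopartite graph uses a `Person` label and a `KNOWS` relationship type; the graph name `people` and the `name` property are hypothetical.

```cypher
// Project a monopartite graph: one node label, one relationship type.
// 'people', Person, and KNOWS are illustrative names, not taken from the page.
CALL gds.graph.project('people', 'Person', 'KNOWS');

// Stream PageRank scores for the projected graph, highest first.
CALL gds.pageRank.stream('people')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC;
```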
Co-authored-by: Jessica Wright <49636617+AlexicaWright@users.noreply.github.com>
This is an improvement, but the model is still not optimal and should be iterated on. | ||
When changes are done to your model, it is important to keep track of them by versioning your model. | ||
|
||
// Content on versioning is WIP. |
Please refer to #465
Well done! I like the new structure! Left some more comments, sorry.. ;)
|
||
If you plan to do analysis on individual items and return only details about that entity (like genres on a particular movie), then the first data model would serve perfectly well for your needs. | ||
However, if you need to run analysis to find common ground between entities or look at a group of nodes, then the second data model would definitely improve performance of those types of queries. | ||
Instead, you could either turn the `Role` node into a property of the `WORKED_AT` relationship or use an *intermediate node* between the `Person`, `Company`, and `Role` nodes: |
Given that we talk about hyperedges and not about creating relationships from relationships, I think we should stick to using an example that shows a relationship between more than two nodes.
If so, being able to use more than one data model is a great solution! | ||
The division and length of the time periods are set based on the context of the question you need to answer. | ||
On the top of the timeline tree is the "all time node" that represents the entire timeline. | ||
The timeline is then divided into relevant time periods, represented by the nodes below the all time node. |
I think we're missing an important piece of information here. This (from the KB article):
The rest of the data nodes, which are non-time data nodes, are the nodes that contain the important pieces of data in the graph. These nodes link into the timeline tree at the appropriate leaf node.
With this xref:data-modeling/tutorial-refactoring.adoc[refactored design], you can look only at the relationships for dates you care about and then scan for the relevant airline through the `airline` property in the `Flight` node. | ||
|
||
This is an improvement, but the model is still not optimal and should be iterated on. | ||
When changes are done to your model, it is important to keep track of them by versioning your model. |
Why is it important to keep track of versions of your model?
The example shows how different structures can be combined into one graph, and how different types of questions can be answered by one single graph if the modeling is done efficiently. | ||
It also shows the many ways nodes and structures can span out. | ||
|
||
=== Monopartite |
This feels a little abrupt? Maybe it can be introduced with a little sentence about different graph structures or something?
Co-authored-by: Jessica Wright <49636617+AlexicaWright@users.noreply.github.com>
Very useful content overall. I just see an issue with the time tree, see comments
However, Neo4j allows you to effortlessly adjust detailed and broad changes across pieces or the entirety of the graph. | ||
Whether it is small changes over time or a broad definition that includes a variety of needed information about your entities, the database is able to handle it. | ||
It is simply up to the developers and architects to determine the structure of the data model and how to define entities for queries. | ||
image::hyperedge.svg[An example of a hyperedge in which a relationship is connected to two nodes, a feature not available in Neo4j,width=400,role=popup] |
I don't get why there is a `KNOWS` relationship here whereas in the other diagram below there is an `IN_ROLE` rel.
This should have been fixed already. Is this the image that you are seeing? https://github.com/neo4j/docs-getting-started/pull/451/files#diff-c7e86415c640b7192119c42108fda32b36bb6cba201e6f5ee9d80b1a63a0b021
Each section of the class (or offering) would then become an instance of the course. | ||
Linked lists are commonly used in computer science and they are particularly useful whenever the sequence of objects matters. | ||
In this data structure, a simple-linked list is where each node links to the next node only, whereas in a double-linked list, each node links both to the next and the previous node. | ||
Neo4j *does not* support double-linked lists. |
Technically, nothing prevents you from doing this.
However, it is redundant information, as one relationship implies the other. They are semantically equivalent.
There is no performance difference between traversing in either direction.
I would rephrase this to say it is not recommended.
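To illustrate the point, here is a rough sketch of how a single `NEXT` relationship can be read in both directions. It assumes `Episode` nodes with a hypothetical `title` property and a `$title` query parameter; none of these names are confirmed by the page.

```cypher
// One NEXT relationship per pair is enough: the same relationship is
// followed forward to find the next item and backward to find the previous one.
MATCH (current:Episode {title: $title})
OPTIONAL MATCH (current)-[:NEXT]->(next:Episode)
OPTIONAL MATCH (previous:Episode)-[:NEXT]->(current)
RETURN previous.title AS previous,
       current.title  AS current,
       next.title     AS next;
```

A separate `PREVIOUS` relationship would return the same rows while doubling the relationships to maintain.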
=== Interleaved linked list | ||
|
||
There are different ways to sequence a list of items. | ||
When nodes are connected in a nonlinear fashion, they are referred to as an interleaved linked list. |
Not sure I get the question.
I don't see this as linear or not. Both relationship types just represent different information, sharing pieces of content
The tradeoff is that now you will need to maintain two models. | ||
Each time you create a new node or relationship or update pieces of the graph, you will need to make changes to accommodate both models. | ||
This can also impact query performance, as you might have double the syntax needed to update each model. | ||
== Timeline tree |
The whole time tree section needs to be removed.
It used to be useful when Neo4j did not have indexes on dates. That is no longer the case, so this pattern is obsolete.
Not sure why it ended up in the PS KB in the first place.
|
||
If you were, instead, working with a data model that uses time as the navigator instead of as the anchor, this same question would require a great deal of property lookups and a lot of inefficient gather-and-inspect. | ||
|
||
== Time-bound data |
Same as for the time tree: this is no longer true now that date indexes are available.
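As a sketch of the index-based alternative, assuming each `Flight` node stores a hypothetical `date` property of type `DATE` next to the `airline` property from the page's example (the index name and the date range are illustrative):

```cypher
// Create an index on the assumed `date` property of Flight nodes.
CREATE INDEX flight_date IF NOT EXISTS
FOR (f:Flight) ON (f.date);

// Range predicate on the date, then filter on the airline property.
MATCH (f:Flight)
WHERE f.date >= date('2024-01-01')
  AND f.date <  date('2024-02-01')
  AND f.airline = $airline
RETURN f
ORDER BY f.date;
```

Whether the planner actually uses the index for a given query can be checked by prefixing the query with `EXPLAIN`.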
The example shows how different structures can be combined into one graph, and how different types of questions can be answered by one single graph if the modeling is done efficiently. | ||
It also shows the many ways nodes and structures can span out. | ||
|
||
=== Monopartite |
I'm not sure monopartite / bipartite / multipartite brings value here.
I don't see this as a pattern, as people will choose one representation or the other according to the use case. There is no benefit of one over the others. I would see this more as a GDS topic, where a given shape must be used according to the algorithm.
And there, Submit review.. ;)
|
||
If you plan to do analysis on individual items and return only details about that entity (like genres on a particular movie), then the first data model would serve perfectly well for your needs. | ||
However, if you need to run analysis to find common ground between entities or look at a group of nodes, then the second data model would definitely improve performance of those types of queries. | ||
Instead, you could either turn the `Role` node into a property of the `WORKED_AT` relationship or use an *intermediate node* between the `Person`, `Company`, and `Role` nodes: |
In order to illustrate a hyperedge, the same relationship needs to be used to connect one node to two or more nodes.
So you need to make the `WORKED_AT` and `IN_ROLE` one and the same relationship and it needs to connect the person node to both the company and the role nodes.
.Graph model of movies and their genres | ||
image::modeling_genre_node-arr.svg[role="popup-link",400,400] | ||
The use of intermediary nodes can also answer the question "Who worked at the same company at the same time?" as the added employment event contains information about when each individual worked at a certain company. | ||
A `MATCH` clause would show that Patrick and David both worked at Acme, being colleagues from 2004 to 2005 since their employment events overlap during that time. |
Maybe we should have an example of this `MATCH` clause?
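One possible shape for such a `MATCH`, as a sketch only. It assumes the intermediate node is labeled `Employment`, carries hypothetical `startYear` and `endYear` properties, and is connected through hypothetical `HAS_EMPLOYMENT` and `AT_COMPANY` relationships; the page may use different names.

```cypher
// Assumed model:
// (:Person)-[:HAS_EMPLOYMENT]->(:Employment)-[:AT_COMPANY]->(:Company)
MATCH (p1:Person)-[:HAS_EMPLOYMENT]->(e1:Employment)-[:AT_COMPANY]->(c:Company),
      (p2:Person)-[:HAS_EMPLOYMENT]->(e2:Employment)-[:AT_COMPANY]->(c)
WHERE p1.name < p2.name              // report each pair only once
  AND e1.startYear <= e2.endYear     // two periods overlap when each one
  AND e2.startYear <= e1.endYear     // starts before the other one ends
RETURN p1.name AS person1,
       p2.name AS person2,
       c.name  AS company,
       // the overlap runs from the later start to the earlier end
       CASE WHEN e1.startYear > e2.startYear
            THEN e1.startYear ELSE e2.startYear END AS overlapFrom,
       CASE WHEN e1.endYear < e2.endYear
            THEN e1.endYear ELSE e2.endYear END AS overlapTo;
```

The comparison on `name` is only there to avoid returning the same pair twice; any stable unique property would work.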
It is much simpler than our earlier version because it uses a natural, graph pattern (entity-relationship-entity) to find the information needed. | ||
First, Cypher finds a movie and the genre it is related to, then looks for a second movie that is in that same genre. | ||
Intermediate nodes can also add value to a model by providing a way to share data and thus reduce duplicate information. | ||
In this example, Sarah sends an email to Lucy and copies David and Claire to it: |
Maybe we could mention that the content of the email is a property on every relationship?
(m2:Movie)-[:IN_GENRE]->(g) | ||
RETURN m1, m2, g | ||
---- | ||
By using a fan-out, duplication can be reduced as a property (`content`) is broken out and made into its own node (`Email` with the property `content`) instead of being repeated, in this case, on every relationship: |
By using a fan-out, duplication can be reduced as a property (`content`) is broken out and made into its own node (`Email` with the property `content`) instead of being repeated, in this case, on every relationship: | |
If you instead fan out the model, you reduce duplication by breaking out the property `content` from all relationships and turning it into the intermediary node `Email` instead. |
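A rough Cypher sketch of the fanned-out model may help here. The `User` and `Email` labels and the `content` property come from the text; the `name` property, the `SENT`, `TO`, and `CC` relationship types, and the placeholder message are assumptions made for illustration.

```cypher
// The message body is stored once, on the Email node,
// instead of being repeated on a relationship per recipient.
CREATE (sarah:User  {name: 'Sarah'}),
       (lucy:User   {name: 'Lucy'}),
       (david:User  {name: 'David'}),
       (claire:User {name: 'Claire'}),
       (email:Email {content: 'Placeholder message body'})
CREATE (sarah)-[:SENT]->(email),
       (email)-[:TO]->(lucy),
       (email)-[:CC]->(david),
       (email)-[:CC]->(claire);

// Read the single copy of the content together with everyone who received it.
MATCH (:User {name: 'Sarah'})-[:SENT]->(e:Email)-[:TO|CC]->(recipient:User)
RETURN e.content AS content, collect(recipient.name) AS recipients;
```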
If you plan to do analysis on individual items and return only details about that entity (like genres on a particular movie), then the first data model would serve perfectly well for your needs. | ||
However, if you need to run analysis to find common ground between entities or look at a group of nodes, then the second data model would definitely improve performance of those types of queries. | ||
Once the property value `content` is moved to a single node `Email`, it can be referenced via relationships with the `User` nodes that previously held that value. | ||
Now there are no copies or duplications. |
Now there are no copies or duplications. | |
Now there are no duplications. |
Inspecting the several properties of each `Flight` node could be expensive on resources. | ||
* The order in which the episodes were aired using the `NEXT` relationship and through a simple-linked list. | ||
* The order in which the episodes were produced using the `NEXT_IN_PRODUCTION` relationship, which creates an interleaved linked list. | ||
It is not a linear list, as it goes 1, 3, 2, 5, 4. |
It is still linear since it goes from one node to the next in a linear fashion. That the sequence is not in numerical order is not the same as non-linear.
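A short sketch of walking the production order may make this concrete. It assumes `Episode` nodes with a hypothetical `number` property holding the airing number; only the `NEXT_IN_PRODUCTION` relationship type is taken from the page.

```cypher
// Find the head of the production list: the episode with no incoming
// NEXT_IN_PRODUCTION relationship.
MATCH (first:Episode)
WHERE NOT ()-[:NEXT_IN_PRODUCTION]->(first)
// Walk the list one hop at a time; path length gives the position.
MATCH path = (first)-[:NEXT_IN_PRODUCTION*0..]->(e:Episode)
RETURN e.number AS airedNumber, length(path) AS productionPosition
ORDER BY productionPosition;
```

Each episode is reached by following exactly one relationship from its predecessor, which is what makes the list linear even though the airing numbers come back out of numerical order.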
@@ -47,7 +47,7 @@ Keep in mind that graph data modeling is an iterative process. | |||
Your initial graph data model is only a starting point. | |||
As you learn more about your use cases or if they change, the model needs to adapt. | |||
|
|||
Additionally, you may find that, especially when the graph scales, you need to xref:data-modeling/graph-model-refactoring.adoc[refactor] your model to ensure it is aligned with your business needs as they evolve. | |||
Additionally, you may find that, especially when the graph scales, you need to xref:data-modeling/tutorial-refactoring.adoc[refactor] your model to ensure it is aligned with your business needs as they evolve. |
Why is there a section here called "How to create a graph data model" when that topic has its own page?
I reformulated this section to avoid the redundancy
Co-authored-by: Jessica Wright <49636617+AlexicaWright@users.noreply.github.com>
…tarted into modelingdesigns
This PR includes documentation updates Updated pages: |