Authors:
-
Stefan Plantikow <stefan.plantikow@neo4j.com>
-
Mats Rydberg <mats.rydberg@neo4j.com>
-
Petra Selmer <petra.selmer@neo4j.com>
This proposal is a response to CIR-2017-174.
Currently Cypher uses pattern matching semantics that treats all patterns that occur in a MATCH
clause as a unit (called a uniqueness scope) and only considers pattern instances that bind different relationships to each fixed length relationship pattern variable and to each element of a variable length relationship pattern variable.
This has come to be called Cypermorphism informally and is a form of edge isomorphism that is based on Cypher’s notion of uniqueness scope.
Cyphermorphism lies at the intersection of returning as many results as possible while still ruling out returning an infinite number of paths when matching graphs that contain cycles with a variable length pattern.
However, the notion of uniqueness scope has proven to be non-standard and is occasionally confusing for users and cyphermorphic matching is not tractable in terms of computational complexity for some graphs.
The CIP aims to address these issues.
This CIP is based upon the notions of walks, trails and paths, as defined by Jungnickel [Jung10]. We note that walks, trails and paths are not universally used, and correspond to paths, simple paths and tidy paths as defined by Rosen [Rose02]. However, we use Jungnickel’s definitions as the terms lend themselves better from a syntactic perspective.
We define and provide an example for each of these terms below. In general, we note that every path is a trail, and every trail is a walk.
Assume graph G
is given as follows:
For the remainder of this CIP, we use the notation r-
to indicate that a relationship r
was traversed in the inverse direction; i.e. that r
was traversed from its target node to its source node.
A walk is an arbitrary, non-empty sequence of alternating nodes and relationships that starts with a node and ends with a node. Within a walk, both nodes and edges can be duplicated. A walk whose start node and end node is the same is called a closed walk.
A valid walk in G
is given by the sequence S1
: B, t, C, v, D, w, E, w-, D, v-, C, t-, B, s-, A
.
We note that S1
contains the duplicated nodes B
, C
and D
, and the duplicated relationships t
, v
and w
.
Moreover, S1
is not a closed walk, as B
(the start node) ≠ A
(the end node).
A walk corresponds to the standard definition of homomorphic graph pattern matching.
A trail is a walk that does not contain any duplicate relationships.
A valid trail in G
is given by the sequence S2
: B, t, C, v, D, w, E, x, F, y-, D, u, C
.
We note that S2
contains the duplicated notes C
and D
, and no duplicate relationships.
Moreover, S2
is not a closed trail, as B
(the start node) ≠ C
(the end node).
S2
is also a walk.
A path corresponds to the standard definition of (relationship-)isomorphic graph pattern matching as well as the notion of Cyphermorphic graph pattern matching.
A path is a trail that does not contain the same node twice unless that node is both the start node and the end node of the path.
A valid path in G
is given by the sequence S3
: B, t, C, v, D, y, F, z, A, s, B
.
We note that S3
contains no duplicated relationships, and that there are no duplicated nodes other than B
.
However, as B
is both the start node and end node, S3
is still a valid path, and is, in fact, an example of a closed path.
S3
is also an example of a closed walk and a closed trail.
A path corresponds to the standard definition of (node-)isomorphic graph pattern matching.
This CIP has been submitted in the belief that CIP2017-02-06 Path Pattern Queries will be accepted and is aligned with it.
Below is outlined the syntactical changes proposed in this CIP. Note that path patterns are assumed from the CIP2017-02-06 Path Pattern Queries.
!! Clause-level argument
Pattern = PatternGroup, { PatternGroupWithModifiers }
| { PatternGroupWithModifiers }-
;
PatternGroup = PatternPart, { ',', PatternPart } ;
!! Expected to form a connected component
PatternPart = NamedPatternPart
| AnonymousPatternPart
;
NamedPatternPart = PatternVariableSpec, AnonymousPatternPart ;
AnonymousPatternPart = {
NodePattern,
{ (EdgePattern | PathPattern), NodePattern }
}-
;
PathVariableSpec = [ [ PathClassModifier ], PathClass ], [ Variable, '=' ] ;
PathClassModifier = ('OPEN' | 'CLOSED'), 'DIRECTED'
| 'DIRECTED'
;
PathClass = 'WALK'
| 'TRAIL'
| [ 'SIMPLE' ], 'PATH'
;
PatternGroupWithModifiers = PatternGroupModifiers, PatternGroup ;
PatternGroupModifiers =
PatternQuantifier [ PatternLengthRestriction] [ PatternMorphism ]
| PatternLengthRestriction [ PatternMorphism ]
| PatternMorphism
;
PatternQuantifier = 'ANY'
| 'EACH' [ 'OF' ] !! For readability
;
PatternLengthRestriction = 'SHORTEST'
| 'CHEAPEST'
| 'UNRESTRICTED'
;
PatternMorphism = 'DIFFERENT', ('NODES' | 'VERTICES')
| 'DIFFERENT', ('RELATIONSHIPS' | 'EDGES')
| 'DIFFERENT', [ VariableList ]
| 'LENIENT'
;
VariableList = '(', Variable, { ',', Variable }, ')' ;
Expression = ...
| 'DIFFERENT', VariableList
| 'DISJOINT', VariableList
;
Basic patterns in Cypher consist of a comma-separated list of pattern parts, where a pattern part is exemplified by p = (e:Employee)-[:REPORTS_TO*1..3]→(m:Manager)
.
Pattern parts may be bound to a variable, such as p
in our example, and consist of a linear chain of connected node, relationship, and path patterns.
While Cypher allows omitting path, node, and relationship variables in a pattern, this is just syntactic sugar; i.e. all parts of a pattern should be considered to be bound to a variable name from the viewpoint of pattern matching semantics (names are either provided in the query or automatically generated by a conforming implementation).
This CIP proposes to replace the notion of uniqueness scope and Cyphermorphism and all associated rules with new, configurable pattern matching semantics.
As proposed in CIP2017-02-06 Path Pattern Queries, support for binding relationship list variables in variable length patterns will be deprecated.
This CIP proposes to deprecate the existing syntax for both shortestPath
and allShortestPaths
matching of Cypher.
Each pattern consists of one or more top-level pattern parts that are given in a comma separated list.
MATCH (a)-->(b), (c)<--(d)
RETURN *
The solution (set of successful matches) of a pattern is the cross product over the solutions of all its top-level pattern parts. Thus, if we ignore uniqueness, Query 3.3.1 is semantically equivalent to Query 3.3.2.
MATCH (a)-->(b)
// sequence of matches acts like a cross product
// for each incoming row with a and b, find all matches (c)<--(d)
MATCH (c)<--(d)
RETURN *
Binding several nodes or relationships in a pattern to the same variable describes an implicit join. Thus, queries 3.3.3 and 3.3.4 are semantically equivalent.
MATCH (a)-->()<--(a)-->(b)
RETURN a
MATCH (n1)-->(n2), (n3)<--(n4), (n5)-->(b)
WHERE n1 = n4 AND n2 = n3 AND n4 = n5
RETURN n1 AS a
This CIP proposes to allow breaking of pattern parts using whitespace in order to aid visual layout of longer patterns.
MATCH
p=(a)-->(n2)-->(n3)
(n3)<--(n4)<--(n5)-->(b)
RETURN *
This CIP proposes to name the variable to which a pattern part may be assigned a pattern variable. Note that pattern variables are always bound to a connected sequence of node, relationship, and path patterns (see Syntax).
p
and q
are examples of pattern variables. A named path pattern foo
is assumed:MATCH p = (:A)-->(:B)-[:KNOWS]->(), q = (a)-/~foo/->(b)
RETURN p, q
Three classes of pattern variables are proposed:
-
WALK
: Pattern variables of this class may only be bound to pattern instances that are walks, as defined in Walk. -
TRAIL
: Pattern variables of this class may only be bound to pattern instances that are trails, as defined in Trail. -
[SIMPLE] PATH
: Pattern variables of this class may only be bound to pattern instances that are paths, as defined in Path.
p
and q
:MATCH TRAIL p = (:A)-->(:B)-[:KNOWS]->(), SIMPLE PATH q = (a)-/~foo/->(b)
RETURN p, q
The pattern variable class may be further qualified with one of the following prefixes:
-
OPEN
: This pattern variable should only be bound to walks (or trails, or paths respectively) which begin and end at distinct nodes. -
CLOSED
This pattern variable should only be bound to walks (or trails, or paths respectively) which begin and end at the same node. -
DIRECTED
This pattern variable should only be bound to walks (or trails, or simple paths respectively) in which all relationships are traversed in the same direction
Implementations are advised to signal a warning for uses of OPEN
pattern variable classes where the two endpoints of the pattern both use the same variable name.
Implementations are advised to signal a warning for uses of CLOSED
pattern variable classes where the two endpoints of the pattern both use a different variable name.
A comma separated list of named on unnamed pattern parts is called a pattern group.
This CIP proposes to add control over the semantic of matching of pattern groups by allowing to prefix pattern groups with additional pattern group modifiers.
This CIP further proposes to redefine a pattern as a non-empty sequence of pattern groups that are separated by whitespace where all but the first pattern group is prefixed by at lease one pattern group modifier.
This CIP proposes the introduction of three kinds of pattern group modifiers:
-
A pattern quantifier controls how many matches are to be returned
-
A pattern length restriction limits the length and nature of matched paths
-
A pattern morphism specifies with which morphism the pattern group is to be matched
This CIP proposes to introduce the ANY
pattern quantifier that requests that at most one match is to be returned for the whole pattern group.
This CIP proposes to introduce the EACH
pattern quantifier that requests that all matches are to be returned for the whole pattern group.
This CIP proposes to introduce the ALL
pattern quantifier that requests that all matches are to be returned for the whole pattern group.
This CIP proposes to introduce the SHORTEST
pattern length restriction to request that only shortest paths inside the pattern group are to be considered by matching.
A path is considered to be shorter than another path if it has a smaller length.
This CIP proposes to introduce the CHEAPEST
pattern length restriction to request that only shortest cheapest inside the pattern group are to be considered by matching.
A path is considered to be cheaper than another path if it has a smaller cost in terms of the cost function specified through the use of a path pattern.
This CIP proposes to introduce the UNRESTRICTED
pattern length restriction to request that all possible paths inside the pattern group are to be considered by matching.
A path is considered to be shorter than another path if it has a smaller length.
This CIP proposes to introduce the DIFFERENT NODES
pattern morphism that requests that the pattern group is to be matched using node isomorphism (i.e. all matched nodes must be different nodes).
This CIP proposes to introduce the DIFFERENT EDGES
pattern morphism that requests that the pattern group is to be matched using relationship isomorphism (i.e. all matched relationships must be different relationships).
This CIP proposes to introduce the DIFFERENT(variables)
pattern morphism that requests that the pattern group is to be matched such that all given variables are pairwise different.
This CIP proposes to introduce the LENIENT
pattern morphism that requests that the pattern group is to be matched using homomorphism (i.e. no restrictions are placed on matched entities).
If no pattern group modifiers are given, a pattern group is to be matched using EACH OF UNRESTRICTED DIFFERENT EDGES
.
If a pattern group is to be matched using either SHORTEST
or CHEAPEST
but no pattern morphism was specified, the pattern group is to be matched using LENIENT
.
If a pattern group is to be matched using either SHORTEST
or CHEAPEST
but no pattern quantifier was specified, the pattern group is to be matched using ANY
.
This CIP finally proposes a new operator DIFFERENT(var1, var2, var3, …)
that returns true
if all pairs of variables are mutually different, returns NULL
if any variable is NULL
, and returns false
otherwise.
This CIP finally proposes a new operator DISJOINT(var1, var2, var3, …)
that returns true
if all pairs of variables pass the following test:
-
If both variables are nodes, they must be different
-
If both variables are relationships, they must be different
-
If one variable is a list and the other variable is an entity, the entity must not be contained in the list
-
If both variables are lists, all pairs of elements from both lists must pass the same test
-
If one variable is
NULL
, the whole operator returnsNULL
. -
If any variable is a path, it is treated as a list of alternating nodes and relationships
The following examples demonstrates various ways in which the newly proposed constructs may be used if this CIP is adopted.
The following graph is used:
We’ll illustrate the benefits of the new homomorphic pattern matching by means of a series of queries.
Assume we wish to know which two people have grandchildren in common, as well as the names of the grandchildren. Intuitively, we can see that the only two people in the graph who have grandchildren in common are Michael Redgrave and Rachel Kempson, and that there are two grandchildren, Natasha Richardson and Jemma Redgrave. Although Roy Redgrave is a grandfather, there is no one else in the graph who has grandchildren in common with him.
Query 4.1.1 - Current semantics: single patterns
MATCH
(grandparent1:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild),
(grandparent2:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild)
WHERE grandparent1 <> grandparent2
RETURN grandparent1.name, grandparent2.name, grandchild.name
Query 4.1.1 comprises two comma-separated matches which form a single pattern, p1
.
As the query runs under the current semantics, relationship uniqueness (aka Cyphermorphism) is applied to p1
.
This means that the :HAS_CHILD
relationship given by ()-[:HAS_CHILD]→(grandchild)
is only traversed once, which results in no rows being returned.
Query 4.1.2 - Current semantics: breaking the pattern to prevent the effects of Cyphermorphism
MATCH (grandparent1:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild)
MATCH (grandparent2:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild)
WHERE grandparent1 <> grandparent2
RETURN grandparent1.name, grandparent2.name, grandchild.name
By splitting out the matches using a separate MATCH
clause, Query 4.1.2 in effect considers two patterns, p2
and p3
.
Cyphermorphism is applied to p2
and p3
separately, which resolves the limitation inherent in Query 4.1.1.
Running Query 4.1.2 returns the following results:
+------------------------------------------------------------+
| grandparent1.name | grandparent2.name | grandchild.name |
+------------------------------------------------------------+
| Michael Redgrave | Rachel Kempson | Natasha Richardson |
| Michael Redgrave | Rachel Kempson | Jemma Redgrave |
| Rachel Kempson | Michael Redgrave | Natasha Richardson |
| Rachel Kempson | Michael Redgrave | Jemma Redgrave |
+------------------------------------------------------------+
4 rows
Query 4.1.3 - New semantics: achieving homomorphism
The method to achieve homomorphism as exemplified by Query 4.1.2 is undoubtedly effective, but is potentially unintuitive and contrived. In contrast, Query 4.1.3 uses the new default semantics for simple relationship patterns, and achieves the desired result without the need to consciously manipulate the structure of the matching clause.
MATCH EACH LENIENT
(grandparent1:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild),
(grandparent2:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild)
WHERE grandparent1 <> grandparent2
RETURN grandparent1.name, grandparent2.name, grandchild.name
+------------------------------------------------------------+
| grandparent1.name | grandparent2.name | grandchild.name |
+------------------------------------------------------------+
| Michael Redgrave | Rachel Kempson | Natasha Richardson |
| Michael Redgrave | Rachel Kempson | Jemma Redgrave |
| Rachel Kempson | Michael Redgrave | Natasha Richardson |
| Rachel Kempson | Michael Redgrave | Jemma Redgrave |
+------------------------------------------------------------+
4 rows
Query 4.1.4 - New semantics: achieving Cyphermorphism
What happens in the scenarios where the current semantics — i.e. Cyphermorphism — are desirable?
All that is required is to alter the pattern morphism in the MATCH
clause from LENIENT
to DIFFERENT RELATIONSHIPS
(which is, in fact, the default).
MATCH EACH OF DIFFERENT RELATIONSHIPS
(grandparent1:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild),
(grandparent2:Person)-[:HAS_CHILD]->()-[:HAS_CHILD]->(grandchild)
WHERE grandparent1 <> grandparent2
RETURN grandparent1.name, grandparent2.name, grandchild.name
Running Query 4.1.4 will return no results.
// MATCH p=shortestPath((a)-[:X*]->()) today becomes:
MATCH SHORTEST p = (a)-[:X*]->()
RETURN *
// MATCH p=shortestPath((a)-[:X*]->()) may be expressed using path patterns, too:
MATCH p = (a)-/:X*/->()
RETURN * LIMIT 1
// MATCH p=allShortestPaths((a)-[:X*]->()) today becomes:
MATCH EACH SHORTEST p = (a)-[:X*]->()
RETURN *
// MATCH p=allShortestPaths((a)-[:X*]->()) may be expressed using path patterns, too:
MATCH p = (a)-/:X*/->()
RETURN *
The operator DISJOINT()
may now be used to precisely express Cypher’s current pattern matching semantics:
// Today (using same uniqueness scope for pat1, pat2, and pat)
MATCH pat1 = ..., pat2 = ..., pat3 = ...
RETURN *
// This CIP
MATCH pat1 = ..., pat2 = ..., pat3 = ...
WHERE DISJOINT(rels(pat1), rels(pat2), rels(pat3))
RETURN *
This proposal adds a facility to Cypher for selecting from multiple desirable pattern matching semantics.
A moderate increase in language complexity.
A change from current pattern matching semantics. However, care has been taken to reduce the amount of broken queries and to retain access to current semantics.
MATCH EACH LENIENT
allows for non-terminating queries.
[Jung10] D. Jungnickel. Graphs, Networks and Algorithms. Springer Publishing Company, 2010.
[Rose02] K. H. Rosen. Discrete Mathematics and Its Applications. McGraw-Hill, 2002.
Related to the concepts outlined in this CIP, but peripheral to the model and semantics described, several functions are envisioned as useful for working with the various types of patterns bound to pattern variables. This section presents a few such functions:
-
isOpen(p)
: returnstrue
if the source and target nodes ofp
are distinct. -
isClosed(p)
: returnstrue
if the source and target nodes ofp
are equal. -
isDirected(p)
: returnstrue
if relationships in p are traversed in the same direction. -
toTrail(p)
: returnsp
ifp
contains no duplicate relationships,null
otherwise. -
toSimplePath(p)
: returnstoTrail(p)
ifp
contains no duplicate nodes at all or if the only two duplicate nodes are the source and target nodes ofp
,null
otherwise.