How to use Cypher Aggregations in Neo4j Graph Data Science library | by Tomaz Bratanic | Mar, 2023
Leverage the Cypher Aggregation feature to project in-memory graphs using all the flexibility and expressiveness of the Cypher query language
Cypher Aggregation is a powerful feature of the Neo4j Graph Data Science library that allows users to project an in-memory graph using a flexible and expressive approach. While it was possible to use Cypher statements to project an in-memory graph for quite some time using Cypher Projection, it lacked some features, most notably the ability to project undirected relationships. Therefore, a new approach to projecting an in-memory graph in GDS was added called Cypher Aggregation. This blog post will explore the syntax and common usage of the Cypher Aggregation projection option in the Neo4j Graph Data Science Library.
Environment setup
If you want to follow along with the examples, you can open a Graph Data Science project in Neo4j Sandbox. The project has a small dataset containing information about airports, their locations, and flight routes.
We can visualize the graph schema with the following Cypher statement:
CALL db.schema.visualization()
Projecting in-memory graphs with Cypher aggregation
First, let’s quickly revisit how the Neo4j Graph Data Science library operates.
Before we can execute any graph algorithms, we first have to project an in-memory graph. The in-memory graph does not have to be an exact copy of the stored graph in the database. We can select only a subset of the graph or, as you will learn later, project virtual relationships that are not stored in the database. After the in-memory graph is projected, we can execute as many graph algorithms as we want, and then either stream the results directly to the user or write them back to the database.
Projecting an in-memory graph with Cypher Aggregation
The Cypher Aggregation feature is part of the first step in the Graph Data Science workflow, which is projecting an in-memory graph. It offers the full flexibility of the Cypher query language to select, filter, or transform a graph during projection. The syntax of the Cypher Aggregation function is the following:
gds.alpha.graph.project(
graphName: String,
sourceNode: Node or Integer,
targetNode: Node or Integer,
nodesConfig: Map,
relationshipConfig: Map,
configuration: Map
)
Only the first two parameters (graphName and sourceNode) are mandatory. However, you need to specify both the sourceNode and targetNode parameters to define a single relationship. We will walk through most of the options you might need to help you project graphs with Cypher Aggregation.
We will start with a simple example. Let’s say we want to project all Airport nodes and the HAS_ROUTE relationship between them.
MATCH (source:Airport)-[:HAS_ROUTE]->(target:Airport)
WITH gds.alpha.graph.project('airports', source, target) AS graph
RETURN graph.nodeCount AS nodeCount,
graph.relationshipCount AS relationshipCount
The Cypher statement starts with a MATCH clause that selects the relevant graph. To define a relationship with Cypher Aggregation, we input both the source and target node.
Of course, the Cypher query language offers the flexibility to select any subset of the graph. So, for example, we could project only airports on the Oceania continent and their flight routes.
MATCH (source:Airport)-[:HAS_ROUTE]->(target:Airport)
WHERE EXISTS {(source)-[:ON_CONTINENT]->(:Continent {name:"OC"})}
AND EXISTS {(target)-[:ON_CONTINENT]->(:Continent {name:"OC"})}
WITH gds.alpha.graph.project('airports-oceania', source, target) AS graph
RETURN graph.nodeCount AS nodeCount,
graph.relationshipCount AS relationshipCount
The matching Cypher statement became slightly more complicated in this example, but the Cypher Aggregation function stayed the same. The airports-oceania graph contains 272 nodes and 973 relationships. If you are experienced with Cypher, you might notice that the above Cypher statement will not capture any airports in Oceania that don’t have flight routes with other airports in Oceania.
Suppose we want to include isolated airports in the projection as well. In that case, we need to modify the Cypher matching statement slightly.
MATCH (source:Airport)
WHERE EXISTS {(source)-[:ON_CONTINENT]->(:Continent {name:"OC"})}
OPTIONAL MATCH (source)-[:HAS_ROUTE]->(target:Airport)
WHERE EXISTS {(target)-[:ON_CONTINENT]->(:Continent {name:"OC"})}
WITH gds.alpha.graph.project('airports-isolated', source, target) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
The relationship count remains identical, while the node count has increased to 304. Therefore, 32 airports in Oceania don’t have any flight routes to other airports in Oceania.
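To see which airports those are, a plain Cypher query can list them without touching the in-memory graph. This is only a sketch: it assumes that "isolated" means no HAS_ROUTE relationship in either direction to another airport in Oceania, and it returns the city property used elsewhere in this post.

```cypher
MATCH (airport:Airport)
WHERE EXISTS { MATCH (airport)-[:ON_CONTINENT]->(:Continent {name: "OC"}) }
  // No route, in either direction, to another airport in Oceania
  AND NOT EXISTS {
    MATCH (airport)-[:HAS_ROUTE]-(other:Airport)
    WHERE EXISTS { MATCH (other)-[:ON_CONTINENT]->(:Continent {name: "OC"}) }
  }
RETURN airport.city AS city
ORDER BY city
```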
When dealing with multiple node and relationship types in a graph, we might want to retain information about node labels and relationship types during projection. Defining the node and relationship types during graph projection allows us to filter them at algorithm execution time.
CALL {
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
RETURN source, target, r
UNION
MATCH (source:Airport)-[r:IN_CITY]->(target:City)
RETURN source, target, r
}
WITH gds.alpha.graph.project('airports-labels', source, target,
{sourceNodeLabels: labels(source),
targetNodeLabels: labels(target)},
{relationshipType:type(r)}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
I prefer using the UNION clause when projecting multiple different graph patterns. However, which Cypher matching statement you use is entirely up to you. Since we are projecting two types of nodes and relationships, it is probably a good idea to retain the information about their labels and types. Therefore, we are using the sourceNodeLabels, targetNodeLabels, and relationshipType parameters. In this example, we use the existing node labels and relationship types.
However, sometimes we might want to use custom labels or relationship types during projection.
CALL {
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
RETURN source, target, r
UNION
MATCH (source:Airport)-[r:IN_CITY]->(target:City)
RETURN source, target, r
}
WITH gds.alpha.graph.project('airports-labels-custom', source, target,
{sourceNodeLabels: CASE WHEN source.city = 'Miami'
THEN 'Miami' ELSE 'NotMiami' END,
targetNodeLabels: ['CustomLabel']},
{relationshipType: CASE WHEN type(r) = 'HAS_ROUTE'
THEN 'FLIGHT' ELSE 'NOT_FLIGHT' END}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
As you can see, we can use Cypher to dynamically define the node label or relationship type, or simply hardcode it. If the logic is more complicated, the custom node label or relationship type can also be calculated in the Cypher matching statement.
CALL {
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
RETURN source, target, r,
CASE WHEN source.city = target.city
THEN 'INTRACITY' ELSE 'INTERCITY' END as rel_type
UNION
MATCH (source:Airport)-[r:IN_CITY]->(target:City)
RETURN source, target, r, type(r) as rel_type
}
WITH gds.alpha.graph.project('airports-labels-precalculated', source, target,
{sourceNodeLabels: labels(source),
targetNodeLabels: labels(target)},
{relationshipType: rel_type}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
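Because these projections retain node labels and relationship types, an algorithm can be restricted to a part of the in-memory graph at execution time. The following sketch assumes the airports-labels graph from above is still in the graph catalog and runs PageRank only on Airport nodes and HAS_ROUTE relationships, using the standard nodeLabels and relationshipTypes algorithm configuration options.

```cypher
CALL gds.pageRank.stream('airports-labels', {
  nodeLabels: ['Airport'],          // consider only Airport nodes
  relationshipTypes: ['HAS_ROUTE']  // traverse only flight routes
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).city AS city, score
ORDER BY score DESC
LIMIT 5
```

The IN_CITY relationships and City nodes remain in the projected graph, so the same projection can be reused with a different filter.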
Sometimes, we also want to project node or relationship properties.
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
WITH gds.alpha.graph.project('airports-properties', source, target,
{sourceNodeLabels: labels(source),
targetNodeLabels: labels(target),
sourceNodeProperties: {runways: source.runways},
targetNodeProperties: {runways: target.runways}},
{relationshipType: type(r), properties: {distance: r.distance}}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
The node or relationship properties are defined as a map object (dictionary or JSON object for Python or JS developers), where the key represents the projected property, and the value represents the projected value. This syntax allows us to project properties that are calculated during projection.
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
WITH gds.alpha.graph.project('airports-properties-custom', source, target,
{sourceNodeLabels: labels(source),
targetNodeLabels: labels(target),
sourceNodeProperties: {runways10: source.runways * 10},
targetNodeProperties: {runways10: target.runways * 10}},
{relationshipType: type(r),
properties: {inverseDistance: 1.0 / r.distance}}) AS graph // 1.0 avoids integer division if distance is stored as an integer
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
Again, we can use all the flexibility of Cypher to calculate any node or relationship properties. Similarly to node labels, we can also calculate the custom properties in the Cypher matching statement before passing them to the projection.
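As a minimal sketch of that approach, the statement below calculates the custom values in a WITH clause and only then hands them to the projection. The graph name airports-properties-precalculated and the intermediate variable names are chosen here for illustration.

```cypher
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
// Calculate the custom values before the projection call
WITH source, target, r,
     source.runways * 10 AS sourceRunways10,
     target.runways * 10 AS targetRunways10,
     1.0 / r.distance AS inverseDistance // 1.0 forces floating-point division
WITH gds.alpha.graph.project('airports-properties-precalculated', source, target,
  {sourceNodeLabels: labels(source),
   targetNodeLabels: labels(target),
   sourceNodeProperties: {runways10: sourceRunways10},
   targetNodeProperties: {runways10: targetRunways10}},
  {relationshipType: type(r),
   properties: {inverseDistance: inverseDistance}}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
```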
An important thing to note is that, with the current projection behavior, the engine stores the node properties when it first encounters a node and ignores them completely on subsequent encounters of the same node. Therefore, you have to be careful to calculate identical node properties for both source and target nodes. Otherwise, there may be discrepancies between what is projected and what you expect.
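One way to check for such discrepancies is to stream a projected property back and compare it with the stored value. The sketch below assumes the airports-properties graph from above and a GDS version that provides gds.graph.nodeProperty.stream (older releases expose the same functionality as gds.graph.streamNodeProperty).

```cypher
CALL gds.graph.nodeProperty.stream('airports-properties', 'runways')
YIELD nodeId, propertyValue
WITH gds.util.asNode(nodeId) AS airport, propertyValue AS projectedRunways
// Keep only nodes whose projected value differs from the stored one
WHERE airport.runways <> projectedRunways
RETURN airport.city AS city,
       airport.runways AS storedRunways,
       projectedRunways
LIMIT 10
```

If the source and target properties were calculated consistently, this query should return no rows.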
Some graph algorithms in the Neo4j Graph Data Science library expect undirected relationships. Relationships cannot be stored as undirected in the database, so the undirected orientation must be defined explicitly during graph projection.
Suppose you want to treat all projected relationships as undirected.
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
WITH gds.alpha.graph.project('airports-undirected', source, target,
{}, // nodeConfiguration
{}, // relationshipConfiguration
{undirectedRelationshipTypes: ['*']}
) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
We can use the undirectedRelationshipTypes parameter to specify which relationships should be projected as undirected. In practice, you can observe that the relationship count doubled when we projected an undirected graph.
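To compare the two counts side by side, the graph catalog can be queried directly. This sketch assumes that both the airports graph from the first example and the airports-undirected graph are still present in the catalog.

```cypher
CALL gds.graph.list()
YIELD graphName, relationshipCount
// Compare the directed and the undirected projection of the same pattern
WHERE graphName IN ['airports', 'airports-undirected']
RETURN graphName, relationshipCount
ORDER BY graphName
```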
Sometimes you might want to project a single relationship type as undirected while treating the other as directed.
CALL {
MATCH (source:Airport)-[r:HAS_ROUTE]->(target:Airport)
RETURN source, target, r
UNION ALL
MATCH (source:Airport)-[r:IN_CITY]->(target:City)
RETURN source, target, r
}
WITH gds.alpha.graph.project('airports-undirected-specific', source, target,
{},
{relationshipType:type(r)},
{undirectedRelationshipTypes: ['IN_CITY']}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
In this example, the HAS_ROUTE relationship is treated as directed, while the IN_CITY relationship is treated as undirected. When we want only specific relationship types to be treated as undirected, we must include the relationshipType parameter in the relationship configuration.
Lastly, we can also project virtual relationships. A virtual relationship is a relationship that is not stored in the database.
Suppose you want to examine the cities based on their flight connections. The database doesn’t have flight relationships between cities. Instead of creating the relationships in the database, you can calculate them during graph projection.
MATCH (sourceCity)<-[:IN_CITY]-(:Airport)-[:HAS_ROUTE]->(:Airport)-[:IN_CITY]->(targetCity)
WITH sourceCity, targetCity, count(*) AS countOfRoutes
WITH gds.alpha.graph.project('airports-virtual', sourceCity, targetCity,
{},
{relationshipType:'VIRTUAL_ROUTE', properties: {countOfRoutes: countOfRoutes}},
{}) AS graph
RETURN graph.nodeCount AS nodeCount, graph.relationshipCount AS relationshipCount
As you can observe, projecting virtual relationships is very easy with Cypher Aggregation projection. We have calculated the count of routes between various cities and added it as a relationship property in the projected graph.
Let’s calculate the most important cities based on the PageRank algorithm to finish off this blog post.
CALL gds.pageRank.stream('airports-virtual')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS city, score
ORDER BY score DESC
LIMIT 5
Summary
Cypher Aggregation is the newer option to project in-memory graphs in the Neo4j Graph Data Science library using Cypher statements. Specifically, it can be used to project undirected relationships, which is impossible with the older Cypher Projection. However, the added flexibility of selecting and transforming graphs during projection comes with a performance cost. Therefore, you should use Native Projection whenever possible for performance reasons. On the other hand, when you have specific use cases to project a particular subset of a graph, calculate custom properties, or project virtual relationships, Cypher Aggregation is your friend.