Reducing a multigraph to a single graph - relationship query with "count" too slow/crashes

catherinepatino · February 17, 2021, 4:32pm

I am trying to reduce a multigraph to a single graph using the following query for the cypher projection:

CALL gds.graph.create.cypher(
    'myGraph',
    'MATCH (n:WebContent {componentId: 0} ) RETURN id(n) AS id, labels(n) as labels,',
    'MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
     WHERE id(n1)<id(n2)
	 RETURN id(n1) AS source, id(n2) AS target, count(m) AS weight'
)

This follows the example provided in the cypher projection docs shown below:

CALL gds.graph.create.cypher(
    'coauthorship-graph',
    'MATCH (n:Author) RETURN id(n) AS id, labels(n) as labels,',
    'MATCH (p1:Author)-[:WROTE]->(a:Article)<-[:WROTE]-(p2:Author)
     RETURN id(p1) AS source, id(p2) AS target, count(a) AS weight'
)

Some notes about my data for context: There are ~38,000 WebContent nodes in this particular component (i.e. componentId 0) and ~77,000 "m" nodes (with various labels) that link these WebContent nodes together in some way. The "m" node with the most WebContent nodes linked to it has about 1,000 WebContent nodes linked to it.

The problem is that this query never ends up completing and eventually crashes the db server. I tried giving a shared label to all the "m" nodes and using (m:SharedLabel) in the query above to see if that would help, but it still didn't work. Also, I want to note that componentId is an index for WebContent nodes so this also isn't contributing to the issue.

Here's the strange thing that I'm not yet understanding. I decided to examine things outside of the cypher projection and instead just directly query the database using the same query used in the relationship projection above. This is the first query I tried:

MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)

This query did run to completion, taking a little less than 3 minutes to complete, for about 85 million pairwise calculations. I wanted to try this one to see if the pairwise duration calculations between the date properties of n1 and n2 are feasible at this scale. I realized that this query actually calculates the duration for all n1->m<-n2 instances, not just for unique n1-n2 pairs (i.e. if a certain n1-n2 pair shares 5 nodes in between them, the time difference between their date properties will be calculated 5 times). So this gave me hope that this still finished within a reasonable time.

The second query I tried was the following:

MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
With n1, n2, count(m) as related_node_count
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)

By adding the line of With n1, n2, count(m) as related_node_count, this prevents calculations on duplicate n1-n2 pairs and thus only does the calculation once on each unique n1-n2 pair no matter how many "m" nodes they have in common (this is more analogous to the cypher projection query I'm trying to do above, which returns count(m) as the relationship weight between each unique n1-n2 node pair). This query, however, never completed and ended up crashing the server, similar to what happened with the cypher projection I tried.

This seems counter-intuitive to me - shouldn't the second query here in theory be faster than the one above, since it's not performing duplicate pairwise calculations? Or is there something innately slow about counting or collecting the "m" nodes first? I feel like I'm missing something, so any insight into what is going on here would be greatly appreciated! I feel like the amount of nodes I'm running this on is relatively small in the scheme of things, so I'm not sure why this isn't working. (side note: I tried all of these queries on smaller components and they work just fine - this particular component is the largest one I have).

clem · February 17, 2021, 5:10pm

Do you have the logs? That will help diagnose the crash.

catherinepatino · February 17, 2021, 5:20pm

Here is what it's showing in the logs - it's due to an OutOfMemoryError:

Feb 17 01:24:38 ip-10-0-0-96 neo4j[179982]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Feb 17 01:24:38 ip-10-0-0-96 systemd[1]: neo4j.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 17 01:24:38 ip-10-0-0-96 systemd[1]: neo4j.service: Failed with result 'exit-code'.
Feb 17 01:24:38 ip-10-0-0-96 systemd[1]: neo4j.service: Scheduled restart job, restart counter is at 2.

clem · February 17, 2021, 5:27pm

Try increasing the memory (obviously).

Trying adding PROFILE or EXPLAIN to the query to see what's going on.

Maybe you can split the query into multiple parts and then do a UNION on the results.

catherinepatino · February 17, 2021, 9:38pm

Thanks for the reply; yeah, after looking at the results of PROFILE on a smaller component, I definitely think it's a cardinality issue and is running unnecessary duplicate operations. I'm trying to figure out how this query can be written differently to avoid this (Also, I'm not sure why the example provided in the docs - i.e. the 'coauthorship-graph' - is written in this way if it results in high cardinality).

catherinepatino · February 18, 2021, 4:25pm

Actually I think it ultimately still comes down to a memory issue, caused by the count(m) operation between n1-n2 pairs.. The execution plan when I run this on a smaller component (of about 1500 WebContent nodes) shows that it takes 22MB of memory just to perform this count operation. See the screenshot below.
eager_aggregation

So obviously on my larger component of ~38,000 WebContent nodes, this memory requirement is much higher and thus causes the crashing. This is concerning because we will eventually be needing to work with even larger components. I'm currently looking into a batching option described here.

Topic		Replies	Views
Optimizing a query issue Graph Algorithms/Graph Data Science	15	104	May 3, 2025
How to scale down Neo4j graph Newbie Questions	3	324	December 26, 2021
Fail to project graph with cypher Neo4j Graph Platform migrated	2	370	July 28, 2022
Project a graph with multiple Labels and Relationships based on Temporal value Graph Algorithms/Graph Data Science cypher , graph , multi-graph	5	742	December 18, 2023
Algorithm On Subgraph Graph Algorithms/Graph Data Science cypher , graph , data-science	1	527	January 18, 2022

Get Certified in June!

Reducing a multigraph to a single graph - relationship query with "count" too slow/crashes

Related topics