I am trying to reduce a multigraph to a single graph using the following query for the cypher projection:
CALL gds.graph.create.cypher(
'myGraph',
'MATCH (n:WebContent {componentId: 0}) RETURN id(n) AS id, labels(n) AS labels',
'MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
RETURN id(n1) AS source, id(n2) AS target, count(m) AS weight'
)
This follows the example provided in the cypher projection docs shown below:
CALL gds.graph.create.cypher(
'coauthorship-graph',
'MATCH (n:Author) RETURN id(n) AS id, labels(n) AS labels',
'MATCH (p1:Author)-[:WROTE]->(a:Article)<-[:WROTE]-(p2:Author)
RETURN id(p1) AS source, id(p2) AS target, count(a) AS weight'
)
Some notes about my data for context: There are ~38,000 WebContent nodes in this particular component (i.e. componentId 0) and ~77,000 "m" nodes (with various labels) that link these WebContent nodes together in some way. The "m" node with the most WebContent nodes linked to it has about 1,000 WebContent nodes linked to it.
The problem is that this query never completes and eventually crashes the db server. I tried giving a shared label to all the "m" nodes and using (m:SharedLabel) in the query above to see if that would help, but it still didn't work. I also want to note that componentId is indexed for WebContent nodes, so the initial node lookup shouldn't be contributing to the issue.
Here's the strange thing that I'm not yet understanding. I decided to examine things outside of the cypher projection and instead just directly query the database using the same query used in the relationship projection above. This is the first query I tried:
MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)
This query did run to completion, taking a little less than 3 minutes for about 85 million pairwise calculations. I wanted to try this one to see whether the pairwise duration calculations between the date properties of n1 and n2 are feasible at this scale. I then realized that this query actually calculates the duration for all n1->m<-n2 instances, not just for unique n1-n2 pairs (i.e. if a certain n1-n2 pair shares 5 nodes in between them, the time difference between their date properties is calculated 5 times). So the fact that it still finished within a reasonable time gave me hope.
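For a rough sense of where a row count like that comes from: with the id(n1)<id(n2) filter, an "m" node with d WebContent nodes linked to it contributes d*(d-1)/2 rows. The Python sketch below is only an illustration of that arithmetic. The helper name rows_from_degrees and the small degree list are made up, and it assumes each WebContent node links to a given "m" node through at most one relationship.

```python
from math import comb

def rows_from_degrees(degrees):
    """Total n1->m<-n2 rows (with n1 < n2), given the number of
    WebContent nodes linked to each "m" node."""
    return sum(comb(d, 2) for d in degrees)

# The largest "m" node (about 1,000 linked WebContent nodes) on its own:
hub_rows = comb(1000, 2)
print(hub_rows)  # 499500

# Hypothetical toy degrees, not taken from the real graph:
toy_rows = rows_from_degrees([3, 2, 3])
print(toy_rows)  # 7
```

Summing this over all ~77,000 "m" nodes with their real degrees would give the number of rows the first query has to process.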
The second query I tried was the following:
MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
WITH n1, n2, count(m) AS related_node_count
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)
Adding the line WITH n1, n2, count(m) AS related_node_count prevents calculations on duplicate n1-n2 pairs, so the calculation is done only once per unique n1-n2 pair no matter how many "m" nodes the pair has in common (this is more analogous to the cypher projection query I'm trying to run above, which returns count(m) as the relationship weight between each unique n1-n2 node pair). This query, however, never completed and ended up crashing the server, similar to what happened with the cypher projection I tried.
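To make the difference in shape between the two queries concrete, here is a small Python sketch on toy data. It is only a model of the two aggregation steps, not a claim about how Neo4j actually executes them. The links dictionary and all function names are made up for illustration.

```python
from itertools import combinations

def pair_rows(links):
    """Yield one (n1, n2) row per shared "m" node, with n1 < n2
    (mirrors WHERE id(n1) < id(n2))."""
    for nodes in links.values():
        yield from combinations(sorted(set(nodes)), 2)

def streaming_count(rows):
    """Shape of the first query: a single running counter over all rows."""
    total = 0
    for _ in rows:
        total += 1
    return total

def grouped_count(rows):
    """Shape of the second query: one entry per unique (n1, n2) pair is
    kept until every row has been seen, then the pairs are counted."""
    groups = {}
    for pair in rows:
        groups[pair] = groups.get(pair, 0) + 1  # plays the role of count(m)
    return len(groups), groups

# Hypothetical toy data: "m" node -> ids of the WebContent nodes linked to it.
links = {"m1": [1, 2, 3], "m2": [1, 2], "m3": [2, 3, 4]}

total = streaming_count(pair_rows(links))
unique_pairs, groups = grouped_count(pair_rows(links))
print(total, unique_pairs)  # 7 5
```

In this toy model the grouped version does less duration work (5 pairs instead of 7 rows), but it has to hold every unique pair in the groups dictionary at once, while the streaming version only ever holds one integer.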
This seems counter-intuitive to me - shouldn't the second query in theory be faster than the first, since it's not performing duplicate pairwise calculations? Or is there something innately slow about counting or collecting the "m" nodes first? I feel like I'm missing something, so any insight into what is going on here would be greatly appreciated! The number of nodes I'm running this on seems relatively small in the scheme of things, so I'm not sure why this isn't working. (Side note: I tried all of these queries on smaller components and they work just fine; this particular component is the largest one I have.)