Multiple match takes too long to execute due to node with a lot of connections

Hello,
We are trying (@julie.cavarroc) to optimize a cypher query. Here is the data model:


D is a central node. Every node D has nodes E, A and V.
Property P from node D is integer,
Property C from node A is integer
For one particular node D called D1 (node A called A1, node E called E1), I would like to retrieve all the other nodes D sharing the same node V and satisfying:

  • D1.P-50<D.P<D1.P+50,
  • A1.C = A.C,
  • D1.Date>D.Date>D1.Date-duration({hours:1})
    To be noted, the node V can have more than 400000 other nodes D
    Here is an example of a graph:

The next steps would be to calculate the JaroWinkler distance between E1 and all the nodes E linked to nodes D satisfying the above conditions.

Here is the code that I tried:

MATCH (e1:E)<-[:DE]-(d1:D)-[:DA]->(a1:A),(a2:A)<-[:DA]-(d2:D)-[:DE]->(e2:E)
WHERE  EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2)) 
	   AND  (a1.C=a2.C )
	   AND (d1.P-50<=d2.P<=d1.P+50)
       AND NOT EXISTS ((d1)-->(d2)) 
       AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1}) 
       AND id(d1) <> id(d2) 
RETURN DISTINCT d1, count(d2)

What I saw from the EXPLAIN is that the two queries in the MATCH are too consuming. I would like to select only nodes D satisfying the conditions, not all the nodes D.

Also, I have the version neo4j 4 (and not v5).

Thank you in advance for your help.

1 Like

What is wrong with:

MATCH (e1:E)<-[:DE]-(d1:D)-[:DA]->(a1:A)<-[:DA]-(d2:D)-[:DE]->(e2:E)
WHERE  EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2)) 
	   AND (d1.P-50<=d2.P<=d1.P+50)
       AND NOT EXISTS ((d1)-->(d2)) 
       AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1}) 
       AND id(d1) <> id(d2) 
RETURN DISTINCT d1, count(d2)

The nodes D do not share necesseraly nodes A. Only the property A.C should be equal.

So your (D) can be connected to multiple (A), and you only want the connections where A.C is the same?

D is connected to only one A
Yes, I want conections where A.C is the same

If I get your query

Reordering the conditions:

MATCH (e1:E)<-[:DE]-(d1:D)-[:DA]->(a1:A),(a2:A)<-[:DA]-(d2:D)-[:DE]->(e2:E)
WHERE NOT EXISTS ((d1)-->(d2)) 
      AND id(d1) <> id(d2) 
      AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1}) 
      AND (a1.C = a2.C) 
      AND EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2)) 
      AND (d1.P-50<=d2.P<=d1.P+50)
RETURN DISTINCT d1, count(d2)

Maybe other ways of organising them would make it even simpler?

Thank you for your help. I tried by reordering as mentionned but it does not work. Is there a way to filter only on nodes D satisfying the conditions ?
Here is my EXPLAIN anonymized

What do you mean by "doesn't work"?

Your query's WHERE is just a sequence of AND, which should be commutative.

For example, these should yield the same result:

WHERE id(d1) <> id(d2) AND (a1.C = a2.C) 
WHERE (a1.C = a2.C) AND id(d1) <> id(d2) 

Neo4 does not like SQL like queries with sets

Can you try with this ?

MATCH (d1)-[:DV]->(:V)<-[:DV]-(d2)
WHERE id(d1) <> id(d2)
AND NOT EXISTS ((d1)-->(d2))
AND (d1.P-50<=d2.P<=d1.P+50)
AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1})
AND EXISTS (()<-[:DE]-(d1:D)-[:DA]->())
AND EXISTS (()<-[:DE]-(d2:D)-[:DA]->())
WITH d1, d2, head([(d1:D)-[:DA]->(a1) | a1]) as a1, head([(d2:D)-[:DA]->(a2) | a2]) as a2
WITH d1, d2, a1, a2
WHERE a1.C=a2.C
RETURN DISTINCT d1, count(d2)

Also Neo4j does not like when nodes contains thousands of relationships. It is definitely better for hops and relationships.

The request works in itself. However, it takes too much time to execute

Hello Farouk,

Thank you for your help. I tried, however, I have the same problem, it takes too much time to execute.
We found a way to bypass with the following query:

MATCH (e1:E1)<-[:DE]-(d1:D)-[:DA]->(a1:A)
	CALL{
		WITH  d1, a1
	    MATCH (a2:A)<-[:A]-(d2:D)-[:E]->(e2:E)
			WHERE  EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2)) 
		    AND  (a1.C= a2.C )
			AND (d1.P-50<=d2.P<=d1.P+50)
			AND NOT EXISTS ((d1)-[]->(d2)) 
	        AND d1.DATE>=d2.DATE>=d1.DATE- duration({hours:24}) AND id(d1) <> id(d2) 
	        RETURN d2
	        }
RETURN DISTINCT d1, collect(d2) 

What do you think ?

How many nodes and relationships do you have?

I would add a distinct on d1, a1

MATCH (e1:E1)<-[:DE]-(d1:D)-[:DA]->(a1:A)
WITH distinct d1, a1

Because you can have many paths and the query will execute as many time as you have e1, d1, a1 combinations.

My approach would be to filter as much as possible using labels, id and relationships and at the end filter on properties unless you have indexes.

Usually the idea is to give an entry point in the graph.

Check also the number of relationships per node or label.

Can you give us a profile or an explain plan ? Also if you want to test quickly add a limit. Its going to speed up your first test. When a query is too slow, it is painful to watch the spinner for ages

1 Like