D is a central node. Every node D has nodes E, A and V.
Property P from node D is integer,
Property C from node A is integer
For one particular node D called D1 (node A called A1, node E called E1), I would like to retrieve all the other nodes D sharing the same node V and satisfying:
D1.P-50<D.P<D1.P+50,
A1.C = A.C,
D1.Date>D.Date>D1.Date-duration({hours:1})
To be noted, the node V can have more than 400000 other nodes D
Here is an example of a graph:
The next steps would be to calculate the JaroWinkler distance between E1 and all the nodes E linked to nodes D satisfying the above conditions.
Here is the code that I tried:
MATCH (e1:E)<-[:DE]-(d1:D)-[:DA]->(a1:A),(a2:A)<-[:DA]-(d2:D)-[:DE]->(e2:E)
WHERE EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2))
AND (a1.C=a2.C )
AND (d1.P-50<=d2.P<=d1.P+50)
AND NOT EXISTS ((d1)-->(d2))
AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1})
AND id(d1) <> id(d2)
RETURN DISTINCT d1, count(d2)
What I saw from the EXPLAIN is that the two queries in the MATCH are too consuming. I would like to select only nodes D satisfying the conditions, not all the nodes D.
MATCH (e1:E)<-[:DE]-(d1:D)-[:DA]->(a1:A)<-[:DA]-(d2:D)-[:DE]->(e2:E)
WHERE EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2))
AND (d1.P-50<=d2.P<=d1.P+50)
AND NOT EXISTS ((d1)-->(d2))
AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1})
AND id(d1) <> id(d2)
RETURN DISTINCT d1, count(d2)
MATCH (e1:E)<-[:DE]-(d1:D)-[:DA]->(a1:A),(a2:A)<-[:DA]-(d2:D)-[:DE]->(e2:E)
WHERE NOT EXISTS ((d1)-->(d2))
AND id(d1) <> id(d2)
AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1})
AND (a1.C = a2.C)
AND EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2))
AND (d1.P-50<=d2.P<=d1.P+50)
RETURN DISTINCT d1, count(d2)
Thank you for your help. I tried by reordering as mentionned but it does not work. Is there a way to filter only on nodes D satisfying the conditions ?
Here is my EXPLAIN anonymized
MATCH (d1)-[:DV]->(:V)<-[:DV]-(d2)
WHERE id(d1) <> id(d2)
AND NOT EXISTS ((d1)-->(d2))
AND (d1.P-50<=d2.P<=d1.P+50)
AND d1.DATE>=d2.DATE>=d1.DATE - duration({hours:1})
AND EXISTS (()<-[:DE]-(d1:D)-[:DA]->())
AND EXISTS (()<-[:DE]-(d2:D)-[:DA]->())
WITH d1, d2, head([(d1:D)-[:DA]->(a1) | a1]) as a1, head([(d2:D)-[:DA]->(a2) | a2]) as a2
WITH d1, d2, a1, a2
WHERE a1.C=a2.C
RETURN DISTINCT d1, count(d2)
Also Neo4j does not like when nodes contains thousands of relationships. It is definitely better for hops and relationships.
Thank you for your help. I tried, however, I have the same problem, it takes too much time to execute.
We found a way to bypass with the following query:
MATCH (e1:E1)<-[:DE]-(d1:D)-[:DA]->(a1:A)
CALL{
WITH d1, a1
MATCH (a2:A)<-[:A]-(d2:D)-[:E]->(e2:E)
WHERE EXISTS((d1)-[:DV]->(:V)<-[:DV]-(d2))
AND (a1.C= a2.C )
AND (d1.P-50<=d2.P<=d1.P+50)
AND NOT EXISTS ((d1)-[]->(d2))
AND d1.DATE>=d2.DATE>=d1.DATE- duration({hours:24}) AND id(d1) <> id(d2)
RETURN d2
}
RETURN DISTINCT d1, collect(d2)
MATCH (e1:E1)<-[:DE]-(d1:D)-[:DA]->(a1:A)
WITH distinct d1, a1
Because you can have many paths and the query will execute as many time as you have e1, d1, a1 combinations.
My approach would be to filter as much as possible using labels, id and relationships and at the end filter on properties unless you have indexes.
Usually the idea is to give an entry point in the graph.
Check also the number of relationships per node or label.
Can you give us a profile or an explain plan ? Also if you want to test quickly add a limit. Its going to speed up your first test. When a query is too slow, it is painful to watch the spinner for ages