Improving the performance of a cypher query

bill_dickenson · October 20, 2020, 5:16pm

I have a cypher query that runs a bit long. It takes 3 hours to go through a few thousand nodes. But my technical folks tell me I am using 1 processor and only about 3G of memory. I have 8 and 16g memory, with 12g swap.

What are the config settings to open this up a bit more. Thanks

andrew_bowman · October 20, 2020, 7:56pm

Here's some documentation on tuning, which includes sections for memory tuning (heap and pagecache)

However 3 hours for processing a few thousand nodes isn't normal, we would expect to see at most a couple seconds.

You can run an EXPLAIN of the query to check is profile plan. You don't want to see AllNodesScans in there, and also (if at all possible) you should avoid NodeByLabelScans too. If you're looking up nodes by label + property, then you should create indexes on those for fast lookup.

koji · October 20, 2020, 10:45pm

Hi @bill_dickenson

Most of the reason for the slowdown is due to the lack of indexes on the properties used in the match clause. Depending on the data, I don't think memory is the problem.

use EXPLAIN
check Memory settings on neo4j.conf
adds some index to the property you are using in MATCH or WHERE.

bill_dickenson · October 21, 2020, 2:28pm

This was the query

// Compare structure
MATCH (g:Compliant)
WITH collect(g.text) AS g1Names
MATCH (g:NonCompliant)
WITH apoc.coll.union(g1Names, collect(g.text)) AS uniqueNames
UNWIND uniqueNames AS dim1
UNWIND uniqueNames AS dim2
OPTIONAL MATCH p = (g1:Compliant {text: dim1})<--(g2:Compliant {text: dim2})
WITH uniqueNames, dim1, dim2, CASE WHEN p is null THEN 0 ELSE count(p) END AS edgeCount
ORDER BY dim1, dim2
WITH uniqueNames, dim1 AS g1DimNames, collect(edgeCount) AS g1Matrix
ORDER BY g1DimNames
WITH uniqueNames, g1DimNames, g1Matrix
UNWIND uniqueNames AS dim1
UNWIND uniqueNames AS dim2
OPTIONAL MATCH p = (g1:NonCompliant {text: dim1})<--(g2:NonCompliant {text: dim2})
WITH g1DimNames, g1Matrix, dim1, dim2, CASE WHEN p is NULL THEN 0 ELSE count(p) END AS edges
ORDER BY dim1, dim2
WITH g1DimNames, g1Matrix, dim1 AS g2DimNames, collect(edges) as g2Matrix
ORDER BY g1DimNames, g2DimNames
WHERE g1DimNames = g2DimNames AND g1Matrix <> g2Matrix
RETURN g1DimNames, g1Matrix, g2DimNames, g2Matrix
// Compare content
MATCH (a:Compliant)
WITH collect({text: a.text, KDM: a.KDM, node: a.node, inode: a.inode, isu: coalesce(a.isu, 0)}) AS compliantContent
MATCH (b:NonCompliant)
RETURN compliantContent, collect({text: b.text, KDM: b.KDM, node: b.node, inode: b.inode, isu: coalesce(b.isu, 0)}) AS NonCompliantContent

bill_dickenson · October 21, 2020, 2:29pm

Max can answer specific questions on it.

Cobra · October 21, 2020, 3:05pm

The objective of the first query is to compare the strucuture of two graphs:

Compliant
NonCompliant

The solution is from stackoverflow.

The second one was done to compare content.

The problem is that there are a lot graphs to compare and to compare one, it takes too long. So questions are:

is there a simple way to achieve this?
can this query be optimized?
can we compare structure and content at the same time?

Regards,
Cobra

bill_dickenson · October 21, 2020, 3:14pm

bill_dickenson · October 21, 2020, 9:31pm

Love the graph though

bill_dickenson · October 22, 2020, 2:17pm

Still looking for some help. Not even sure how to read this.

andrew_bowman · October 22, 2020, 9:50pm

There is a lot that can be improved in this query.

For starters, you're lacking indexes on :Compliant(text) and :NonCompliant(text), without these the query is going to have terrible performance with that OPTIONAL MATCH patterns.

You can probably replace the top part (matching and collecting and unioning the text values) with this:

MATCH (g)
WHERE g:Compliant OR g:NonCompliant
WITH collect(DISTINCT g.text) AS uniqueNames
...

That said the approach for getting the number of edges between these nodes isn't optimal. Too many UNWINDs blowing up your cardinality. We can probably make this much easier.

If you're using Neo4j 4.1.x we can probably use UNION within a subquery so we can post-process results:

CALL {
  MATCH (g:Compliant)<--(g2:Compliant)
  WITH g, count(g2) as compliantEdges
  RETURN g.text as text, compliantEdges, 0 as nonCompliantEdges
  UNION ALL
  MATCH (g:NonCompliant)<--(g2:NonCompliant)
  WITH g, count(g2) as nonCompliantEdges
  RETURN g.text as text, 0 as compliantEdges, nonCompliantEdges
}
RETURN text, sum(compliantEdges) as compliantEdges, sum(nonCompliantEdges) as nonCompliantEdges

If you're using an earlier version, you could leverage apoc.cypher.run() from APOC Procedures instead:

CALL apoc.cypher.run("
  MATCH (g:Compliant)<--(g2:Compliant)
  WITH g, count(g2) as compliantEdges
  RETURN g.text as text, compliantEdges, 0 as nonCompliantEdges
  UNION ALL
  MATCH (g:NonCompliant)<--(g2:NonCompliant)
  WITH g, count(g2) as nonCompliantEdges
  RETURN g.text as text, 0 as compliantEdges, nonCompliantEdges", {}) YIELD value
WITH value.text as text, value.compliantEdges as compliantEdges, value.nonCompliantEdges as nonCompliantEdges
RETURN text, sum(compliantEdges) as compliantEdges, sum(nonCompliantEdges) as nonCompliantEdges

bill_dickenson · October 22, 2020, 11:05pm

Thank you. I will point max at it for some help rewriting. Thank you again.

bill_dickenson · October 23, 2020, 11:12pm

Created the index on Compliant/NonCompliant
Swapped out the first paragraph

I am using 4.1.x so I swapped that out. Damn this thing is fast

MATCH (g)
WHERE g:Compliant OR g:NonCompliant
WITH collect(DISTINCT g.text) AS uniqueNames
CALL {
MATCH (g:Compliant)<--(g2:Compliant)
WITH g, count(g2) as compliantEdges
RETURN g.text as text, compliantEdges, 0 as nonCompliantEdges
UNION ALL
MATCH (g:NonCompliant)<--(g2:NonCompliant)
WITH g, count(g2) as nonCompliantEdges
RETURN g.text as text, 0 as compliantEdges, nonCompliantEdges
}
RETURN text, sum(compliantEdges) as compliantEdges, sum(nonCompliantEdges) as nonCompliantEdges

bill_dickenson · October 23, 2020, 11:13pm

Need to check results but its quick.

bill_dickenson · October 24, 2020, 2:15pm

Update - Thank you - yes, that solves a problem we have been facing for 3 months. Not only is it blistering fast, it's the results we need for a few different problems.

I will have a follow on but now that this doesn't take hours to run, I need some time to do my own homework. Again, thank you

andrew_bowman · October 26, 2020, 10:09pm

Just to note, if this is your current query, then you can probably get rid of the first three line and just start from the CALL, since it doesn't look like you're using uniqueNames anywhere.

bill_dickenson · October 26, 2020, 11:09pm

You are right. It must have been left over from an earlier try. Thank you ! We also swapped around some of the variables to use it. Its a very useful query now. Thanks again

Topic		Replies	Views
How to fetch millions of data faster? Cypher	30	8507	October 18, 2019
Memory settings for big databases Server performance	40	3235	August 1, 2019
Dreadfully Long Query Times Cypher performance , cypher	5	427	October 2, 2020
Extremely slow query when profile looks very good? Cypher	11	4090	October 3, 2019
A slow running cypher query Cypher performance	8	5204	February 10, 2020

Improving the performance of a cypher query

Related topics