Improving the performance of a cypher query

I have a cypher query that runs a bit long. It takes 3 hours to go through a few thousand nodes. But my technical folks tell me I am using 1 processor and only about 3G of memory. I have 8 and 16g memory, with 12g swap.

What are the config settings to open this up a bit more. Thanks

Here's some documentation on tuning, which includes sections for memory tuning (heap and pagecache)

However 3 hours for processing a few thousand nodes isn't normal, we would expect to see at most a couple seconds.

You can run an EXPLAIN of the query to check is profile plan. You don't want to see AllNodesScans in there, and also (if at all possible) you should avoid NodeByLabelScans too. If you're looking up nodes by label + property, then you should create indexes on those for fast lookup.

Hi @bill.dickenson

Most of the reason for the slowdown is due to the lack of indexes on the properties used in the match clause. Depending on the data, I don't think memory is the problem.

  • use EXPLAIN
  • check Memory settings on neo4j.conf
  • adds some index to the property you are using in MATCH or WHERE.

This was the query

// Compare structure
MATCH (g:Compliant)
WITH collect(g.text) AS g1Names
MATCH (g:NonCompliant)
WITH apoc.coll.union(g1Names, collect(g.text)) AS uniqueNames
UNWIND uniqueNames AS dim1
UNWIND uniqueNames AS dim2
OPTIONAL MATCH p = (g1:Compliant {text: dim1})<--(g2:Compliant {text: dim2})
WITH uniqueNames, dim1, dim2, CASE WHEN p is null THEN 0 ELSE count(p) END AS edgeCount
ORDER BY dim1, dim2
WITH uniqueNames, dim1 AS g1DimNames, collect(edgeCount) AS g1Matrix
ORDER BY g1DimNames
WITH uniqueNames, g1DimNames, g1Matrix
UNWIND uniqueNames AS dim1
UNWIND uniqueNames AS dim2
OPTIONAL MATCH p = (g1:NonCompliant {text: dim1})<--(g2:NonCompliant {text: dim2})
WITH g1DimNames, g1Matrix, dim1, dim2, CASE WHEN p is NULL THEN 0 ELSE count(p) END AS edges
ORDER BY dim1, dim2
WITH g1DimNames, g1Matrix, dim1 AS g2DimNames, collect(edges) as g2Matrix
ORDER BY g1DimNames, g2DimNames
WHERE g1DimNames = g2DimNames AND g1Matrix <> g2Matrix
RETURN g1DimNames, g1Matrix, g2DimNames, g2Matrix
// Compare content
MATCH (a:Compliant)
WITH collect({text: a.text, KDM: a.KDM, node: a.node, inode: a.inode, isu: coalesce(a.isu, 0)}) AS compliantContent
MATCH (b:NonCompliant)
RETURN compliantContent, collect({text: b.text, KDM: b.KDM, node: b.node, inode: b.inode, isu: coalesce(b.isu, 0)}) AS NonCompliantContent

Max can answer specific questions on it.

The objective of the first query is to compare the strucuture of two graphs:

  • Compliant
  • NonCompliant

The solution is from stackoverflow.

The second one was done to compare content.

The problem is that there are a lot graphs to compare and to compare one, it takes too long. So questions are:

  • is there a simple way to achieve this?
  • can this query be optimized?
  • can we compare structure and content at the same time?

Regards,
Cobra

Love the graph though

Still looking for some help. Not even sure how to read this.

There is a lot that can be improved in this query.

For starters, you're lacking indexes on :Compliant(text) and :NonCompliant(text), without these the query is going to have terrible performance with that OPTIONAL MATCH patterns.

You can probably replace the top part (matching and collecting and unioning the text values) with this:

MATCH (g)
WHERE g:Compliant OR g:NonCompliant
WITH collect(DISTINCT g.text) AS uniqueNames
...

That said the approach for getting the number of edges between these nodes isn't optimal. Too many UNWINDs blowing up your cardinality. We can probably make this much easier.

If you're using Neo4j 4.1.x we can probably use UNION within a subquery so we can post-process results:

CALL {
  MATCH (g:Compliant)<--(g2:Compliant)
  WITH g, count(g2) as compliantEdges
  RETURN g.text as text, compliantEdges, 0 as nonCompliantEdges
  UNION ALL
  MATCH (g:NonCompliant)<--(g2:NonCompliant)
  WITH g, count(g2) as nonCompliantEdges
  RETURN g.text as text, 0 as compliantEdges, nonCompliantEdges
}
RETURN text, sum(compliantEdges) as compliantEdges, sum(nonCompliantEdges) as nonCompliantEdges

If you're using an earlier version, you could leverage apoc.cypher.run() from APOC Procedures instead:

CALL apoc.cypher.run("
  MATCH (g:Compliant)<--(g2:Compliant)
  WITH g, count(g2) as compliantEdges
  RETURN g.text as text, compliantEdges, 0 as nonCompliantEdges
  UNION ALL
  MATCH (g:NonCompliant)<--(g2:NonCompliant)
  WITH g, count(g2) as nonCompliantEdges
  RETURN g.text as text, 0 as compliantEdges, nonCompliantEdges", {}) YIELD value
WITH value.text as text, value.compliantEdges as compliantEdges, value.nonCompliantEdges as nonCompliantEdges
RETURN text, sum(compliantEdges) as compliantEdges, sum(nonCompliantEdges) as nonCompliantEdges

Thank you. I will point max at it for some help rewriting. Thank you again.

Created the index on Compliant/NonCompliant
Swapped out the first paragraph

I am using 4.1.x so I swapped that out. Damn this thing is fast

MATCH (g)
WHERE g:Compliant OR g:NonCompliant
WITH collect(DISTINCT g.text) AS uniqueNames
CALL {
MATCH (g:Compliant)<--(g2:Compliant)
WITH g, count(g2) as compliantEdges
RETURN g.text as text, compliantEdges, 0 as nonCompliantEdges
UNION ALL
MATCH (g:NonCompliant)<--(g2:NonCompliant)
WITH g, count(g2) as nonCompliantEdges
RETURN g.text as text, 0 as compliantEdges, nonCompliantEdges
}
RETURN text, sum(compliantEdges) as compliantEdges, sum(nonCompliantEdges) as nonCompliantEdges

Need to check results but its quick.

Update - Thank you - yes, that solves a problem we have been facing for 3 months. Not only is it blistering fast, it's the results we need for a few different problems.

I will have a follow on but now that this doesn't take hours to run, I need some time to do my own homework. Again, thank you

1 Like

Just to note, if this is your current query, then you can probably get rid of the first three line and just start from the CALL, since it doesn't look like you're using uniqueNames anywhere.

You are right. It must have been left over from an earlier try. Thank you ! We also swapped around some of the variables to use it. Its a very useful query now. Thanks again