Optimising query performance with a relatively simple match

Dan_Keeley · July 2, 2020, 1:01pm

Hi,

We have a simple setup where there are two types of nodes , a base node and another node which is an actor that can link the base nodes together.

All i want to do, is that from a given base node, to know how many other nodes i can get to that have a particular property set.

MATCH(n:Base{baseNumber:"123456"})-[*..2]-(n2:Base) WHERE n2.property > 2 RETURN COUNT(n2)

The query works. but it's not fast. In some cases 4-5s to run, because it's processing millions of nodes. The profile shows no differences whether i add an index on "property" or not.

So can this be tuned further, or must I change the model somehow? I've checked the database size etc and it all ought to be fitting in ram. When i batch run these queries the system ends up sitting at 100% cpu - it gets through them, without too much pain, but they're just taking too long.

To clarify, I can query the data for 2000 nodes (individual queries), in between 7 and 13 minutes.

dave.pirie · July 2, 2020, 3:00pm

Indexes are only used to find the starting nodes in a match, like the (n) in your query.

The problem is likely in your undirected, variable-length relationship. Are your relationships really directionless? Without a label on the relationship, it is considering all relationships from that node.

It is also helpful to profile the query by adding "PROFILE" before the query. It will show exactly how the query is processed and where cardinality has exploded and slowed your query down.

Dan_Keeley · July 3, 2020, 6:45am

Sadly in this case yes - we're just looking for any connection. hence i do want any relationship from that node.

yes, the PROFILE did show the explosion.

So; I suspect this is wrong, but i could resolve this by creating the relationships based on the node attribute i'm filtering on right? then i wouldn't have to get to the node before "filtering" it. It would lead to some weird relationship names, and feels wrong, but at least you wouldnt have to go to all the nodes at the final level. you'd still need to go to all the nodes at the first level.

dave.pirie · July 3, 2020, 3:43pm

Reducing cardinality in the query is a delicate balancing act, but a critical exercise in making queries like yours return in a reasonable time. There are many presentations on Cypher Query Optimization.

Since you are really only interested in the presence of one or two relationships between the n and n2, match the two endpoints and filter on the existence of the relationships.

PROFILE MATCH (n:Base {baseNumber:"123456"})
MATCH (n2:Base) 
WHERE n.property > 2 
     AND ( (n)--(n2) OR (n)--(:Base)--(n2) )
RETURN COUNT(n2);

In the Profile, you'll still see a huge expansion for the second predicate. Anything longer than two relationships, should likely use apoc.path.expand(), or one of the other apoc.path functions, depending on your data model. It may still be a worthwhile exercise to try apoc.path.expand() on your data set.

Topic		Replies	Views
Tuning Cypher queries by understanding cardinality Cypher performance , cypher , knowledge-base	0	1002	August 23, 2018
Querying relationships slow performance Cypher performance , cypher , relationship	4	2099	October 15, 2020
Writing out performant queries for n-relationship queries on a single node where n > 2 Cypher	20	3548	October 14, 2019
Issues with cardinality Neo4j Graph Platform cypher	5	350	June 2, 2021
Optimizing a Query Cypher	1	306	October 30, 2020

Optimising query performance with a relatively simple match

Related topics