Best practices for queries that can take hours to complete

What is the right way to execute and get results from a query that takes hours to process? Or is just Neo4j not built to do such things? Or is there some hidden timeout limit that silently kills such queries? Or do I need the enterprise edition?

Obviously running the query in browser and waiting for the result is out of the question.

I had some success with wrapping the query into apoc.export.cypher.query() and executing it locally through cypher-shell running in background using '&' on Linux as suggested here Cypher-Shell - how to run testquery.cypher in background. Works for queries taking e.g. 20 minutes but more demanding queries seems to just die silently without leaving any trace in the logs.

Hi @marek.kasle.na

There are many things that may cause a query to take a long time to run. Usually, some query optimisation and/or model refactoring can make huge impacts on this.

Are you able to share your queries/talk about what optimisations you have done to them?

Also, how much memory have you configured for the queries to run?

Cheers,

Lju

see the very good video:


which also points to:

Not knowing specifics of your problem, here are some random ideas (stabs in the dark and I don't mean to insult your intelligence if you already know about these...):

  • Make sure you have indexes created for things that you are filtering on. Also potentially useful is composite indexes: https://maxdemarzi.com/2020/02/19/composite-indexes-in-neo4j-4-0/#more-5696
  • Use integers or floats where possible (e.g. toInteger() and toFloat() and store them in the DB before doing the query)
  • Avoid cartesian products: Instead of MATCH(a:A),(b:B) WHERE ... use MATCH(a:A) MATCH(b:B) WHERE...
  • Use Labels in your queries. MATCH(a)-[r]-> ... is more expensive than MATCH(a:A)-[r:R]->...
  • Try parallelism: https://neo4j.com/labs/apoc/4.2/cypher-execution/parallel/ (I haven't tried it myself, so I'm not sure how it works exactly.)
  • figure out how to winnow down the possible matches as much as possible and as soon as possible in the query.
  • I suspect this is true: take advantage of NULL values instead of setting things to empty strings or 0. I believe that property looks will go faster if Neo4J sees that there is no property, instead of looking up the property and then getting its value to do the comparison. E.g. length(a.property)=0 is more expensive than a.property IS NULL but I don't know how much more expensive it is.

I hope this helps. If you could show us your query (along with some statistics: call apoc.meta.stats and :schema) perhaps we could better be able to help you.

Thank you both for valuable tips. Now lets say, hypothetically, that the query is as optimized as it can be and I am perfectly ok with it running for 2 hours before I get the result because it is some complex computation or extraction of a large subgraph that simply takes a lot of time.
Now is there some obstacle, like some default dbms.transaction.timeout that will stop me from getting the result from Neo4j?

Another link on query optimizations:

I wasn't clearly your question on time out...

But if you set the timeout to 0, it won't timeout:

1 Like

I found another interesting video. What is very interesting is at minute 15:30.

If your query has TWO things that you are filtering and that has an index, then it's better to force Neo4J to use both indexes.

I'm surprised by this!. (I tried this with version 4.2)

Their example using the Movie DB. Instead of this:

PROFILE MATCH p = (p1:Person)-[:ACTED_IN*6]-(p5:Person)
WHERE p1.name='Tom Cruise' and p5.name='Kevin Bacon' 
RETURN [n in nodes(p) | coalesce(n.title, n.name)]

do this:

PROFILE MATCH p = (p1:Person)-[:ACTED_IN*6]-(p5:Person) 
USING INDEX p1:Person(name)
USING INDEX p5:Person(name)
WHERE p1.name='Tom Cruise' and p5.name='Kevin Bacon' 
RETURN [n in nodes(p) | coalesce(n.title, n.name)]

In this hypothetical situation, provided that you've managed your query in such a way that the memory load is controlled, yes, you could set configuration in such a way to ensure the query runs for as long as it needs to