Best practices for queries that can take hours to complete

marek_kasle_na · November 26, 2020, 3:21pm

What is the right way to execute and get results from a query that takes hours to process? Or is just Neo4j not built to do such things? Or is there some hidden timeout limit that silently kills such queries? Or do I need the enterprise edition?

Obviously running the query in browser and waiting for the result is out of the question.

I had some success with wrapping the query into apoc.export.cypher.query() and executing it locally through cypher-shell running in background using '&' on Linux as suggested here Cypher-Shell - how to run testquery.cypher in background. Works for queries taking e.g. 20 minutes but more demanding queries seems to just die silently without leaving any trace in the logs.

lju · November 27, 2020, 12:24pm

Hi @marek_kasle_na

There are many things that may cause a query to take a long time to run. Usually, some query optimisation and/or model refactoring can make huge impacts on this.

Are you able to share your queries/talk about what optimisations you have done to them?

Also, how much memory have you configured for the queries to run?

Cheers,

Lju

clem · November 27, 2020, 5:22pm

see the very good video:

which also points to:

Execution plans - Cypher Manual
Tag performance - Knowledge Base
Query tuning - Cypher Manual (especially part 2.)

Not knowing specifics of your problem, here are some random ideas (stabs in the dark and I don't mean to insult your intelligence if you already know about these...):

Make sure you have indexes created for things that you are filtering on. Also potentially useful is composite indexes: Composite Indexes in Neo4j 4.0 | Max De Marzi
Use integers or floats where possible (e.g. toInteger() and toFloat() and store them in the DB before doing the query)
Avoid cartesian products: Instead of MATCH(a:A),(b:B) WHERE ... use MATCH(a:A) MATCH(b:B) WHERE...
Use Labels in your queries. MATCH(a)-[r]-> ... is more expensive than MATCH(a:A)-[r:R]->...
Try parallelism: Parallel Cypher Execution - APOC Extended Documentation (I haven't tried it myself, so I'm not sure how it works exactly.)
figure out how to winnow down the possible matches as much as possible and as soon as possible in the query.
I suspect this is true: take advantage of NULL values instead of setting things to empty strings or 0. I believe that property looks will go faster if Neo4J sees that there is no property, instead of looking up the property and then getting its value to do the comparison. E.g. length(a.property)=0 is more expensive than a.property IS NULL but I don't know how much more expensive it is.

I hope this helps. If you could show us your query (along with some statistics: call apoc.meta.stats and :schema) perhaps we could better be able to help you.

marek_kasle_na · November 27, 2020, 6:42pm

Thank you both for valuable tips. Now lets say, hypothetically, that the query is as optimized as it can be and I am perfectly ok with it running for 2 hours before I get the result because it is some complex computation or extraction of a large subgraph that simply takes a lot of time.
Now is there some obstacle, like some default dbms.transaction.timeout that will stop me from getting the result from Neo4j?

clem · November 27, 2020, 11:27pm

Another link on query optimizations:

I wasn't clearly your question on time out...

But if you set the timeout to 0, it won't timeout:

clem · November 29, 2020, 2:55am

I found another interesting video. What is very interesting is at minute 15:30.

If your query has TWO things that you are filtering and that has an index, then it's better to force Neo4J to use both indexes.

I'm surprised by this!. (I tried this with version 4.2)

Their example using the Movie DB. Instead of this:

PROFILE MATCH p = (p1:Person)-[:ACTED_IN*6]-(p5:Person)
WHERE p1.name='Tom Cruise' and p5.name='Kevin Bacon' 
RETURN [n in nodes(p) | coalesce(n.title, n.name)]

do this:

PROFILE MATCH p = (p1:Person)-[:ACTED_IN*6]-(p5:Person) 
USING INDEX p1:Person(name)
USING INDEX p5:Person(name)
WHERE p1.name='Tom Cruise' and p5.name='Kevin Bacon' 
RETURN [n in nodes(p) | coalesce(n.title, n.name)]

lju · December 4, 2020, 12:23pm

In this hypothetical situation, provided that you've managed your query in such a way that the memory load is controlled, yes, you could set configuration in such a way to ensure the query runs for as long as it needs to

Topic		Replies	Views
Best practices for queries that can take hours to complete Neo4j Graph Platform migrated , cypher-tagged	2	151	December 7, 2022
Took 11 hours to finish running query. Need Help Query Tuning an APOC Function to Update Graph Import / Export	7	942	October 13, 2021
Help with optimizing query Cypher performance , browser , cypher , operations	3	432	September 21, 2021
Query taking unusually long to complete Neo4j Graph Platform performance , cypher	16	858	July 14, 2021
Execution time of cypher query Cypher cypher	5	5728	December 26, 2021

July Summer Fun!

Best practices for queries that can take hours to complete

Related topics