Optimising the cypher query

I have a query that fetches that the information from a specific traversal path.

It builds a path in a specific format by fetching tag from each node in this manner A/b/c/d

Okay now the query is working well on neo4j browser. Like 500 ms after giving like 20 keywords in list and forming a path.

When I'm integrating it with python code, there are 50 tID that have 100 keywords each as a list.

These 100 keywords can be there or it can't be there in the DB.

So for 50 tIDs having 100 keywords(list) how much time it should take to form a path and then return the results by traversal.

What are the approaches that I can use to optimise it

Hi @anshulchaintha7 , thanks for your post. To help you out, it would be helpful if you provide some more information.

  1. Can you share your query ?
  2. Is your path length static or dynamic ?
  3. Are you comapring the tag of each node against the list with an IN , or is the ordering of tags related to the traversal depth ?
  4. What is the tID? The id to start the traversal?
  5. How should your result look like?

It’s hard to answer a question of how much time it will take, since this depends on a lot of things (DBMS setup, indexes, etc.)

Looking forward to help after you shared some more insights. Thanks!

Hi @christopher.rost ,
Greetings!
Sorry for the late reply. Below are the answers to your questions along with the exact query I’m using.

MATCH (root:Root {name: "RootCategory"})
MATCH path = (root)-[:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4*1..6]->(l:Tier)
WHERE ANY(tierNode IN nodes(path)
WHERE tierNode:Tier AND tierNode.name IN $keywords)
RETURN path

Path length: dynamic (1 to 6 levels)

tID: this is the identifier that contains keywords linked to it

We check if any node in the traversal path has a name present in $keywords

$keywords contains up to 100 values per tID

The query is executed for multiple tIDs (around 50), each having its own keyword list.

Result: I need the traversal path (or a formatted path string like A/B/C/D).

My main concern is performance when executing this for multiple tIDs, each with up to 100 keywords. I would appreciate suggestions on query optimization strategies, indexing recommendations, whether restructuring the query would help, and best practices when handling multiple keyword lists in Python with the Neo4j driver.

You have variable paths (which are complex) and you are calculating for each node if the name is within a list of up to 100 keywords

The query looks relatively simple, you have a few redundancies but nothing that would affect the query.

MATCH path = (root:Root {name: "RootCategory"})-[:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4*1..6]->(l:Tier)
  WHERE ANY(tierNode IN nodes(path)
    WHERE tierNode.name IN $keywords)
RETURN path

The most evident suggestion is to add an index on Tier.name, or perhaps add a numeric hash property and to compare numbers rather than variable length strings ...

@joshcornejo, Thank you for the suggestion.

One possible short-term solution would be to store the resolved path directly on the node to minimize repeated traversals. However, we still need to preserve the hierarchical structure within the graph.

Given this constraint, what approaches would you recommend to improve performance while maintaining the hierarchy and still supporting multi-level traversals efficiently?

Preserving the structure inside the node probably won't work (if you add nodes further down the graph, it will be stale very quickly and now you have to "erase" all upstream stored paths), just makes it more complicated.

You could cache the results elsewhere to minimise the amount of times the same query runs - same results as above, same short comings, but you won't be fragmenting the database (if those paths can get huge, those strings will be creating fragments that need to be managed).

Did you try adding indices?

Yes, the proper indexes are already in place, so we’re not dealing with label scans.

The issue isn’t start-node lookup, it’s repeated multi-level traversals over large subtrees. When a query starts near the top, expanding all descendants becomes expensive as depth and branching grow.

I’m also required to use variable-length traversal, since the relationships are unique and optional segments may be missing, I can’t explicitly enumerate each hop.

Storing resolved paths isn’t viable either. For example, if we have A → B → C → D and C is deleted, every stored path containing that segment would need to be updated, causing cascading writes and consistency risks.

So the core challenge is reducing repeated deep expansions without introducing stale or high-maintenance denormalized path data.

Yeah you have a challenge where you have to trust the DB ... maybe you need more memory / different configuration / more processors if you want better performance.

Hello hello!

It may be a bit convoluted, but you can try somehing like this on latest versions (assuming name indexes on Root and Tier)


MATCH (root:Root {name: "RootCategory"})
CALL(root){
  MATCH(i:Tier WHERE i.name IN $keywords)
  WITH i
  MATCH path1 = (root)-[:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4*1..6]->(i)
  RETURN path1, i
}
RETURN path1, i
NEXT
LET l1=length(path1)
CALL(i, l1){
  MATCH path2 = (i)(
      ()-[r:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4]->()
  ){1,5}(f:Tier)
  WHERE allReduce(acc=0, rel IN r | 1 + acc, acc <= 6 - l1)
  RETURN path2
}
RETURN [path1, path2] as path