Hi @christopher.rost ,
Greetings!
Sorry for the late reply. Below are the answers to your questions along with the exact query I’m using.
MATCH (root:Root {name: "RootCategory"})
MATCH path = (root)-[:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4*1..6]->(l:Tier)
WHERE ANY(tierNode IN nodes(path)
WHERE tierNode:Tier AND tierNode.name IN $keywords)
RETURN path
Path length: dynamic (1 to 6 levels)
tID: this is the identifier that contains keywords linked to it
We check if any node in the traversal path has a name present in $keywords
$keywords contains up to 100 values per tID
The query is executed for multiple tIDs (around 50), each having its own keyword list.
Result: I need the traversal path (or a formatted path string like A/B/C/D).
My main concern is performance when executing this for multiple tIDs, each with up to 100 keywords. I would appreciate suggestions on query optimization strategies, indexing recommendations, whether restructuring the query would help, and best practices when handling multiple keyword lists in Python with the Neo4j driver.
You have variable paths (which are complex) and you are calculating for each node if the name is within a list of up to 100 keywords
The query looks relatively simple, you have a few redundancies but nothing that would affect the query.
MATCH path = (root:Root {name: "RootCategory"})-[:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4*1..6]->(l:Tier)
WHERE ANY(tierNode IN nodes(path)
WHERE tierNode.name IN $keywords)
RETURN path
The most evident suggestion is to add an index on Tier.name, or perhaps add a numeric hash property and to compare numbers rather than variable length strings ...
One possible short-term solution would be to store the resolved path directly on the node to minimize repeated traversals. However, we still need to preserve the hierarchical structure within the graph.
Given this constraint, what approaches would you recommend to improve performance while maintaining the hierarchy and still supporting multi-level traversals efficiently?
Preserving the structure inside the node probably won't work (if you add nodes further down the graph, it will be stale very quickly and now you have to "erase" all upstream stored paths), just makes it more complicated.
You could cache the results elsewhere to minimise the amount of times the same query runs - same results as above, same short comings, but you won't be fragmenting the database (if those paths can get huge, those strings will be creating fragments that need to be managed).
Yes, the proper indexes are already in place, so we’re not dealing with label scans.
The issue isn’t start-node lookup, it’s repeated multi-level traversals over large subtrees. When a query starts near the top, expanding all descendants becomes expensive as depth and branching grow.
I’m also required to use variable-length traversal, since the relationships are unique and optional segments may be missing, I can’t explicitly enumerate each hop.
Storing resolved paths isn’t viable either. For example, if we have A → B → C → D and C is deleted, every stored path containing that segment would need to be updated, causing cascading writes and consistency risks.
So the core challenge is reducing repeated deep expansions without introducing stale or high-maintenance denormalized path data.
Yeah you have a challenge where you have to trust the DB ... maybe you need more memory / different configuration / more processors if you want better performance.
It may be a bit convoluted, but you can try somehing like this on latest versions (assuming name indexes on Root and Tier)
MATCH (root:Root {name: "RootCategory"})
CALL(root){
MATCH(i:Tier WHERE i.name IN $keywords)
WITH i
MATCH path1 = (root)-[:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4*1..6]->(i)
RETURN path1, i
}
RETURN path1, i
NEXT
LET l1=length(path1)
CALL(i, l1){
MATCH path2 = (i)(
()-[r:REL_T1_TO_T2|REL_t2_TO_t3|REL_t3_TO_t4]->()
){1,5}(f:Tier)
WHERE allReduce(acc=0, rel IN r | 1 + acc, acc <= 6 - l1)
RETURN path2
}
RETURN [path1, path2] as path