How to optimize Query that runs out of memory

I’ve got a query that’s slow and runs out of memory, probably because there are way too many paths. A simplified version of the query looks like this:

MATCH p=(a:A)-->(:B)<-[:Flow*0..5]-(:B)-->(c:C) RETURN p, a, c

The number of :Cs is fairly limited. Maybe a few dozen or so. But there are hundreds of thousands of :Bs that eat up all my memory. I’m not interested in the Bs, I’m only interested in the As and Cs, but I need the Bs because they connect them.

Actually, this query isn’t quite right, because I’m also interested in the Cs that connect to the intermediate Bs and I want to have relationships between those intermediate Cs and the final C. So maybe the query should be more like:

MATCH p=((a:A)-->(:B)-->(c1:C)), MATCH q=((c1)<--(:B)<-[:FLOW]-(:B)-->c2:C)
...etc

But that’s a mess, and I can’t get the variable depth that way. Although maybe that would get rid of the excessive Bs? I’m not sure.

Maybe I should add additional relationships between the As and Cs and the Cs and Cs if there’s a :FLOW connection between their Bs, but I’d rather not add additional relationships if they’re not in the original data. Although the more I think about it, the more I think that might be the only way to get the result I want. I’m stil feeling my way around this problem; my Cypher are a bit rusty.

So I guess my questions are:

  • Is there a way to optimize that first query so it won’t follow every possible redundant path through all the Bs? When A connects to 20 Bs which all connect to the same C, I don’t need all 20 paths, but just one.
  • Is there a way to return a virtual relationship that isn’t actually in the data? Like a relationship between C and A, or C and C, that doesn’t exist in the database?

The more I think about it, the more I think I’ll end up going with some preprocessing in neo4j like:

MATCH (a:A)-->(:B)-->(c:C) MERGE (c)-[:VIRTUAL_FLOW]->(a)
MATCH (a:A)-->(:B)<-[:FLOW]-(:B)-->(c:C) MERGE (c)-[:VIRTUAL_FLOW]->(a)
MATCH (c1:C)-->(:B)<-[:FLOW]-(:B)-->(c2:C) MERGE (c2)-[:VIRTUAL_FLOW]->(c1)

But is there a way to avoid adding those extra relationships without running out of memory?

One option, when the total number of (:C)s is smaller than the number of (:B)s connected to the (:A), might be this:
MATCH (a:A), (c:C) WHERE EXISTS { (a)-[:REL_TO_B]->(:B)-[:REL_TO_C]->(c) } RETURN a, c

Whether that’s actually the case, is something I’ll have to figure out.

I’ll also have a look at shortestPath(). I’m not very familiar with it yet, but it sounds like it should return a single path.

shortestPath() is to tell you what’s the shortest path between 2 nodes (“in a forest”).

Why are you returning the paths if you say you only need a and c ? maybe that’s why you are running out of memory?

It’s not quite true that I only need a and c. I also need the relationships between a and c, and those relationships are indirect through b. But I don’t actually want to show b in the frontend, and instead pretend there’s a direct relationship between a and c. And between the various c’s.

And I don’t want just those relationships, but also other data attached to various nodes attached to a and c. But maybe I should make that separate queries.

What I’m currently doing is to create those virtual relationships between a and c in neo4j, and that works very well. Except that it adds data that doesn’t really exist and hides data that does. But I don’t think this is ever going to be a single-stage process anyway.

The only downside is that with so many B’s, it’s going to MERGE every virtual relationship many times over, but that is surprisingly not that slow, and it only needs to be done once.

So I guess that’s my solution: create more relationships that serve specific frontend purposes.

If you only one path and not concerned with the specific path, then I would use shortest path function as @joshcornejo suggested.

You can the use APOC virtual relationships to create a virtual relationship between you end nodes and return it in your results.

Responding to this again, because the solution I’d found turns out not to be good enough. The virtual relationships I’d added to the database can accidentally connect and therefore show more data than is actually connected. I’ll expand the slightly obfuscated data structure a bit.

I’ve got two types of node I want to show: the :Origin that was initially selected, and the :Connection that I want to show only when there’s a connection. That connection runs through :Sets that consist of :Elements which in turn connect to :OtherElements that are part of :OtherSets. And that other set connects to the :Connection. (Yes, Set/Element are slightly different from OtherSet/OtherElement; I don’t think it’s relevant, but I’m including it just in case. There are a few hundred thousand OtherElements; far more than Elements.)

It’s possible that not all elements connect to other elements of the same set. Consider this case:

(o:Origin)-->(a:Connection)-->(b:Connection)

They’re only connected through these underlying sets and elements, and it’s the elements from a that connect to b don’t actually connect to o, even if there are other elements from a that do connect to o and not to b. In that case, I don’t want to see b. This is why my solution with the [:VIRTUAL_FLOW] doesn’t work. I’ll have to follow the actual elements.

One thing I’ve done is take this one step at a time. If (b:Connection) connects to (c:Connection), for example, I don’t fetch c until the user decides to expand b. That makes the query a lot simpler, but still very expensive. Imagine if we expand these Connections a few more hops, I still want to know if the next hop still connects to Origin through all these layers of OtherElements. Which means I need to use a [:FLOW*0..] relation, and that apparently explodes and eats up all the memory.

So I’ve been trying a couple of different approaches:

MATCH (o:Origin {id: $origin})→(:Set)<-[:PART_OF_SET]-(:Element)<--(:OtherElement)<-[:FLOW*0..]-(connectingElement:OtherElement)-[:PART_OF_SET]->(:OtherSet)-->(expandingConnection:Connection {id: $connection})
MATCH (connectingElement)<-[:FLOW]-(:OtherElement)-[:PART_OF_SET]->(:OtherSet)-->(newConnections:Connection)
RETURN newConnections

This is super slow and eats up all memory. I can limit the depth to [Flow*0..3] and that will make it much faster, but limit how far I can go, and I don’t want that.

So now I’m trying to do this a step at a time, dropping all elements I don’t need:

MATCH (expandingConnection:OAR {id: $connection})<--(:OtherSet)<-[:PART_OF_SET]-(connectingElement:OtherElement)
WITH expandingConnection, connectingElement
MATCH ANY (:Origin { id: $origin})--(:Set)<-[:PART_OF_SET]-(:Element)<--(:OtherElement)<-[:FLOW*0..5]-(connectingElement)
WITH expandingConnection, connectingElement
MATCH (connectingElement)<-[:FLOW]-(:OtherElement)-[:PART_OF_SET]->(:OtherSet)
-->(newConnections:Connection)
RETURN newConnections

But this is still surprisingly slow. I’m using the specific connectingElements here. I can’t imagine their number explodes beyond the number there actually are. So why is the variable length relationship still so slow? The system knows the begin and end of this variable length relationship. I’m using ANY because I only want to verify that the connection exist (I don’t even need the shortest connection, just any). And still this is slow.

I’ve ran this through the profiler of course, and the ANY shortest path gives 1262 rows (what does that mean here?) and after that 2044 rows to the OtherSet the connectingElement belongs to. Why that many? It shouldn’t even be looking at the OtherSet at this point, because I only needed that to find the connectingElements, which I got before I started on the path.

After that, the profiler goes through a lot of steps I don’t understand, each with 2440 rows, when all we should be doing at this point is just make that one final hop. I clearly don’t understand the profiler output.

So what are ways to improve this? And what does the profiler output really mean?

The whole query took over a minute.

I’ll include the plan output:

Cypher 5

Planner COST

Runtime SLOTTED

Runtime version 2025.07

+-------------------------------------+----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator                            | Id | Details                                                                                             | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------------------+----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults                     |  0 | origin, newConnection                                                                               |              0 | 2440 |   41480 |              0 |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                             |  1 | newConnection:Connection                                                                            |              0 | 2440 |    2440 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Expand(All)                        |  2 | (anon_13)-->(newConnection)                                                                         |              0 | 2440 |   10642 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                             |  3 | anon_13:OtherSet                                                                                    |              0 | 2440 |    2440 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Expand(All)                        |  4 | (anon_11)-[:PART_OF_SET]->(anon_13)                                                                 |              0 | 2440 |   32738 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                             |  5 | anon_11:OtherElement                                                                                |              0 | 2440 |    2440 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Expand(All)                        |  6 | (be)<-[:FLOW]-(anon_11)                                                                             |              0 | 2440 |    3578 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeHashJoin                       |  7 | anon_1                                                                                              |              0 |  159 |       0 |           4728 |                    0/0 |
| |\                                  +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +Expand(All)                      |  8 | (be)-[:PART_OF_SET]->(anon_1)                                                                       |             11 | 2044 |   20465 |                |                    0/0 |
| | |                                 +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +StatefulShortestPath(All, Trail) |  9 | ANY 1 (u)-[`anon_3`]-(`anon_4`)<-[`anon_5`]-(`anon_6`)<-[`anon_7`]-(`anon_8`)<-[`anon_9`*0..5]-(connectingElement)| 4| 1262 |   53542 |        4207071 |                    0/0 |
| | |                                 |    |         expanding from: u                                                                           |                |      |         |                |                        |
| | |                                 |    |     inlined predicates: anon_37:UC_LINKED_TO_IDS                                                    |                |      |         |                |                        |
| | |                                 |    |                         anon_38:IDS                                                                 |                |      |         |                |                        |
| | |                                 |    |                         anon_39:PART_OF_SET                                                         |                |      |         |                |                        |
| | |                                 |    |                         anon_40:IDE                                                                 |                |      |         |                |                        |
| | |                                 |    |                         anon_41:BDL_GDL_MAPPING                                                     |                |      |         |                |                        |
| | |                                 |    |                         anon_42:BdlElement                                                          |                |      |         |                |                        |
| | |                                 |    |                         anon_44:BDL_FLOW                                                            |                |      |         |                |                        |
| | |                                 |    |                         be:BdlElement                                                               |                |      |         |                |                        |
| | |                                 +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeek                    | 10 | RANGE INDEX o:Origin(id) WHERE id = $autostring_1                                                   |              1 |    1 |       2 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                             | 11 | anon_1:OtherSet                                                                                     |              0 |   18 |      18 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Expand(All)                        | 12 | (a1)<--(anon_1)                                                                                     |              0 |   18 |      51 |                |                    0/0 |
| |                                   +----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeek                      | 13 | RANGE INDEX expand:Connection(id) WHERE id = $autostring_0, cache[a1.id]                            |              1 |    1 |       2 |                |                    0/0 |
+-------------------------------------+----+-----------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+

Total database accesses: 169838, total allocated memory: 4211823

(I hope I anonymised it all correctly. In the original query, I do specify all my relationship types; I just didn’t want to invent properly anonymised ones for this.)

Note that the total number of newConnections found here was 6. That is not the issue. But it’s ballooning in the middle.

Is there maybe some index I could set that helps with these variable-length paths?

I honestly don’t understand why it takes so long. Note that I’m going against the :FLOW here. Flow can go from an element to any number of other elements, but it’s unlikely (but not impossible) for it to receive flow from more than one element. So I’m going from the wide end of the almost-tree to the narrow end. That shouldn’t be that hard.

A complication is that loops may exist. I don’t know why it would happen, but it’s not impossible that two elements send flow to each other, creating an infinite loop. I can see how this could be a problem for depth-first searching without checking for loops, but I’d really hope neo4j does check for loops in these variable path lengths (especially since I’m asking for the shortest), but also I’d expect it to use breadth-first, in which case there shouldn’t be an issue at all.

Or should there? What if a connectingElement has no connection back to origin? Will the algorithm get stuck endlessly looping around one of cycles? I would hope it detects that. And even with a length higher than 5 (or 7?), this runs out of memory.

I think I figured this out. Put DISTINCT after every WITH. Suddenly it’s blazingly fast. Somehow I ended up with thousands of duplicates, apparently.

I’m still interested in other optimisation suggestions, but DISTINCT is the real life saver here.

My final query:

MATCH (expandingConnection:OAR {id: $connection})<--(:OtherSet)<-[:PART_OF_SET]-(connectingElement:OtherElement)
WITH DISTINCT expandingConnection, connectingElement
MATCH ANY (:Origin { id: $origin})--(:Set)<-[:PART_OF_SET]-(:Element)<--(:OtherElement)<-[:FLOW*0..5]-(connectingElement)
WITH DISTINCT expandingConnection, connectingElement
MATCH (connectingElement)<-[:FLOW]-(:OtherElement)-[:PART_OF_SET]->(:OtherSet)
-->(newConnections:Connection)
RETURN DISTINCT expandingConnection, newConnections