Delete in transactions running out of memory

Hi,

I have a lab test DB with approximately 600m nodes and 2b relationships. I have a clean-up task that deletes dense nodes: the nodes are flagged by one process and then cleaned up by another.

The following Cypher is run by the clean-up process:

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
WITH DISTINCT(i)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
CALL (p) { 
	DETACH DELETE p
} IN TRANSACTIONS OF 1000 ROWS
OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
CALL (c) { 
	DELETE c
} IN TRANSACTIONS OF 1000 ROWS
SET i.deleted=true

In this particular case there is only 1 intermediary day, but it is connected to 40m nodes.
The query errors out with the following exception:

Neo4jError: Neo.TransientError.General.MemoryPoolOutOfMemoryError (The allocation of an extra 128.0 KiB would use more than the limit 10.0 GiB. Currently using 10.0 GiB. dbms.memory.transaction.total.max threshold reached) 

I thought that using IN TRANSACTIONS would stream the results of the match that hits 40m nodes through in batches, but this does not seem to be the case.
What am I missing?

Thanks

Hi @wormlb
If there are lots of DISTINCT (i) records, try this code. Start with EXPLAIN to verify.

EXPLAIN
CALL () {
  MATCH (n:Identity {identifierValue: 'identifier1'})-[:ON_20250628]-(i:IntermediaryDay)
  WITH DISTINCT (i)
  CALL (i) {
    OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
    DETACH DELETE p
  }
  CALL (i) {
    OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
    DELETE c
  }
  SET i.deleted = true
} IN TRANSACTIONS OF 1000 ROWS;

or

EXPLAIN
CALL () {
  MATCH (n:Identity {identifierValue: 'identifier1'})-[:ON_20250628]-(i:IntermediaryDay)
  WITH DISTINCT (i)
  OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
  OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
  DETACH DELETE p
  DELETE c
  SET i.deleted = true
} IN TRANSACTIONS OF 1000 ROWS;

Have you tried doing it in batches (maybe as low as 10 nodes if you have such highly connected ones)?

CALL apoc.periodic.iterate(
  "MATCH (n:Identity {identifierValue:'identifier1'}) 
   MATCH (n)-[:ON_20250628]-(i:IntermediaryDay) 
   RETURN DISTINCT i",
  "OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p) 
   CALL (p) { 
      DETACH DELETE p 
   } IN TRANSACTIONS OF 1000 ROWS 
   OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-() 
   CALL (c) { 
      DELETE c 
   } IN TRANSACTIONS OF 1000 ROWS 
   SET i.deleted=true",
  {batchSize: 10000, iterateList: true}
)

Hi guys, thanks for the replies.

@koji as stated there is only 1 DISTINCT (i) record in this case. It is the OPTIONAL MATCH which hits a lot of nodes. Wouldn't putting the MATCH inside CALL try to load all 40m and then DETACH DELETE them for each i? I have got myself twisted up with this logic before, but that is my current understanding.

@joshcornejo yes, I have tried batches of 1 and get the same result. Running an EXPLAIN claims that only 32 bytes are loaded (a pointer, I guess). But I suspect that somewhere deep down Neo4j actually has to load everything that comes back from the MATCH before it starts batching it; otherwise there would be no reason for SKIP/LIMIT to exist? I don't know the inner workings at this level :disappointed_face:
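For reference, the LIMIT-based batching I'm thinking of would look something like this (just a sketch using the labels from my query above; the batch size is arbitrary and it would have to be re-run until nothing more is deleted):

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
MATCH (i)-[:RELATIONSHIP_A]-(p)
WITH p LIMIT 10000
DETACH DELETE p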
As for APOC, Neo4j Professional Services recommended that I avoid it, so I don't even have it installed.

I managed to get this to work, at least on the data set I had, by splitting it into two queries, effectively this:

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
WITH DISTINCT(i)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
CALL (p) { 
	DETACH DELETE p
} IN TRANSACTIONS OF 1000 ROWS;

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
WITH DISTINCT(i)
OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
CALL (c) { 
	DELETE c
} IN TRANSACTIONS OF 1000 ROWS
SET i.deleted=true

My best guess is that by carrying i across 2 call blocks it is somehow fully queried and stored in transaction memory. But really, I have no idea. I would love to know though, otherwise I have just doubled my search space instead of solving the problem.

Maybe you'll need to upgrade the RAM or change the configuration?

I also have no idea whether fetching the entire node vs fetching just the elementId would make any difference to memory utilisation.

I'm currently sitting at 10GB of transaction memory, which is what is reported as hitting the limit. I can't go much past this, because our delivered systems would then need upgraded ($$$) licenses.

Honestly, I would be happy to confirm that my understanding of the CALL IN TRANSACTIONS pattern is correct and I haven't made a stupid mistake there. Then I can start looking at other ways to solve this (changing the model etc).

It may be that chaining the two queries causes a cartesian product after the second match, exploding the number of rows. Adding DISTINCT(i) after the first subquery reduces the rows to just the distinct values of i. I added another DISTINCT(i) after the second subquery to make the SET more efficient. Maybe this works.

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
CALL (p) { 
	DETACH DELETE p
} IN TRANSACTIONS OF 1000 ROWS
WITH DISTINCT(i)
OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
CALL (c) { 
	DELETE c
} IN TRANSACTIONS OF 1000 ROWS
WITH DISTINCT(i)
SET i.deleted=true

Did adding DISTINCT make a difference, as Gary suggested?

Another thing you could try is executing the two delete subqueries using CALL {} IN CONCURRENT TRANSACTIONS. There is a brief discussion of using this approach to delete large datasets towards the end of this blog.
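If your Neo4j version supports it, the only change is adding a concurrency count to the subquery clause. A rough sketch for the first delete (the concurrency of 4 is arbitrary):

MATCH (n:Identity {identifierValue:'identifier1'})-[:ON_20250628]-(i:IntermediaryDay)
WITH DISTINCT i
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
CALL (p) {
  DETACH DELETE p
} IN 4 CONCURRENT TRANSACTIONS OF 1000 ROWS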

Unfortunately I had run out of test data by that stage, and I can't bring our lab system to its knees again. But considering that there was only 1 object in the DB being matched to i, would adding DISTINCT have made a difference?
The cartesian product is an interesting angle I had not considered. Splitting it into 2 queries (as in my update) ran through and deleted everything successfully, so perhaps there is something there.

Let's look at the query in its reduced form to see how the data expands. Here is the test data I used. In this case, i is unique.

CREATE (n:Identity {identifierValue:'identifier1'})
CREATE (n)-[:ON_20250628]->(i:IntermediaryDay)
WITH i
UNWIND range(0, 10) AS index
CREATE (i)-[:RELATIONSHIP_A]->()
CREATE (i)-[c:RELATIONSHIP_B]->()

If we remove the delete operations and look at only the match results, we see that the first match produces 11 rows (which equals the number of RELATIONSHIP_A relationships created).
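That match-only query is something like:

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
RETURN i, p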

Now, if we add in the second match, we get 121 rows back (11 x 11):
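That is, with both optional matches chained together:

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
RETURN i, p, c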

Let's just look at returning i, as that is effectively all your query needs to perform the deletes. Again, we get 121 rows. This makes sense because of the cartesian product formed between the two match operations, and those rows are all duplicate data.

The second query depends only on the value of i, not on p, so we can reduce the rows down to just what the second query needs with a distinct operation.
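In other words, something like this, with the distinct placed between the two matches:

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
WITH DISTINCT i
OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
RETURN i, c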

As you can see, the 121 rows are reduced to 11, which equals the number of RELATIONSHIP_B relationships created. The number of distinct values of i fed to the second match is just one; the second match then expands the row count back to 11.

We can reduce the output again with a second distinct.
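Roughly, this brings the output back down to a single row for i:

MATCH (n:Identity {identifierValue:'identifier1'})
MATCH (n)-[:ON_20250628]-(i:IntermediaryDay)
OPTIONAL MATCH (i)-[:RELATIONSHIP_A]-(p)
WITH DISTINCT i
OPTIONAL MATCH (i)-[c:RELATIONSHIP_B]-()
WITH DISTINCT i
RETURN i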

Considering what you described for your nodes and relationships, the cartesian product formed after the second match could have exploded your row count with a lot of redundant data, causing your out-of-memory issues. The out-of-memory error must have occurred following the second match, as each match/delete worked in isolation when you separated them into two independent queries. That separation removed the cartesian product.

This makes sense. Thanks very much!
