I would expect this on queries with overlapping nodes, where multiple batches try to delete the same node. But in the above trivial case there are no overlapping nodes across batches, so I imagine it results from concurrent attempts to drop edges. That is, if two batches contain nodes that are connected to each other, each batch will try to drop the shared edge, which results in a null pointer error.
Is the above intuition correct? If so, what can I do to mitigate it? And more generally, is there anything I can do to speed up bulk deletes beyond what a non-parallel call to apoc.periodic.iterate can provide?
@benjamin.squire thanks for sharing, there's a lot of food for thought there. I was able to implement a solution following your three-step approach: in parallel, mark all nodes for deletion; in sequence, delete all relationships; in parallel, delete all nodes. Unfortunately, it's about as efficient as my original non-parallelized approach. I'm confused as to why relationships cannot be dropped in parallel. Even when I iterate over non-overlapping subgraphs, using apoc.path.subgraphAll to ensure that all of a subgraph's relationships are deleted in the same thread, I'm still seeing a large number of NullPointerExceptions.
CALL apoc.periodic.iterate(
"MATCH (metadata:METADATA)
WHERE metadata.uri STARTS WITH $sourceUri
RETURN metadata",
"CALL apoc.path.subgraphAll(metadata, {relationshipFilter: 'METADATA>,RELATIONSHIP'}) YIELD relationships
UNWIND relationships as r
DELETE r",
{ batchSize: 1000, iterateList: true, parallel: true, concurrency: 50, params: { sourceUri: $sourceUri } }
)
Can't think through what else could be going on here.
Relationships cannot be dropped in parallel because a relationship, by definition, has a start node and an end node attached to it, and deleting the relationship takes a lock on both of those nodes. If another thread is simultaneously deleting a different relationship attached to one of the same nodes, it cannot acquire the lock on that node at the same time. This is why the deletes in the second part of the iterate must affect different subgraphs: it ensures that two threads never have to lock the same nodes.
If you can arrange your graph so that independent threads never delete relationships attached to the same nodes, then parallel deletes will work. That is, find the subgraphs in the first part of the iterate, then have each thread delete the relationships of a single subgraph in the second part.
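To make the locking argument concrete, here is a minimal Python sketch. It is not Neo4j's actual lock manager; the function names and the toy edge lists are hypothetical, and a relationship is modelled simply as a (start, end) pair. It checks whether two batches of relationship deletes would need to lock the same node, which is the condition that makes them unsafe to run on separate threads.

```python
from typing import Iterable, Set, Tuple

# Toy stand-in for a relationship: (start node id, end node id). Hypothetical model.
Edge = Tuple[str, str]


def nodes_locked(batch: Iterable[Edge]) -> Set[str]:
    """Return every node a batch would lock: both endpoints of each deleted relationship."""
    locked: Set[str] = set()
    for start, end in batch:
        locked.add(start)
        locked.add(end)
    return locked


def batches_conflict(batch_a: Iterable[Edge], batch_b: Iterable[Edge]) -> bool:
    """Two batches conflict if they need a lock on at least one common node."""
    return not nodes_locked(batch_a).isdisjoint(nodes_locked(batch_b))


batch_1 = [("a", "b"), ("b", "c")]
batch_2 = [("c", "d")]  # shares node "c" with batch_1
batch_3 = [("x", "y")]  # shares nothing with batch_1

print(batches_conflict(batch_1, batch_2))  # True
print(batches_conflict(batch_1, batch_3))  # False
```

Note that batch_1 and batch_2 contain no relationship in common and still conflict, because the contention is on the nodes at the ends of the relationships.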
@benjamin.squire ah OK, that explains it. The subgraphs I'm iterating over in my above example ensure that all subgraph nodes are dropped in the same thread. However, because subgraphs are connected through shared nodes, dropping a subgraph resulted in parallel deletes of these shared nodes' relationships (even though the shared nodes were not being deleted themselves). So, computing disjoint subgraphs via apoc.path.subgraphAll is not sufficient to support parallel deletes, though using unionFind to calculate connected components should be (as you mentioned in the other thread).
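As a rough illustration of why connected components help, here is a small union-find sketch in plain Python. It is an assumption-laden toy, not the graph algorithms library's unionFind procedure: the function name and the sample node ids are made up. It groups relationships so that no two groups touch the same node, which means each group could in principle be handed to its own thread.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Toy stand-in for a relationship: (start node id, end node id). Hypothetical model.
Edge = Tuple[Hashable, Hashable]


def group_edges_by_component(edges: Iterable[Edge]) -> List[List[Edge]]:
    """Partition edges into groups, one per connected component of the graph they form."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Register unseen nodes as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: Hashable, b: Hashable) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    edge_list = list(edges)
    for start, end in edge_list:
        union(start, end)

    # Edges whose endpoints share a root belong to the same component.
    groups: Dict[Hashable, List[Edge]] = defaultdict(list)
    for start, end in edge_list:
        groups[find(start)].append((start, end))
    return list(groups.values())


# m1 and m2 are separate starting nodes but both point at "shared",
# so their relationships must land in the same group.
edges = [("m1", "shared"), ("m2", "shared"), ("m3", "n3")]
for group in group_edges_by_component(edges):
    print(group)
```

In this toy input, the subgraphs rooted at m1 and m2 end up in one group because of the shared node, while m3's relationship forms a group of its own.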