Parallel deletes with apoc.periodic.iterate

Running deletes in parallel apoc.periodic.iterate calls results in a high number of java.lang.NullPointerException errors. E.g.,

CALL apoc.periodic.iterate(
  "MATCH (n { foo: 'bar' }) RETURN n",
  "DETACH DELETE n",
  { batchSize: 1000, parallel: true, concurrency: 50, retries: 3 }
)

{ "batch": {"total":538,"committed":469,"failed":69,"errors":{"java.lang.NullPointerException":69}}, ... }

I would expect this on queries with overlapping nodes, where multiple batches try to delete the same node. But in the trivial case above there are of course no overlapping nodes across batches, so I suspect the errors come from concurrent attempts to delete relationships. I.e., if two batches contain nodes that are connected, each batch will try to delete the shared relationship, resulting in a null pointer error.

Is the above intuition correct? If so, what can I do to mitigate? And more generally, is there anything I can do to speed up bulk deletes beyond what a non-parallel call to apoc.periodic.iterate can provide?

These were my attempts. No perfect solution yet but MH has some ideas.


@benjamin.squire thanks for sharing--a lot of food for thought there. I was able to implement a solution following your third method: in parallel, mark all nodes for deletion; in sequence, delete all relationships; in parallel, delete all nodes. Unfortunately, it's about as efficient as my original non-parallelized approach. I'm confused as to why relationships cannot be dropped in parallel. Even when iterating over non-overlapping subgraphs using apoc.path.subgraphAll to ensure that all subgraph relationships are deleted in the same thread, I'm still seeing a large number of NullPointerExceptions.

CALL apoc.periodic.iterate(
  "MATCH (metadata:METADATA)
  WHERE metadata.uri STARTS WITH $sourceUri
  RETURN metadata",
  "CALL apoc.path.subgraphAll(metadata, {relationshipFilter: 'METADATA>,RELATIONSHIP'}) YIELD relationships
  UNWIND relationships as r
  DELETE r",
  { batchSize: 1000, iterateList: true, parallel: true, concurrency: 50, params: { sourceUri: $sourceUri } }
)

Can't think through what else could be going on here.
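
For reference, the three-phase version looked roughly like this (the :ToDelete marker label and the { foo: 'bar' } predicate are placeholders, not my actual schema):

// 1) In parallel: mark every node targeted for deletion
CALL apoc.periodic.iterate(
  "MATCH (n { foo: 'bar' }) RETURN n",
  "SET n:ToDelete",
  { batchSize: 1000, parallel: true, concurrency: 50 });

// 2) In sequence: delete all relationships touching marked nodes
CALL apoc.periodic.iterate(
  "MATCH (n:ToDelete)-[r]-() RETURN DISTINCT r",
  "DELETE r",
  { batchSize: 1000, parallel: false });

// 3) In parallel: delete the now-disconnected nodes
CALL apoc.periodic.iterate(
  "MATCH (n:ToDelete) RETURN n",
  "DELETE n",
  { batchSize: 1000, parallel: true, concurrency: 50 });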

Relationships cannot be dropped in parallel because a relationship, by definition, has a start node and an end node, and deleting the relationship takes locks on both of those nodes. If another thread is simultaneously deleting a relationship attached to one of the same nodes, it cannot acquire that lock. This is why you have to make sure the deletes in the second part of the iterate affect different subgraphs, so that two threads never need to lock the same nodes.

If you can arrange things so that independent threads never delete relationships attached to the same nodes, then parallel deletes will work. I.e., find subgraphs in the first portion of the iterate, then delete the relationships for a given subgraph entirely within one thread in the second portion.
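
As a sketch (assuming each root node anchors a fully disjoint subgraph; the label is a placeholder), that would look something like:

CALL apoc.periodic.iterate(
  "MATCH (root:METADATA) RETURN root",
  "CALL apoc.path.subgraphAll(root, {}) YIELD nodes
   UNWIND nodes AS n
   DETACH DELETE n",
  { batchSize: 1, parallel: true, concurrency: 50 })

With batchSize: 1, each thread handles one subgraph at a time, so as long as no two subgraphs share a node, no two threads ever contend for the same locks.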

@benjamin.squire ah OK, that explains it. The subgraphs I'm iterating over in my above example ensure that all subgraph nodes are dropped in the same thread. However, because subgraphs are connected through shared nodes, dropping a subgraph resulted in parallel deletes of these shared nodes' relationships (even though the shared nodes were not being deleted themselves). So, computing disjoint subgraphs via apoc.path.subgraphAll is not sufficient to support parallel deletes, though using unionFind to calculate connected components should be (as you mentioned in the other thread).
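
A rough sketch of the unionFind route (using the Graph Algorithms library's algo.unionFind; procedure and parameter names differ by version, and newer releases use WCC in GDS instead):

// Write each node's connected-component id to a property
CALL algo.unionFind('METADATA', 'RELATIONSHIP', { write: true, partitionProperty: 'component' });

// Then delete one whole component per batch, in parallel
CALL apoc.periodic.iterate(
  "MATCH (n:METADATA) RETURN DISTINCT n.component AS component",
  "MATCH (m:METADATA { component: component }) DETACH DELETE m",
  { batchSize: 1, parallel: true, concurrency: 50 })

Since connected components are disjoint by construction, no two threads should ever need to lock the same node.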

Thanks for the feedback.