cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! Site migration is underway. Phase 2: migrate recent content

Deleting 20k nodes with about 50 rels each takes 3 days

alexchantavy
Node Clone

Hi there,

It is taking my server about 3 days to delete just 20,000 nodes in my graph, and they have at most 50 relationships each.

I am using Neo4j Python driver 1.7.6 to do this operation on an Ubuntu 14.04 LTS server with 32GB RAM running Neo4j Enterprise 3.5.19.

Here's my query:

MATCH (:AWSAccount{id: '1234'})-[:RESOURCE]->(:AWSVpc)<-[:MEMBER_OF_AWS_VPC]-(:EC2Subnet)<-[:PART_OF_SUBNET]-(n:NetworkInterface) 
WHERE n.lastupdated <> {UPDATE_TAG} 
WITH n LIMIT {LIMIT_SIZE}
DETACH DELETE (n)
RETURN COUNT(*) as TotalCompleted

UPDATE_TAG is a timestamp like 1604973243, and LIMIT_SIZE is set to 100. My code deletes nodes that don't have a given timestamp value, and as per this guide, I run the above query to delete in small batches until TotalCompleted returns 0. My problem is that this process takes days. Any ideas on why this is and how I can fix this? The actual Python code that calls this query is here, and although it could clearly be written much more safely, I don't see this slow behavior on any of my other dozens of deletion jobs.

Repro

  1. Before running my delete job, I have 21,209 nodes with a timestamp of 1604836804:
    2X_3_312b7c06594e9fd116d53f3ee16270b4be7e9ec8.png

  2. I run the delete job in my Python script as described above and check the Neo4j web interface after a couple minutes while it continues:
    2X_7_7eeab838b4bbf09ad0d8ba3cea8f9d09eb60dce0.png It seems to have deleted about 17,000 nodes.

  3. The count has been stuck here at 4300 for the past hour and I am very confused. My system memory usage stays around 31% usage out of a total 32GB, CPU hovers around 30%. When I manually run the delete query in the web interface, it works:
    2X_8_89924944597f7dfaf66c251ea079b6080296b303.png

Here's my PROFILE:

I have read these links searching for an answer but have not found one yet:

Neo4j.log and query.log don't show anything. I'd like to avoid APOC if possible because I don't want to force users of my software library to need to install anything other than a fairly vanilla Neo4j setup. Any ideas? Thanks very much in advance.

13 REPLIES 13

benjamin_squire
Graph Voyager

What is your heap and max heap and page cache in Neo4j.conf?

Is there anything in particular about those 4300? They all have at most 50 rels?

You could delete the relationships individually, this might look like:


MATCH (:AWSAccount{id: '1234'})-[:RESOURCE]->(:AWSVpc)<-[:MEMBER_OF_AWS_VPC]-(:EC2Subnet)<-[r:PART_OF_SUBNET]-(n:NetworkInterface) 
WHERE n.lastupdated <> {UPDATE_TAG} 
WITH  n,r {LIMIT_SIZE}
DELETE r SET n:DeleteMe
RETURN COUNT(*) as TotalCompleted

Then run something like

match (n:DeleteMe) Delete n limit {LIMIT_SIZE} RETURN 
COUNT(*) as TotalCompleted

What is your heap and max heap and page cache in Neo4j.conf?

My neo4j.conf has heap size:

dbms.memory.heap.max_size=5g

and cache size in the file says...

# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 50% of RAM minus the max Java heap size.
#dbms.memory.pagecache.size=10g

Admittedly I've never tuned these values. Any recommendations?

Is there anything in particular about those 4300? They all have at most 50 rels?

Nothing in particular, they have about 12 fields each and they all have at most 50 rels.

I would change that to explicit settings each of maybe 10g heap and 10g page cache. There are two settings for heap, change them both.

I upped the heap values to 10g but accidentally forgot to set pagecache - still hanged for half an hour with 8000 nodes left. I'm doing another run with pagecache set correctly and will report back.

I re-ran it with the new cache and heap settings and it's still stuck after 8 hours. Before I rewrite some code and give APOC a try, any other ideas?

Cobra
Ninja
Ninja

Hello @alexchantavy

You can use APOC:

CALL apoc.periodic.iterate('MATCH (n) RETURN n', 'DETACH DELETE n', {batchSize: 1000})

Regards,
Cobra

You can use APOC

Hi, as said in my post, I'd like to try to avoid apoc if possible. I'll give it a try if nothing else works though.

clem
Graph Steward

Do you have indices on AWSAccount.id and NetworkInterface.lastupdated? Without indexes, a linear search would need to be done which would be painfully slow (depending on many nodes there are in the DB.)

Is there a reason why AWSAccount.id are strings instead of numbers? Numbers would be faster.

NOTE: indexes are not a magic bullet. Don't index everything! Only index things that you are searching on (e.g. with conditions) or where there aren't many different values (e.g. booleans).

See: https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/

Do you have indices on AWSAccount.id and NetworkInterface.lastupdated ?

I have indexes on AWSAccount.id but not on {any label}.lastupdated. It's still odd how none of my other labels have this problem, but thanks for the idea: I'll do some testing with an index on lastupdated for just the NetworkInterface.

Is there a reason why AWSAccount.id are strings instead of numbers? Numbers would be faster.

Not really other than "someone else decided this a long time ago and now that's how it is" .

I think you can change them all to integers. (Not too hard.).

You might experiment with a clone of a DB using ints for id's instead of strings. Obviously, comparing two integers can be done in a few instructions, where as a string comparison is a loop. (To make a hash out of string, you still need a loop.)

I don't know what sorts of Java profiling tools can be used on Neo4J but maybe those can be used to find out where the bottlenecks.

alexchantavy
Node Clone

No luck, it still hangs short of finishing the last 4000 nodes. I looked closer and it appears 5 of them have 50 rels, and the rest have fewer than 5 rels each. All of them have about 12 properties each.

I'm going to go stare at code some more

benjamin_squire
Graph Voyager

At this point I don't think APOC will help you as the issue is fundamentally due to the node in particular you are deleting, not the speed. You should try and maybe delete a few of them manually one at a time.

Did you try deleting the rels independently of the Nodes?

clem
Graph Steward

Have you tried PROFILE? I haven't tried it but it looks like it could help: