It is taking my server about 3 days to delete just 20,000 nodes in my graph, and they have at most 50 relationships each.
I am using Neo4j Python driver 1.7.6 to do this operation on an Ubuntu 14.04 LTS server with 32GB RAM running Neo4j Enterprise 3.5.19.
Here's my query:
MATCH (:AWSAccount{id: '1234'})-[:RESOURCE]->(:AWSVpc)<-[:MEMBER_OF_AWS_VPC]-(:EC2Subnet)<-[:PART_OF_SUBNET]-(n:NetworkInterface)
WHERE n.lastupdated <> {UPDATE_TAG}
WITH n LIMIT {LIMIT_SIZE}
DETACH DELETE (n)
RETURN COUNT(*) as TotalCompleted
UPDATE_TAG is a timestamp like 1604973243, and LIMIT_SIZE is set to 100. My code deletes nodes that don't have a given timestamp value, and as per this guide, I run the above query to delete in small batches until TotalCompleted returns 0. My problem is that this process takes days. Any ideas on why this happens and how I can fix it? The actual Python code that calls this query is here, and although it could clearly be written much more safely, I don't see this slow behavior on any of my other dozens of deletion jobs.
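For context, the calling loop is essentially shaped like this (a minimal sketch rather than the actual library code; the connection details and the run_cleanup name are illustrative):
from neo4j import GraphDatabase

DELETE_QUERY = """
MATCH (:AWSAccount{id: '1234'})-[:RESOURCE]->(:AWSVpc)<-[:MEMBER_OF_AWS_VPC]-(:EC2Subnet)<-[:PART_OF_SUBNET]-(n:NetworkInterface)
WHERE n.lastupdated <> {UPDATE_TAG}
WITH n LIMIT {LIMIT_SIZE}
DETACH DELETE (n)
RETURN COUNT(*) as TotalCompleted
"""

def run_cleanup(driver, update_tag, limit_size=100):
    # Repeat the batched delete until a batch removes nothing.
    with driver.session() as session:
        while True:
            result = session.run(DELETE_QUERY, UPDATE_TAG=update_tag, LIMIT_SIZE=limit_size)
            if result.single()["TotalCompleted"] == 0:
                break

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # illustrative credentials
run_cleanup(driver, update_tag=1604973243)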
Repro
Before running my delete job, I have 21,209 nodes with a timestamp of 1604836804.
I run the delete job in my Python script as described above and check the Neo4j web interface after a couple of minutes while it continues.
It seems to have deleted about 17,000 nodes.
The count has been stuck at 4,300 for the past hour and I am very confused. My system memory usage stays around 31% of the 32GB total, and CPU hovers around 30%. When I manually run the delete query in the web interface, it works.
Neo4j.log and query.log don't show anything. I'd like to avoid APOC if possible because I don't want to force users of my software library to need to install anything other than a fairly vanilla Neo4j setup. Any ideas? Thanks very much in advance.
What are your heap, max heap, and page cache settings in neo4j.conf?
Is there anything in particular about those 4,300? Do they all have at most 50 rels?
You could delete the relationships separately first; this might look like:
MATCH (:AWSAccount{id: '1234'})-[:RESOURCE]->(:AWSVpc)<-[:MEMBER_OF_AWS_VPC]-(:EC2Subnet)<-[r:PART_OF_SUBNET]-(n:NetworkInterface)
WHERE n.lastupdated <> {UPDATE_TAG}
WITH n, r LIMIT {LIMIT_SIZE}
DELETE r SET n:DeleteMe
RETURN COUNT(*) as TotalCompleted
Then run something like:
MATCH (n:DeleteMe)
WITH n LIMIT {LIMIT_SIZE}
DETACH DELETE n
RETURN COUNT(*) as TotalCompleted
(DETACH DELETE so it still succeeds if a node has relationships of other types left.)
What are your heap, max heap, and page cache settings in neo4j.conf?
My neo4j.conf has heap size:
dbms.memory.heap.max_size=5g
and the page cache section in the file says:
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 50% of RAM minus the max Java heap size.
#dbms.memory.pagecache.size=10g
Admittedly I've never tuned these values. Any recommendations?
Is there anything in particular about those 4,300? Do they all have at most 50 rels?
Nothing in particular; they have about 12 fields each and at most 50 rels.
I upped the heap values to 10g but accidentally forgot to set the pagecache; it still hung for half an hour with 8,000 nodes left. I'm doing another run with the pagecache set correctly and will report back.
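For reference, the memory lines in neo4j.conf for this run look something like the following (illustrative values on the 32GB box, not a tuned recommendation):
dbms.memory.heap.initial_size=10g
dbms.memory.heap.max_size=10g
dbms.memory.pagecache.size=10g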
Do you have indices on AWSAccount.id and NetworkInterface.lastupdated? Without indexes, a linear search would need to be done, which would be painfully slow (depending on how many nodes there are in the DB).
Is there a reason why AWSAccount.id values are strings instead of numbers? Numbers would be faster.
NOTE: indexes are not a magic bullet. Don't index everything! Only index things that you are searching on (e.g. with conditions) or where there aren't many different values (e.g. booleans).
Do you have indices on AWSAccount.id and NetworkInterface.lastupdated?
I have indexes on AWSAccount.id but not on {any label}.lastupdated. It's still odd how none of my other labels have this problem, but thanks for the idea: I'll do some testing with an index on lastupdated for just the NetworkInterface.
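For anyone following along, on Neo4j 3.5 that index can be created with something like:
CREATE INDEX ON :NetworkInterface(lastupdated)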
Is there a reason why AWSAccount.id values are strings instead of numbers? Numbers would be faster.
Not really, other than "someone else decided this a long time ago and now that's how it is".
I think you can change them all to integers. (Not too hard.)
You might experiment with a clone of the DB using ints for IDs instead of strings. Obviously, comparing two integers can be done in a few instructions, whereas a string comparison is a loop. (To make a hash out of a string, you still need a loop.)
I don't know what sorts of Java profiling tools can be used on Neo4j, but maybe those could be used to find out where the bottlenecks are.
No luck; it still hangs short of finishing the last 4,000 nodes. I looked closer and it appears 5 of them have 50 rels, while the rest have fewer than 5 rels each. All of them have about 12 properties.
At this point I don't think APOC will help you, as the issue seems fundamentally due to the particular nodes you are deleting, not the deletion speed in general. You should try deleting a few of them manually, one at a time.
Did you try deleting the rels independently of the Nodes?