What is the most efficient way to Delete 20 Million Nodes from a database with over 7.5 Billion Nodes and 15 Billion relationships?

Hi Benjamin,

Did AWS get back to you on why the disk was being underutilized? We are seeing 1/10th the max throughput on our Neo4j box, even though iotop says 99% for most threads.

Any ideas?

Yes, get a different EC2 type. After working for a few months on various boxes, I found the only way Neo4j is performant is on i3-type instances. EBS will not perform well under high I/O loads, and AWS throttles it based on the size of the box, i.e. 16xlarge types get less of a throttle. Local instance storage, although ephemeral and a bit of a pain to back up to EBS/S3 as a gzip archive, is 10-20x faster. If you don't want to use i3, then look at the Nitro-based systems; we did not go that route because the m5 and c5 Nitro instance types didn't offer enough RAM and had too much CPU for us to use properly. i3 or i3en are your best bets. I get reads and writes exceeding 1000M/s, whereas with EBS provisioned at 32,000 IOPS (which is very expensive, I might add) we were only getting 100K/s, if that.
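A minimal sketch of the kind of "back up the ephemeral store to S3 as a gzip" job described above — the paths, bucket name, and use of pigz are my assumptions, not details from this thread:

```shell
#!/bin/sh
# Sketch: compress the Neo4j store living on local NVMe and ship it to S3.
# DB_DIR and the bucket name are placeholders.
DB_DIR=/mnt/nvme0/neo4j/data
STAMP=$(date +%F)
OUT="/tmp/neo4j-${STAMP}.tar.gz"

# pigz parallelizes gzip across cores; plain gzip also works but is single-threaded
tar -cf - "$DB_DIR" | pigz > "$OUT"

# Copy the archive off the ephemeral disk before the instance can disappear
aws s3 cp "$OUT" "s3://my-neo4j-backups/${STAMP}.tar.gz"
```

Run it from cron nightly; the tar-through-compressor pipe avoids needing scratch space for an uncompressed copy of a multi-TB store.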


Thanks Ben,

We have a database that is 2.3TB in size so your suggestion will really help us manage the scale.

Thanks Benjamin, that helps a lot. We will be using equivalent machines on Azure (Ls_v2). Out of curiosity, how do you manage your database backups? If I understand correctly, having a non-EBS store means that your data is much more susceptible to data loss.

Yes, if the EC2 instance freezes or bugs out, the DB needs to be rebuilt from scratch. We are working on our own auto-deployment, but at a minimum, to keep data backed up, compress it daily with pigz or zstd and upload it to S3. I ran a compression test between zstd and gzip with a 170 GB tar file:
zstd - start 16:58, end 17:28, size 58G
gzip - start 18:01, end 20:38, size 56G
Note that as you hit the TB range it will inevitably get slower.
Alternatively, you could just upload the database directly to S3. As long as it is stored in the cloud somewhere, it is not too bad.
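If you want to rerun that gzip-vs-zstd comparison yourself, here is a small sketch on synthetic data; the flags are sensible defaults I'd reach for, not necessarily what was used in the test above:

```shell
#!/bin/sh
# Compare gzip and zstd on the same compressible input.
# sample.txt stands in for the 170 GB tar file from the test above.
seq 1 500000 > sample.txt            # a few MB of compressible text

time gzip -k sample.txt              # -k keeps the original; produces sample.txt.gz
# Only run zstd if it's installed
command -v zstd >/dev/null && time zstd -k sample.txt   # produces sample.txt.zst

ls -lh sample.txt sample.txt.gz
```

The numbers above match zstd's general trade-off: at default levels it compresses far faster than gzip for a ratio in the same ballpark.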

Ah, we used to do something similar, but with 2.3 TB of data it's painfully slow. Now we run neo4j-backup on another machine and snapshot that disk. We keep 3 snapshots around in case one of them is corrupt.

How does your neo4j-backup-on-another-machine-plus-snapshot setup work? Could you describe the process? I'm worried about backups; we have not scaled to the full DB size yet, which makes me nervous for when we reach the 2+ TB mark.

The backup tool can be run on an external machine, as long as the backup port of your production Neo4j is accessible to that machine. Most of the workflow is managed through a bunch of Lambdas, but in a nutshell here is what happens:

  1. Allocate a new EC2 machine with Neo4j installed
  2. Mount a new volume with enough space to store a full backup of production Neo4j
  3. Initialize the backup process by calling neo4j-backup -from $EXTERNAL_IP on the newly allocated box
  4. Wait for the backup process to finish; once done, initiate EBS snapshotting
  5. Wait for the snapshot to finish, then deallocate the machine and destroy the EBS volume

You can choose to keep the backups lying around on an EBS volume rather than taking a snapshot. We mainly snapshot to save cost, as 4 TB of SSD is much more expensive than storing a 2.3 TB snapshot. If you have a large database, you also want to have decent bandwidth on your backup machine.
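The five steps above map roughly onto AWS CLI calls like the following. This is only a sketch: the AMI, instance type, volume size, and availability zone are placeholders, the real workflow is driven by Lambdas rather than one script, and the neo4j-backup invocation just mirrors the one quoted above (newer Neo4j versions use neo4j-admin backup instead):

```shell
#!/bin/sh
# Step 1: allocate a backup box (AMI and type are placeholders)
IID=$(aws ec2 run-instances --image-id ami-XXXXXXXX --instance-type i3.2xlarge \
        --query 'Instances[0].InstanceId' --output text)
aws ec2 wait instance-running --instance-ids "$IID"

# Step 2: create and attach a volume big enough for a full backup
VID=$(aws ec2 create-volume --size 4000 --availability-zone us-east-1a \
        --query 'VolumeId' --output text)
aws ec2 attach-volume --volume-id "$VID" --instance-id "$IID" --device /dev/sdf

# Step 3: on the backup box (via SSH/SSM), pull a full backup from production;
# EXTERNAL_IP is the production machine's backup address
neo4j-backup -from "$EXTERNAL_IP" -to /mnt/backup/graph.db

# Step 4: snapshot the backup volume once neo4j-backup exits
SNAP=$(aws ec2 create-snapshot --volume-id "$VID" \
        --query 'SnapshotId' --output text)
aws ec2 wait snapshot-completed --snapshot-ids "$SNAP"

# Step 5: tear everything down; only the snapshot is kept
aws ec2 terminate-instances --instance-ids "$IID"
aws ec2 wait instance-terminated --instance-ids "$IID"
aws ec2 delete-volume --volume-id "$VID"
```

Waiting for instance termination before deleting the volume matters: EBS won't delete a volume that is still attached.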
