What is the most efficient way to Delete 20 Million Nodes from a database with over 7.5 Billion Nodes and 15 Billion relationships?

Hi Benjamin,

Did AWS get back to you on why the disk was being underutilized? We are seeing 1/10th the max throughput on our Neo4j box, even though iotop says 99% for most threads.

Any ideas?

Yes, get a different EC2 type. After working for a few months on various boxes, I found the only way Neo4j is performant is on i3-type instances. EBS will not perform well under high I/O loads, and AWS throttles it based on the size of the box, i.e. 16xlarge types get less of a throttle. Local instance storage, although ephemeral and a bit of a pain to back up to EBS/S3 as a gzip archive, is 10-20x faster. If you don't want to use i3, then look at the Nitro-based systems; we did not go that route because the m5 and c5 Nitro instance types didn't offer enough RAM and had too much CPU for us to use properly. i3 or i3en are your best bets. I get reads and writes exceeding 1000M/s, whereas with EBS provisioned at 32,000 IOPS (which is very expensive, I might add) we were only getting 100K/s, if that.
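A minimal sketch of the kind of "back up the ephemeral store to S3 as a gzip" job described above — the paths, bucket name, and use of pigz are my assumptions, not details from this thread:

```shell
#!/bin/sh
# Sketch: compress the Neo4j store living on local NVMe and ship it to S3.
# DB_DIR and the bucket name are placeholders.
DB_DIR=/mnt/nvme0/neo4j/data
STAMP=$(date +%F)
OUT="/tmp/neo4j-${STAMP}.tar.gz"

# pigz parallelizes gzip across cores; plain gzip also works but is single-threaded
tar -cf - "$DB_DIR" | pigz > "$OUT"

# Copy the archive off the ephemeral disk before the instance can disappear
aws s3 cp "$OUT" "s3://my-neo4j-backups/${STAMP}.tar.gz"
```

Run it from cron nightly; the tar-through-compressor pipe avoids needing scratch space for an uncompressed copy of a multi-TB store.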


Thanks Ben,

We have a database that is 2.3TB in size so your suggestion will really help us manage the scale.

Thanks Benjamin, that helps a lot. We will be using equivalent machines on Azure (Ls_v2). Out of curiosity, how do you manage your database backups? If I understand correctly, having a non-EBS store means that your data is much more susceptible to data loss.

Yes, if the EC2 instance freezes or bugs out, the DB needs to be rebuilt from scratch. We are working on our own auto-deployment, but at a minimum, to keep data backed up, compress it daily with pigz or zstd and upload it to S3. I ran a compression test between zstd and gzip with a 170 GB tar file:
zstd - start 16:58, end 17:28, size 58G
gzip - start 18:01, end 20:38, size 56G
Note that as you hit the TB range it will inevitably get slower.
Alternatively, you could just upload the database directly to S3. As long as it is stored in the cloud somewhere, it is not too bad.
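If you want to rerun that gzip-vs-zstd comparison yourself, here is a small sketch on synthetic data; the flags are sensible defaults I'd reach for, not necessarily what was used in the test above:

```shell
#!/bin/sh
# Compare gzip and zstd on the same compressible input.
# sample.txt stands in for the 170 GB tar file from the test above.
seq 1 500000 > sample.txt            # a few MB of compressible text

time gzip -k sample.txt              # -k keeps the original; produces sample.txt.gz
# Only run zstd if it's installed
command -v zstd >/dev/null && time zstd -k sample.txt   # produces sample.txt.zst

ls -lh sample.txt sample.txt.gz
```

The numbers above match zstd's general trade-off: at default levels it compresses far faster than gzip for a ratio in the same ballpark.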

Ah, we used to do something similar, but with 2.3 TB of data it's painfully slow. Now we run neo4j-backup on another machine and snapshot that disk. We keep 3 snapshots around in case one of them is corrupt.

How does your neo4j-backup-on-another-machine-plus-snapshot setup work? Could you describe the process? I'm worried about backups; we have not scaled to the full DB size yet, which makes me nervous for when we reach the 2+ TB mark.

The backup tool can be run on an external machine, as long as the backup port of your production Neo4j is accessible to that machine. Most of the workflow is managed through a bunch of Lambdas, but in a nutshell here is what happens:

  1. Allocate a new EC2 machine with Neo4j installed
  2. Mount a new volume with enough space to store a full backup of production Neo4j
  3. Initialize the backup process by calling neo4j-backup -from $EXTERNAL_IP on the newly allocated box
  4. Wait for the backup process to finish; once done, initiate EBS snapshotting
  5. Wait for the snapshot to finish, then deallocate the machine and destroy the EBS volume

You can choose to keep the backups lying around on an EBS volume rather than taking a snapshot. We mainly snapshot to save cost, as 4 TB of SSD is much more expensive than storing a 2.3 TB snapshot. If you have a large database, you also want to have decent bandwidth on your backup machine.
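The five steps above map roughly onto AWS CLI calls like the following. This is only a sketch: the AMI, instance type, volume size, and availability zone are placeholders, the real workflow is driven by Lambdas rather than one script, and the neo4j-backup invocation just mirrors the one quoted above (newer Neo4j versions use neo4j-admin backup instead):

```shell
#!/bin/sh
# Step 1: allocate a backup box (AMI and type are placeholders)
IID=$(aws ec2 run-instances --image-id ami-XXXXXXXX --instance-type i3.2xlarge \
        --query 'Instances[0].InstanceId' --output text)
aws ec2 wait instance-running --instance-ids "$IID"

# Step 2: create and attach a volume big enough for a full backup
VID=$(aws ec2 create-volume --size 4000 --availability-zone us-east-1a \
        --query 'VolumeId' --output text)
aws ec2 attach-volume --volume-id "$VID" --instance-id "$IID" --device /dev/sdf

# Step 3: on the backup box (via SSH/SSM), pull a full backup from production;
# EXTERNAL_IP is the production machine's backup address
neo4j-backup -from "$EXTERNAL_IP" -to /mnt/backup/graph.db

# Step 4: snapshot the backup volume once neo4j-backup exits
SNAP=$(aws ec2 create-snapshot --volume-id "$VID" \
        --query 'SnapshotId' --output text)
aws ec2 wait snapshot-completed --snapshot-ids "$SNAP"

# Step 5: tear everything down; only the snapshot is kept
aws ec2 terminate-instances --instance-ids "$IID"
aws ec2 wait instance-terminated --instance-ids "$IID"
aws ec2 delete-volume --volume-id "$VID"
```

Waiting for instance termination before deleting the volume matters: EBS won't delete a volume that is still attached.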
