The Amazing Shrinking Data Footprint on Disk

I am trying to understand some curious behavior in Neo4j that is desirable but perplexing: on occasion, the size of the stored data on disk seems to shrink. In the most recent example, I was uploading more nodes/edges via CSV bulk load. During the process, the number of nodes and edges increased, but the size on disk decreased (as reported in Linux by 'df -h | grep "var/lib/neo4j/data"').

BEFORE: 308 million nodes, 441 million edges, 755 GB on disk
AFTER: 309 million nodes, 453 million edges, 516 GB on disk

Is there some background process that compresses data from time to time? System details are below.

Dell OptiPlex 7010, i7-3770, 24GB RAM
Ubuntu Linux 18.04
Neo4j 4.0.0 (though I observed a similar phenomenon with Neo4j 3.5.x)
Driver: py2neo for Python
OS drive: 250GB SSD
Data drive: a 2TB HDD formatted as ext4, mounted at /var/lib/neo4j/data to hold the large amount of data

Thanks in advance for any insight.

It is possible that the transaction logs were cleaned up. What's your retention policy on transaction logs? By default they are retained for a week. During a bulk load the transaction logs can grow large because a lot of writes are happening. After a week they become eligible for removal because they fall outside the retention policy.
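
One way to check this theory is to measure how much of the data directory is taken up by transaction log files before and after the retention window passes. Below is a minimal sketch in Python. It assumes the log files are named with the prefix neostore.transaction.db and live somewhere under /var/lib/neo4j/data; both are assumptions about a typical install, and the function name tx_log_bytes is just illustrative, so adjust them to match your layout.

```python
import os


def tx_log_bytes(root, prefix="neostore.transaction.db"):
    """Sum the sizes (in bytes) of files under `root` whose names start with `prefix`.

    The default prefix is an assumption about how Neo4j names its
    transaction log files; change it if your install differs.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(prefix):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


if __name__ == "__main__":
    # Assumed mount point, taken from the question above.
    root = "/var/lib/neo4j/data"
    size_gb = tx_log_bytes(root) / 1e9
    print(f"{size_gb:.1f} GB of transaction logs under {root}")
```

If the number reported here drops by roughly the same amount as the 'df -h' figure, log pruning is the likely explanation rather than any compression of the store files.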

@anthapu - ah, that must be it - thank you! I did a very large upload about a week ago that added ~200 million nodes and ~200 million edges. Checking the configuration file, I see:

:~$ cat /etc/neo4j/neo4j.conf | grep "retention"
dbms.tx_log.rotation.retention_policy=7 days

Thanks again!