Issue with extremely large block.big_values.db file

So, I have a database that is storing artists and song collaborations. I have just hit ~1M artists (nodes) and 3.3M COLLABORATED_WITH edges.

My edges are somewhat "heavy", as they include arrays of UIDs, URLs, and names of songs.

My problem is that my actual on-disk size is almost 100x larger than my expected size.
I've run the following query to estimate the average and max size of my edges:

MATCH ()-[r:COLLABORATED_WITH]->()
WITH 
  reduce(s = 0, x IN r.songUris | s + size(x)) +
  reduce(s = 0, x IN r.songNames | s + size(x)) +
  reduce(s = 0, x IN r.albumUris | s + size(x)) +
  reduce(s = 0, x IN r.images | s + size(x)) AS totalSize
RETURN avg(totalSize) AS avgEdgeSizeBytes, max(totalSize) AS maxEdgeSizeBytes

with the result:

|avgEdgeSizeBytes|maxEdgeSizeBytes|
|632.182172431995|4834613|

Nodes are much smaller:

MATCH (n:Artist)
WITH n,
     size(toString(n.spotifyId)) + size(toString(n.name)) + 
     size(toString(n.image)) + size(toString(n.popularity)) + 
     size(toString(n.crawlStatus)) AS totalSize
RETURN avg(totalSize) AS avgNodeSizeBytes, max(totalSize) AS maxNodeSizeBytes
|avgNodeSizeBytes|maxNodeSizeBytes|
|43.92880912244179|274|

With these estimates and my total number of nodes and edges, I come out to about 1-2 GB of storage. Instead, my DB uses roughly 160 GB. I assumed this must be the result of very bad fragmentation, but I've already attempted to dump my DB (which dumps to ~8 GB) and reimport it, and it still expands back out to 160 GB. Another weird issue is that my block.big_values.db file is absolutely massive at roughly 157 GB.
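For reference, the back-of-envelope math behind that expectation, using the averages above and ignoring any per-record overhead or indexes:

3.3M edges × ~632 B ≈ 2.1 GB
1M nodes × ~44 B ≈ 44 MB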

Could anyone help diagnose my issue further? I'd assume it's related to my edges being so data-dense, but even after trimming all of my edges to no more than 50 entries per array (sketch below) and doing a dump/reload, I still have the same issue.
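The trim was roughly this shape (a simplified sketch using the same property names as in my size query, not the exact statement I ran):

MATCH ()-[r:COLLABORATED_WITH]->()
WHERE size(r.songUris) > 50
// keep only the first 50 entries of each parallel array
SET r.songUris = r.songUris[0..50],
    r.songNames = r.songNames[0..50],
    r.albumUris = r.albumUris[0..50],
    r.images = r.images[0..50]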

I did some analysis to try to figure out what your edge distribution is; it doesn't seem to fit a normal, binomial, or Poisson distribution (the common ones).

So some guess-timations:

  • if 2% of your edges are 50% of the max-edge size ... you get to 148 GB (only on those edges)
  • the other way around, if 5% of your edges are 20% of the max-edge size ... you also get to 148GB
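As a rough sanity check on those guesses, assuming ~3.3M edges: 2% of 3.3M is ~66,000 edges, and 148 GB / 66,000 ≈ 2.2 MB per edge, i.e. roughly half of your 4.8 MB max; 5% of 3.3M is ~165,000 edges, and 148 GB / 165,000 ≈ 0.9 MB per edge, i.e. roughly 20% of the max.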

You might want to re-think whether those collaborations aren't actually a graph themselves:

(John) -[:collaborated]->(some_song)<-[:collaborated]-(Paul)
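A rough migration sketch of what I mean, where the Song label, its uri key, and the COLLABORATED_ON type are just placeholders, and which assumes your songUris/songNames arrays are parallel:

// Song label, uri key, and COLLABORATED_ON type are placeholders
MATCH (a:Artist)-[r:COLLABORATED_WITH]->(b:Artist)
// one Song node per array entry, assuming songUris/songNames line up
UNWIND range(0, size(r.songUris) - 1) AS i
MERGE (s:Song {uri: r.songUris[i]})
  ON CREATE SET s.name = r.songNames[i]
MERGE (a)-[:COLLABORATED_ON]->(s)
MERGE (b)-[:COLLABORATED_ON]->(s)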

it could even be that you need 'tiers' of nodes:

(some_song)-[:hasVersion]->(version_song)
(Ringo)-[:composed]->(some_song)
(John) -[:collaborated]->(version_song)<-[:collaborated]-(Paul)

And maybe I am wrong, but about r.images: if that's a URI, fine; if it's an actual JPG/PNG, it definitely doesn't belong inside a database.

I really appreciate your response! I'm currently still in the process of troubleshooting exactly what is causing my issue, but I can respond to some of your points.

Yes, r.images is an array of URLs. I agree that storing raw images would be a bad idea!

  • if 2% of your edges are 50% of the max-edge size ... you get to 148 GB (only on those edges)
  • the other way around, if 5% of your edges are 20% of the max-edge size ... you also get to 148GB

The problem with this line of reasoning is that I also included the average size. That average INCLUDES those outliers, and it's still only ~600 B per edge.

avgEdgeSizeBytes 632.182172431995

Some more information:

MATCH ()-[r:COLLABORATED_WITH]->()
WHERE size(r.songUris) > 100
RETURN count(r) AS edgesOver100

edgesOver100
6630

I only have ~6.6k edges with over 100 songs.

MATCH ()-[r:COLLABORATED_WITH]->()
WHERE size(r.songUris) > 100
RETURN 
  size(r.songUris) AS numSongs,
  reduce(total = 0, uri IN r.songUris | total + size(uri)) +
  reduce(total = 0, name IN r.songNames | total + size(name)) +
  reduce(total = 0, album IN r.albumUris | total + size(album)) +
  reduce(total = 0, img IN r.images | total + size(img)) AS approxCharacters
LIMIT 1

The rough approximate size of an edge with just over 100 songs is about 15 KB? My guess here is that size() counts characters, so roughly 1 byte per character for these strings, but I could definitely be wrong here.

numSongs	approxCharacters
101      	15244

But even in the extreme cases, I only have 433 edges with over 500 songs, and the estimated size was about 1.5 MB per edge (which would only explain perhaps 1 GB of extra storage!).
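That count came from the same pattern as before, just with a higher threshold (roughly):

MATCH ()-[r:COLLABORATED_WITH]->()
WHERE size(r.songUris) > 500
RETURN count(r) AS edgesOver500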

The outlier edges are definitely the problem in one way or another. By truncating all edges with over 50 songs down to 50, exporting with an APOC query, and reimporting, I was able to get it down to ~4 GB (which makes sense based on the averages I calculated!).
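The export step was along these lines (the file name is a placeholder, the exact procedure/config may differ from what I used, and file exports need apoc.export.file.enabled=true in apoc.conf):

// requires apoc.export.file.enabled=true; file name is a placeholder
CALL apoc.export.cypher.all("collabs-trimmed.cypher", {format: "cypher-shell"})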

So, I'm still curious what the issue is with these large edges! I have the original database still in place, and I'll continue to push until I find what the real 150 GB issue is.

@joshualawson8

Is there any detail on what version of Neo4j is in play here?

Per Neo4j 5 - Graph Database & Analytics, the release note for Neo4j 5.25.1 (31 October 2024) lists, under Kernel Fixes:

Fixes an issue with unexpected store size growth in block format after deleting data, caused by large property values, resulting in deleted large value records rarely being reused.

And if I'm not mistaken, if you ever ran prior to Neo4j 5.25.1, then the fix simply prevents new unexpected growth; it does nothing to address the prior unexpected growth. However, I believe running neo4j-admin database copy should clear out the fragmentation.
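Roughly (the database names are placeholders, and the source database has to be stopped first):

# database names are placeholders; the source database must be stopped
bin/neo4j-admin database copy neo4j neo4jcopy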


Yeah! I was running 5.24.0 here. That must be the issue! I ran neo4j-admin database copy and my size went down to 2.5 GB!

I'll upgrade my database. Unfortunately, when I downloaded Neo4j Desktop earlier this month, it gave me 5.24.0 as the default.

Thank you very much for the help!

I think a large part of my issue was that I was updating these "heavy" edges very frequently, and I'd imagine the large property values were being deleted and then recreated each time.

@joshualawson8

Thanks for the update. In the future, when reporting new issues on community.neo4j.com, including the version detail is a best practice.

Of course! Thank you very much. I'll make sure to do that in the future. : )