So, I have a database storing artists and song collaborations. I've just hit ~1M artists (nodes) and 3.3M COLLABORATED_WITH edges.
My edges are somewhat "heavy", as each one carries arrays of UIDs, URLs, and song names.
My problem is that the actual on-disk size is almost 100x larger than my estimate.
I've run the following query to estimate the average and maximum size of my edge properties:
MATCH ()-[r:COLLABORATED_WITH]->()
WITH
reduce(s = 0, x IN r.songUris | s + size(x)) +
reduce(s = 0, x IN r.songNames | s + size(x)) +
reduce(s = 0, x IN r.albumUris | s + size(x)) +
reduce(s = 0, x IN r.images | s + size(x)) AS totalSize
RETURN avg(totalSize) AS avgEdgeSizeBytes, max(totalSize) AS maxEdgeSizeBytes
with the result:
|avgEdgeSizeBytes|maxEdgeSizeBytes|
|---|---|
|632.182172431995|4834613|
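Given the gap between the ~632 B average and the ~4.8 MB max, the sizes look heavily skewed, so something like this should surface the worst offenders (just a sketch using the same property names as above; coalesce is only there so edges missing one of the lists don't drop out as null, and size() counts characters rather than bytes):

MATCH (a)-[r:COLLABORATED_WITH]->(b)
WITH a, b,
     reduce(s = 0, x IN coalesce(r.songUris, [])  | s + size(x)) +
     reduce(s = 0, x IN coalesce(r.songNames, []) | s + size(x)) +
     reduce(s = 0, x IN coalesce(r.albumUris, []) | s + size(x)) +
     reduce(s = 0, x IN coalesce(r.images, [])    | s + size(x)) AS totalSize
RETURN a.name AS artistA, b.name AS artistB, totalSize
ORDER BY totalSize DESC
LIMIT 20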
Nodes are much smaller:
MATCH (n:Artist)
WITH n,
size(toString(n.spotifyId)) + size(toString(n.name)) +
size(toString(n.image)) + size(toString(n.popularity)) +
size(toString(n.crawlStatus)) AS totalSize
RETURN avg(totalSize) AS avgNodeSizeBytes, max(totalSize) AS maxNodeSizeBytes
|avgNodeSizeBytes|maxNodeSizeBytes|
|---|---|
|43.92880912244179|274|
With these estimates and my total number of nodes and edges, I come out to roughly 2 GB of property data (3.3M edges × ~632 B plus 1M nodes × ~44 B). Instead, the DB uses roughly 160 GB. I assumed this had to be very bad fragmentation, but I've already dumped the DB (the dump is only ~8 GB) and reimported it, and it still expands back out to 160 GB. Another odd detail is that my "block.big_values.db" file is absolutely massive, at roughly 157 GB.
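To double-check that estimate, the per-edge payload can also be summed directly (again a sketch with the same property names; this counts characters in the list values, not actual bytes on disk):

MATCH ()-[r:COLLABORATED_WITH]->()
WITH reduce(s = 0, x IN coalesce(r.songUris, [])  | s + size(x)) +
     reduce(s = 0, x IN coalesce(r.songNames, []) | s + size(x)) +
     reduce(s = 0, x IN coalesce(r.albumUris, []) | s + size(x)) +
     reduce(s = 0, x IN coalesce(r.images, [])    | s + size(x)) AS totalSize
RETURN sum(totalSize) AS totalEdgePayloadChars, count(*) AS edgeCount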
Could anyone help me diagnose this further? I'd assume it's related to my edges being so data-dense, but even after trimming each edge's arrays to no more than 50 entries and doing another dump/reload, I still have the same issue.
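For reference, the trim was along these lines (a sketch, not the exact statement I ran; if this is Neo4j, CALL ... IN TRANSACTIONS needs 4.4+ and an auto-commit transaction, e.g. prefixed with :auto in Browser):

MATCH ()-[r:COLLABORATED_WITH]->()
WHERE size(coalesce(r.songUris, []))  > 50
   OR size(coalesce(r.songNames, [])) > 50
   OR size(coalesce(r.albumUris, [])) > 50
   OR size(coalesce(r.images, []))    > 50
CALL {
  WITH r
  // list slices are end-exclusive, so [0..50] keeps the first 50 entries
  SET r.songUris  = r.songUris[0..50],
      r.songNames = r.songNames[0..50],
      r.albumUris = r.albumUris[0..50],
      r.images    = r.images[0..50]
} IN TRANSACTIONS OF 10000 ROWS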