Storing potentially large nodes in Neo4j

krasi_964
Node

Hello everyone,

This is my first post on the Neo4j Community Forum so I apologize if it in any way violates the rules and standards of the forum. I will update my topic if any violation is pointed out.

Currently, my team is considering using Neo4j for a new service. We tried a few POCs and it looks like Neo4j is really convenient to work with for our purposes. We, however, have one issue which might turn out to be a problem.

We only need a few fields of our data for filtering, indexing, etc., so only those fields are stored as node properties. The remaining fields, which we are not interested in, are all stored in one "payload" property (currently a JSON string, but it can be anything and should be thought of as one "large string"). That property is only stored and retrieved; we won't be doing any indexing, filtering, or other "expensive" operations on it.
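To make the model concrete, here is a minimal sketch of what such a node could look like. All names in it (the Item label and the id, status, and payload properties) are hypothetical and chosen only for illustration, and the index syntax assumes a recent Neo4j version (4.x or later).

```cypher
// Hypothetical names throughout: Item, id, status, payload.
// Index only the field that is used for filtering.
CREATE INDEX item_status IF NOT EXISTS FOR (i:Item) ON (i.status);

// Store the small filter fields plus the opaque payload string.
CREATE (i:Item {id: "a1", status: "active", payload: '{"any": "json"}'});

// Filter on the indexed field and return only the small fields.
MATCH (i:Item {status: "active"})
RETURN i.id, i.status;

// Fetch the payload only when it is actually needed.
MATCH (i:Item {id: "a1"})
RETURN i.payload;
```

The point of the last two queries is that the payload is only transferred when it is returned explicitly.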

In some cases (NOT all of them, maybe 10%), that "payload" property can be quite large (~10MB or more). I did my research on storing such "big" nodes in Neo4j and was left with the feeling that they are generally discouraged and that query performance can suffer, but I am not sure whether that applies to our case. I understood that having a large number of properties might become a problem, but we won't have many (maybe 4-5 others plus the one "payload" property, which is only stored and retrieved).

So any ideas - will the DB perform badly with nodes modeled that way?

Thanks in advance for all your answers!

1 REPLY

jo_nathan
Node Clone

Welcome!
I just ran into this issue today with my first Neo4j project. Here is what I have learned so far.
A few of my nodes have huge properties: lists with 70k elements, which equates to 1.3 MB.

I am able to retrieve such a node.

match(n:Label1{name:"dummy"})
return n;

in about 5 seconds.
But whenever these nodes are part of a query where relationships to other nodes exist, something goes very badly wrong and the DB becomes slow, unresponsive, or crashes.

match(n:Label1{name:"dummy"})-[r]->(s:Label2)
return count(r);
// Returns 55

There are 55 relationships from this dummy node to 55 nodes with Label2.

match(n:Label1{name:"dummy"})-[r]->(s:Label2)
return n,s;
// crashes the DB

Since every row contains that enormous list with 70k elements over and over again, this results in 55 * 1.3 MB = 71.5 MB (roughly 70 MB) of data being retrieved. In my case, that more or less crashes the DB.

What to do about it?

1. Choose which properties to retrieve

match(n:Label1{name:"dummy"})-[r]->(s:Label2)
return n.name,s;
// only n.name is returned

2. Exclude the payload property

match(n:Label1{name:"dummy"})
with keys(n) as k1, n 
unwind k1 as k2
with n,  k2 where k2 <> "payload"
return collect([k2,  n[k2]]);
// Format of the returned data is different to the default!

or

match(n:Label1{name:"dummy"})
RETURN [x in keys(n) WHERE not x in ["payload"]| [x, n[x] ] ] as nc
// Format of the returned data is different to the default!

or using apoc.map.removeKey()
I found these methods here:

Note that with these manipulations the returned fields are just plain data, not in a format that is recognized as a node.
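Another built-in way to pick properties is a Cypher map projection. The sketch below reuses the labels from the queries above and assumes name is the only property of interest. Like the other variants, it returns a map rather than a node.

```cypher
// Map projection: list the wanted properties explicitly.
// The result column n is a map, not a node.
MATCH (n:Label1 {name: "dummy"})-[r]->(s:Label2)
RETURN n {.name, labels: labels(n)} AS n, s;
```

A map projection is a whitelist, so the large property stays out only as long as it is not listed. Using .* inside the braces would pull in every property again, including the big one.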
//////////////////
This is where I am currently at: how to exclude a property but still keep the same format, i.e. bring the data back into the node format?
Maybe apoc.create.vNode might be useful?

If anyone knows, let me know too!

Edit: Here is a way to use apoc.create.vNode. Keep in mind that the result is a brand new (virtual) node, so virtual relationships must be created too.

match(n:Label1{name:"dummy"})
return apoc.create.vNode(labels(n),apoc.map.removeKey(n, 'payload')) as node;
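As a rough sketch of the virtual-relationship part, the query below builds the virtual node once and then attaches a virtual copy of each outgoing relationship to it. It assumes APOC is installed and that the large property is called payload. It has not been run against the data described above.

```cypher
// Build one virtual copy of n without the payload property.
MATCH (n:Label1 {name: "dummy"})
WITH n, apoc.create.vNode(labels(n), apoc.map.removeKey(properties(n), "payload")) AS vn
// Expand to the neighbours and mirror each relationship onto the virtual node.
MATCH (n)-[r]->(s:Label2)
RETURN vn, s, apoc.create.vRelationship(vn, type(r), properties(r), s) AS vr;
```

Creating the virtual node before the second MATCH means every returned row refers to the same virtual node instead of a fresh copy per relationship.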