Hi There!
I'm still learning how Neo4j works, and have some questions about what's going on under the hood when data is stored and when queries traverse nodes and relationships.
This article was very helpful, but I have some follow-up questions based on the information it presents: Understanding Neo4j’s data on disk - Knowledge Base
1. Are there property storage limits?
Based on the information presented in this article, it looks like the maximum size of a single property is 128 bytes, even for large strings or arrays:
- The property record has a payload of 32 bytes, which is divided into four 8B blocks. Each block can hold either a key or a value, or both a key and a value.
- the key and type occupy 3.5 bytes (key 24 bits, type 4 bits)
- boolean, byte, short, int, char, float are fitted in the remaining bytes of that same block
- a small long is also fitted in the same block
- a big long or a double is stored in a separate block (meaning two blocks used by that property)
- a reference to the string store or array store is fitted in the same block as the key
- a short string or short array is stored in the same record if it fits in the remaining blocks (including the remaining bytes of the key-block)
- long strings/arrays that do not fit in 8B blocks will have a pointer to a record on the string/array store (128B)
- … other types get more involved!
Is this true? Or is a single property that exceeds 128 bytes simply stored in a linked list of 128 byte chunks? Or something different entirely?
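As a concrete example (the :Book label and property names here are just ones I made up for illustration), this is the kind of property I'm wondering about:

```cypher
// Hypothetical example: the :Book label and these property names are my own invention.
// If I'm reading the article right, "title" is short enough to be inlined in the node's
// property record, while "synopsis" (well over 128 bytes) would go to the dynamic
// string store. My question is whether it gets chained across multiple 128B records there.
CREATE (b:Book {
  title: "Dune",
  synopsis: "Imagine several paragraphs of blurb here, certainly more than 128 bytes of text..."
})
```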
Which leads into my second question:
2. What traversal paths do Neo4j's algorithms take?
Based on the article I linked, it appears that Neo4j stores all of its data in the following manner:
Data stored on disk is all linked lists of fixed-size records. Properties are stored as a linked list of property records, each holding a key and value and pointing to the next property. Each node and relationship references its first property record. Each node also references the first relationship in its relationship chain. Each relationship references its start and end node, and it also references the previous and next relationship record for the start and end node respectively.
Here's a sample query I might run:
Let's say I have a database that is storing works of fiction. A single work of fiction is a node. I'm also storing data about genres. Each genre is a node. Genres and works of fiction can be connected by a relationship.
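In Cypher terms, I'm imagining a model roughly like this (the labels, relationship type, and property names are placeholders I made up):

```cypher
// Hypothetical model: :Work, :Genre, :IN_GENRE and createdAt are my own naming.
CREATE (g:Genre {name: "fantasy"});
CREATE (w:Work  {title: "Some Novel"});

// The relationship carries a timestamp (milliseconds since epoch) for when the
// work of fiction was created.
MATCH (w:Work {title: "Some Novel"}), (g:Genre {name: "fantasy"})
CREATE (w)-[:IN_GENRE {createdAt: timestamp()}]->(g);
```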
Say I want to query for fiction that is in the "fantasy" genre. I would initially query for this particular genre node, which, with an index, should take O(log n) time.
From there, I want to query for works of fiction in the "fantasy" genre that were created in the last week. Let's say I have a timestamp as a property on each relationship denoting when the work of fiction was created. Let's also say that I've created an index on that property.
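Concretely, the setup and query I have in mind look something like this (same placeholder names as above; the relationship property index syntax is what I believe recent Neo4j versions support):

```cypher
// Index for looking up the Genre node by name: the O(log n) lookup mentioned above.
CREATE INDEX genre_name IF NOT EXISTS FOR (g:Genre) ON (g.name);

// Index on the relationship's createdAt property (relationship property indexes
// are available in newer Neo4j versions, if I understand correctly).
CREATE INDEX in_genre_created IF NOT EXISTS FOR ()-[r:IN_GENRE]-() ON (r.createdAt);

// Find works in the "fantasy" genre created in the last 7 days.
MATCH (g:Genre {name: "fantasy"})<-[r:IN_GENRE]-(w:Work)
WHERE r.createdAt > timestamp() - 7 * 24 * 60 * 60 * 1000
RETURN w;
```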
When running this portion of the query, will the search algorithm have to iterate through the entire linked list of relationships to check the creation time, which would be O(n) time? Or can it also make use of indexes on properties in this situation, reducing the complexity to O(log n)?
I ask because I'm thinking about how efficient this type of query would be if I had millions of relationships connected to a single Genre node. Would that be an efficient way to handle things? Or is there a better modeling pattern?
I haven't been able to find any documentation on the underlying time complexity of Neo4j's traversal algorithms, so any information on this would be very useful.
Thank you!