So you've got a graph similar to something like this:
And you effectively have a vector index on :DocumentChunk and vector.
When you query the vector index you get back :DocumentChunk nodes (up to the number of nearest neighbours you ask for) and their respective similarity scores, ordered by default from most to least similar.
From that you can MATCH on the path to find the relevant :Documents or :Projects.
This is a fine model for handling unstructured data of this type.
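As a rough sketch of that query-then-MATCH step, here is the Cypher held in a small Python helper. The relationship types HAS_CHUNK and HAS_DOCUMENT are placeholders I've made up for illustration (your graph will have its own), as is the helper name; db.index.vector.queryNodes is the vector index query procedure.

```python
# Sketch only: HAS_CHUNK and HAS_DOCUMENT are hypothetical relationship types.
# Swap in whatever actually connects your chunks, documents and projects.
VECTOR_QUERY = """
CALL db.index.vector.queryNodes($index_name, $k, $query_vector)
YIELD node AS chunk, score
MATCH (chunk)<-[:HAS_CHUNK]-(doc:Document)<-[:HAS_DOCUMENT]-(project:Project)
RETURN project, doc, chunk, score
ORDER BY score DESC
"""


def build_vector_query(index_name, query_vector, k=10):
    """Return (cypher, parameters) for a top-k vector search plus traversal."""
    if k < 1:
        raise ValueError("k must be at least 1")
    params = {
        "index_name": index_name,
        "k": k,
        # Make sure every component is a plain float before sending it over.
        "query_vector": [float(x) for x in query_vector],
    }
    return VECTOR_QUERY, params
```

With the official Python driver you would then pass both parts to something like session.run(cypher, params). The index name you pass in is whatever you called your index.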
That's absolutely fine, though it does mean that you may get :DocumentChunks from :Documents in completely different :Projects. This may indeed be what you're looking to do, and that's fine.
It comes down to the semantics of the search you're wanting.
If you want to only find similar things within a project, then you'd need an index specific to it.
Space-wise it'll be about the same, give or take some extra structural overhead, though compared to the amount of data it won't be that much. Vectors are large structures, so be sure to understand the space needs.
For example, just looking at raw data, an OpenAI 1536-d vector will take up at least 6 KiB of raw space, if you can store it as a float[].
Cypher will actually try to store all LIST<FLOAT>s (in Cypher type land) as a double[], which doubles the raw size.
You can use the db.create.setVectorProperty(), or the soon to be db.create.setNodeVectorProperty() procedures, to set the property with some validation, as a float[].
You'll have one copy in the store, one temporarily in the transaction logs (which will be rotated away eventually), and one copy in the index (the index will always convert to float[]).
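To put numbers on that, here is a back-of-the-envelope estimate in Python. It only counts the raw vector bytes for the store copy and the index copy; it ignores the temporary transaction log copy and any structural overhead in the store or index, so treat the result as a lower bound. The function names are mine, purely for illustration.

```python
FLOAT32_BYTES = 4  # one component of a float[]
FLOAT64_BYTES = 8  # one component of a double[]


def raw_vector_bytes(dimensions, bytes_per_component):
    """Raw size of a single vector, without any storage overhead."""
    return dimensions * bytes_per_component


def estimate_raw_bytes(num_vectors, dimensions, stored_as_float32=True):
    """Lower bound on disk use: one store copy plus one index copy per vector.

    The index copy is always counted as float32; the store copy is float32
    only if the property was written as a float[], otherwise float64.
    """
    store_width = FLOAT32_BYTES if stored_as_float32 else FLOAT64_BYTES
    store_copy = raw_vector_bytes(dimensions, store_width)
    index_copy = raw_vector_bytes(dimensions, FLOAT32_BYTES)
    return num_vectors * (store_copy + index_copy)


def to_gib(num_bytes):
    """Convert bytes to GiB for easier reading."""
    return num_bytes / 1024 ** 3
```

A single 1536-d vector is 6144 bytes (6 KiB) as a float[] and 12288 bytes as a double[]. For a million such vectors, estimate_raw_bytes(1_000_000, 1536) comes to about 11.4 GiB, and about 17.2 GiB if the store copy ends up as a double[].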
It doesn't sound like much, but as you massively increase the number of vector embeddings in any database, especially one which has a vector search index, you'll be using a lot of disk space.
Labels are cheap, so please use them to name and tag data. There are lookup indexes that will be used for normal Cypher queries, even if there isn't another index defined. These indexes are keyed on labels and relationship types; without a label to go on, a query has to fall back to a store scan and a filter, which isn't ideal.
So if you did want to split things and have a finite number of projects, you can give each project its own dedicated label. Your chunk nodes could then have both labels, :DocumentChunk:SpecificLabel, and you'd have a vector index over :SpecificLabel and the property vector.
Sometimes you want to do a general query over all :DocumentChunks; other times you want to search specifically on just that type.
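A minimal sketch of setting that up, again with the Cypher held in Python. Everything specific here is an assumption: the relationship types, the name property on :Project, the index naming scheme and the cosine similarity default are all made up for illustration. db.index.vector.createNodeIndex is the index creation procedure as I'd expect it on the releases this thread is about; later releases also have a CREATE VECTOR INDEX command, so check the docs for your version.

```python
def label_index_statements(project_name, project_label,
                           dimensions=1536, similarity="cosine"):
    """Return a list of (cypher, parameters) pairs that tag one project's
    chunks with a dedicated label and index the vector property on it."""
    # The label is spliced into the query text, so only accept a plain
    # identifier here.
    if not project_label.isidentifier():
        raise ValueError(f"not a usable label: {project_label!r}")

    # Hypothetical path from chunk to project; adapt to your own model.
    add_label = (
        "MATCH (chunk:DocumentChunk)<-[:HAS_CHUNK]-(:Document)"
        "<-[:HAS_DOCUMENT]-(:Project {name: $project_name}) "
        f"SET chunk:{project_label}"
    )
    create_index = (
        "CALL db.index.vector.createNodeIndex("
        "$index_name, $label, 'vector', $dimensions, $similarity)"
    )
    return [
        (add_label, {"project_name": project_name}),
        (create_index, {
            "index_name": f"{project_label}-chunk-embeddings",
            "label": project_label,
            "dimensions": dimensions,
            "similarity": similarity,
        }),
    ]
```

You'd run the two statements in order, once per project. Queries against the per-project index then only ever see that project's chunks, while an index over :DocumentChunk (if you keep one) still covers everything.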
Of course you still could have a vector index over all, and over specific … and the semantics may make sense for that, but that would be expensive on disk size.
The specific label approach doesn't really work well as the number of projects grows a lot.
For something like that you would want some form of pre/post filtering.
This is an area we are still actively investigating, developing, and improving.
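One way to approximate post-filtering today is to over-fetch from the single :DocumentChunk index and then filter by project in the same query. The sketch below assumes the same made-up relationship types as before, a name property on :Project, and an arbitrary over-fetch factor; none of those come from your model, so adjust to taste.

```python
# Sketch only: HAS_CHUNK, HAS_DOCUMENT and the `name` property are hypothetical.
POST_FILTER_QUERY = """
CALL db.index.vector.queryNodes($index_name, $fetch_k, $query_vector)
YIELD node AS chunk, score
MATCH (chunk)<-[:HAS_CHUNK]-(:Document)<-[:HAS_DOCUMENT]-(:Project {name: $project_name})
RETURN chunk, score
ORDER BY score DESC
LIMIT $k
"""


def build_post_filter_query(index_name, query_vector, project_name,
                            k=10, overfetch=5):
    """Return (cypher, parameters) that fetch k * overfetch neighbours and
    keep at most the k best that belong to the given project."""
    if k < 1 or overfetch < 1:
        raise ValueError("k and overfetch must both be at least 1")
    params = {
        "index_name": index_name,
        "fetch_k": k * overfetch,
        "k": k,
        "query_vector": [float(x) for x in query_vector],
        "project_name": project_name,
    }
    return POST_FILTER_QUERY, params
```

The catch with post-filtering is that if fewer than k of the over-fetched neighbours belong to the project, you get fewer than k results back, so the over-fetch factor is a trade-off between recall and query cost.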
Vector search is fantastic for finding the implicit relationships in unstructured data; a graph then allows you to make those implicit relationships explicit, adding some structure to ground the data.
Another, separate option which could be used alongside all this is full-text search, which is better at finding specific words and phrases, whereas vectors deal in semantically similar data (according to the LLM).