Vector search index pre-filtered query

Hi everyone,

I have a graph mixing structured and unstructured data, i.e. structured nodes and relationships used as the actual database, connected to nodes containing plain-text Documents.
These Document nodes are embedded and indexed using a Vector Search Index.
For simplicity, let's just say that each Document is linked to a Project node, and each Project can have one or more documents.

Now the thing is, a Project's information is not public and access is restricted per user (this is done at higher-level in the application).
This means that when I do a similarity search using the vector index on the Documents, I need to be able to filter my results so that only Documents of specified Projects are returned.
Today, I have something like this:

CALL db.index.vector.queryNodes('documentVectorIndex', $k, $embedding) 
YIELD node, score
WHERE ...

And I call this with a large enough k parameter so that the post-filtering still returns enough relevant results.

But I feel like this is not really a good solution in the long run. The database will grow bigger and bigger and I suppose that using a very high k parameter will result in slower and slower queries.

Is there some way to instead do a pre-filtering and have the index only evaluate relevant Documents?
If not, is this something that could eventually be implemented in the future?
Or is there an alternative I don't know to achieve the same result?

Hi @luc
Unfortunately, the current implementation of the vector index doesn't allow for pre-filtering. We have some room on the roadmap to investigate what we can do here at some point; however, on slight reflection, we may only be able to support some simple operations in pre-filtering.

Currently, post-filtering is the only option for such things, which likely means increasing the value of k to account for an estimated difference, perhaps slightly overshooting, filtering, and then using LIMIT to get back to the intended k.
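A sketch of that pattern (the :BELONGS_TO relationship and the $allowedProjectIds parameter are illustrative assumptions, not from your model):

// Overshoot the index query, then filter and trim back to k
CALL db.index.vector.queryNodes('documentVectorIndex', $overshootK, $embedding)
YIELD node, score
MATCH (node)-[:BELONGS_TO]->(project:Project)
WHERE project.id IN $allowedProjectIds
RETURN node, score
ORDER BY score DESC
LIMIT $k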

This certainly isn't ideal, I agree.

It's interesting that you're effectively creating a vector space across all documents regardless of the project. I presume you're using the search to find documents similar to an input document, and then finding the actual projects that contain a similar document?

If not, and the documents should be searched specifically within projects rather than across them, perhaps have an index for each project (though each project's documents would need their own label).

Hi, thanks for your answer.

I am still half in exploration phase, so nothing is set in stone yet. Right now I have a bunch of use cases and am trying to figure out what is or isn't possible and how.

To answer your question, the documents can go over 100 or 200 pages, so they are split into DocumentChunk nodes, which are the ones actually indexed.

One of the use cases I'm trying to cover is to be able to ask a question, retrieve the relevant chunks, and give them as context to an LLM to answer the question. Further use cases will probably fetch more complex info from the graph than just the chunks of text, but I'll keep it simple for now.
A question can be asked either on one or multiple documents that may belong to one or multiple projects, so pre-filtering seemed like the way to go.

I did think of making one index per project, but felt a bit hesitant. I'm still new to Neo4j, so I don't really know what to expect performance- or memory-wise when the number of projects explodes and causes the number of indexes and node labels to increase as well.
Do you have any thoughts on that?

So you've got a graph similar to something like this:
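Roughly (the relationship names here are just illustrative):

(:Project)-[:HAS_DOCUMENT]->(:Document)-[:HAS_CHUNK]->(:DocumentChunk {text: '...', vector: [...]})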

And you effectively have a vector index on :DocumentChunk and the vector property.

When you query the vector index you'd get all the :DocumentChunk nodes and their respective similarity scores, ordered by default from most to least similar.

From that you can MATCH on the path to find the relevant :Documents or :Projects.
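For example (the index name and relationship types are assumptions for illustration):

CALL db.index.vector.queryNodes('documentChunkVectorIndex', $k, $embedding)
YIELD node AS chunk, score
// Walk back up the path from each chunk to its document and project
MATCH (project:Project)-[:HAS_DOCUMENT]->(document:Document)-[:HAS_CHUNK]->(chunk)
RETURN project, document, chunk, score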

This is a fine model for handling unstructured data of this type.

That's absolutely fine; it does mean that you may get :DocumentChunks from :Documents on completely different :Projects. This may indeed be what you're looking to do, and that's fine.
It comes down to the semantics of the search you're wanting.

If you want to only find similar things within a project, then you'd need an index specific to it.

Space-wise it'll be about the same, give or take some extra structural overhead, though compared to the amount of data it won't be that much. Vectors are large structures, so be sure to understand the space needs.
For example, just looking at raw data, an OpenAI 1536-dimensional vector will take up at least 6 KiB of raw space (1536 dimensions × 4 bytes per float), if you can store it as a float[].

Cypher will actually try to store all LIST<FLOAT> values (in Cypher's type system) as a double[].
You can use the db.create.setVectorProperty() procedure, or the soon-to-be db.create.setNodeVectorProperty() procedure, to set the property with some validation as a float[].
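A minimal sketch (the id property is an assumption):

MATCH (chunk:DocumentChunk {id: $chunkId})
// Validates and stores the embedding as a float[] rather than a double[]
CALL db.create.setVectorProperty(chunk, 'vector', $embedding) YIELD node
RETURN node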

You'll have one copy in the store, one temporarily in the transaction logs (which will be rotated away eventually), and one copy in the index (the index will always convert to float[]).

It doesn't sound like much, but as you massively increase the number of vector embeddings in any database, especially one with a vector search index, you'll be using a lot of disk space.

Labels are cheap; please use them to name and tag data. There are lookup indexes that will be used for normal Cypher queries, even if no other index is defined. These lookup indexes use those labels and relationship types; without them, the planner has to fall back to a store scan and a filter, which isn't ideal.

So if you did want to split things and have a finite number of projects, you can give each project its own dedicated label. Your chunk nodes could then have both labels, :DocumentChunk:SpecificLabel, and you'd have a vector index over :SpecificLabel and the property vector.
Sometimes you want to do a general query over all :DocumentChunks; other times you want a search over just that specific type.
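For example (the label, index name, dimension, and relationship types are illustrative):

// Tag a project's chunks with a dedicated label
MATCH (:Project {identifier: $projectId})-[:HAS_DOCUMENT]->(:Document)-[:HAS_CHUNK]->(chunk:DocumentChunk)
SET chunk:SpecificLabel

// Then index just that label
CALL db.index.vector.createNodeIndex('specificLabelChunks', 'SpecificLabel', 'vector', 1536, 'cosine')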

Of course, you could still have a vector index over all chunks as well as over specific ones … and the semantics may make sense for that, but it would be expensive in disk space.

The specific label approach doesn't really work well as the number of projects grows a lot.
For something like that you would want some form of pre/post filtering.

This is an area we are still actively investigating, developing, and improving.
Vector search is fantastic for finding the implicit relationships in unstructured data; a graph then allows you to make those implicit relationships explicit, grounding the data with some structure.

Another, separate option, which could be used alongside all this, is full-text search. It is better at finding specific words and phrases, whereas vectors deal in semantically similar data (according to the LLM).
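A sketch of that option (the index name and text property are assumptions):

CREATE FULLTEXT INDEX documentChunkText IF NOT EXISTS
FOR (c:DocumentChunk) ON EACH [c.text]

CALL db.index.fulltext.queryNodes('documentChunkText', $searchString)
YIELD node, score
RETURN node, score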


Thanks again for this detailed answer, especially about the per-vector space costs.

The structure is indeed as you described.
If labels are cheap, then I could probably define one label per project by using a technical id handled at the application level. Something like:

(:Project { identifier: 'P1234' })
(chunk:DocumentChunk:P1234)
CALL db.index.vector.createNodeIndex('documentIndexP1234', 'P1234', 'vector', 1536, 'cosine')

That would effectively give me one index per project, and make the right one easy to retrieve.

In my use cases, I think the most frequent one will be questions on one specific project. And multi-project questioning will probably involve a small number of projects, so I could just call the respective indexes separately and pick the best matches afterwards.

For multi-project questioning across lots of projects, I will probably make a general index with all the chunks and do some post-filtering for now. At least until pre-filtering options are eventually added (if ever).

For mono-document questioning (i.e. a small number of chunks), I found that calling gds.similarity.cosine manually on each chunk is actually quite quick. For a hundred or so chunks, it took around 100 ms on average. It means that for this kind of very targeted questioning, I don't even need an index, saving quite a bit of space.
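For reference, that direct-similarity query looks something like this (the relationship type and limit are illustrative):

MATCH (:Document {id: $documentId})-[:HAS_CHUNK]->(chunk:DocumentChunk)
RETURN chunk, gds.similarity.cosine(chunk.vector, $embedding) AS score
ORDER BY score DESC
LIMIT 5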

Does all this sound reasonable to you?

Hi Luc, thanks for your clear question and use case; and thanks Matthew for your answers.

I have exactly the same issue, except for an internal company database of documents, from different business groups (BG). And even within a BG, we have documents we don't want to 'leak' across in the vector queries.

Before Neo4j, we were using standard vector databases, like Qdrant and others, which support the idea of "collections", which naturally allowed this filtering. So it would be great to have this feature here too. But I'm going to explore the use of labels now.

Thanks, Kevin

Running into a very similar issue here! We just started exploring Neo4j vector search, and our graph has versioned nodes. It would be great if we could match nodes and then use only those nodes as the basis for a vector search. Otherwise it's likely that any search will just match other versions of the same node, rather than a given version of different, similar nodes.

Over-fetching and filtering is kind of hard on our end because there's no limit to the number of versions a node can have. Any suggestion or solution would be great!

The only reasonable solution we have is to keep embeddings only on the most recent version of a node, and delete the embeddings when a new node version is created. Which is less than ideal, because it means search will only be available for the latest version of the graph.

Hi - same problem. And honestly, I don't understand what it even means for something like Neo4j to support vector queries without this feature.

Being able to pre-filter and then vector-rank the results is the #1 way to improve vector precision.

For our use case, with regards to meeting intelligence, we have queries like "show me everything Matt said about Sales in the last week's meetings". For this, we want to filter down to everything Matt said in those meetings, and then vector-rank those results against "Sales".

Post-filtering is terrible, as it may return everything anyone has ever said about sales, and it's very easy for Matt's comments to not be in the top 300+ results.

@matthew.parnell - any update on roadmap discussions, or information around what "simple operations" means?


Or wait... if I can't pre-filter, how do I only return vector results constrained to a workspace?

Without pre-filtering, it seems there is no such thing as a search-per-workspace or per one of my customers. Which makes the tech untenable for any application... I have to be missing something?

Hi,

My team is blocked by this as well. There are potential workarounds, but they have potential issues, which I discuss below.

In our case we are building RAG across document elements, where the RAG search is pre-scoped to a document or set of documents.

There can be hundreds of thousands of documents, which in turn can each have around 5k elements on average.

Performing a single search on a single vector index will not work (as others discovered above) unless we use a very high limit (e.g. 1k or more), as we'd have to post-filter the results down to the actual set of scoped documents. We need roughly the top 20 matches, but if we only select the top 20 from ALL documents, none of those matches may be in the scoped documents of interest, even if the scoped documents have valid results within our score threshold.


Workarounds

The only workaround I see is to have one label/index per document, for which we'd engineer automated creation and management. This could mean thousands of labels and indexes in our database. Can Neo4j scale like that?

Looking at the Neo4j documentation, it looks like indexes can be scoped to properties (if I'm wrong, then we need a unique label per document as well). The index name would need to include the document ID so Cypher can dynamically choose the right index(es) to use for a given RAG query.

Another workaround is to use the GDS cosine similarity function directly, after first filtering by documents. This avoids the index altogether. As the nodes are constrained to 100k or so, I need to check whether this will be fast enough for us.


Does this all make sense, and does it seem feasible for Neo4j to take this workaround approach?

Thanks!
Jonah

Hi,

I'm looking for the same feature.

I have a list of N nodes, and each node has an embedding property of 1536 dimensions. Can I use the GDS library to find the most similar nodes for a given query embedding?

We (Neo4j) miscommunicated a bit here on a technical detail. :sweat_smile: Neo4j supports KNN pre-filtered vector search, but that approach doesn't use the vector index's ANN search.

A pre-filtered search is typically in three parts:

  1. Use a graph query to filter down to the relevant nodes/vectors
  2. Calculate the similarity of each vector to the query vector and return the k nearest neighbors
  3. Use a graph query to expand and/or filter the results

Here's what that might look like:

MATCH (node)
WHERE node.filter = true
RETURN node, vector.similarity.cosine(node.vector, $query) AS score
ORDER BY score DESC
LIMIT 10

Please note that vector.similarity.cosine will arrive in v5.18 (mid-March); until then, the GDS equivalent, gds.similarity.cosine, also works.
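And to cover step 3 as well, you can expand from the scored nodes afterwards, e.g. (the labels, properties, and relationship types here are assumptions):

MATCH (node:DocumentChunk)
WHERE node.projectId IN $allowedProjects
WITH node, vector.similarity.cosine(node.vector, $query) AS score
ORDER BY score DESC
LIMIT 10
// Expand to the surrounding graph for the surviving top matches
MATCH (project:Project)-[:HAS_DOCUMENT]->(:Document)-[:HAS_CHUNK]->(node)
RETURN project, node, score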

The performance will depend heavily on the number of dimensions and vectors. From Python, I ran some quick tests against an 8 GB, 2-core server.

                             Number of vectors
Dimensions       100       1,000     10,000    
256              0.286890  0.276566  0.601308   
512              0.239640  0.293637  0.862113   
768              0.243407  0.334232  1.173242  
1024             0.244967  0.361001  1.469159  
1280             0.249935  0.392895  1.774763  
1536             0.253224  0.422107  2.155580

Performance from many queries in parallel was good for the server spec.

   Parallel #  Dims: 1024 Limit: 1000
            1                0.246938
           10                1.239489
          100                2.540894
         1000                2.989018

Has this been included in 5.18? It's not entirely clear from the docs as far as I can tell. To be clear, I'm looking to use this with LangChain to filter first by properties and then do a vector similarity search on those results.
Thanks

Awesome news. And for this pattern, how do we store the "vector" on the node? Do we still use the new vector API for data storage?

Does this not utilize or require an index at all?

Also, by the way, if there is any internal project at Neo4j to speed up this pattern of vector.similarity.cosine within the Cypher engine, that would be huge.

This is the CORE vector search pattern that compounds the utility of the graph DB technology. Without it, starting with a vector index and then post-filtering results is a really messy and poorly designed solution for namespacing any data in the graph for privacy/user concerns. I basically have to make an index per user, which totally sucks.

This solves that.

text-embedding-3 from OpenAI is 1536 dimensions long. Right now our data, post-filter, maxes out at a couple thousand vectors, so it's acceptable. But I'd foresee 10k+ vectors pretty regularly within ~3-6 months of app growth. The timing then gets kind of scary unless I can hack in better pre-filtering.

Any update or ETA on this?
Thanks :pray: