Neo4j "vectorizer" - how to automatically adjust embeddings when the original data changes

Hi, Neo4j 5 has great capabilities for storing, indexing and searching across vectors. But creating vector embeddings and keeping them up to date as the original data changes still has to be done manually, or through some sort of custom code that runs each time the underlying data is created, updated or deleted.

Based on this interesting article about “vectorizers” (Vector Databases Are the Wrong Abstraction), I wonder what the best approach would be to implement such a “vectorizer” in Neo4j?

The goal is to automatically create a vector representation (embedding) of source data (e.g. a property or a set of properties) and automatically update the embeddings when the source data has changed, similar to a specialized index.

Is there a best practice to solve this?
Is there anything planned for future versions?

Best,
Reiner

Great question. My initial thought: it depends. Strictly speaking, if you want your data to be consistent and correct at all times, you would not update the graph until you also have the new embedding, so both can happen in the same transaction. But I doubt any user would be happy with that "slow writes" experience.

Do I read the linked article right that they suggest updating vectors asynchronously? That can be done in so many ways, but yes, it would require a bit of coding.

The pattern with a worker that does this looks good to me. But it would need to know which source data (e.g. which properties) should be used for an embedding, so making it generic would probably result in a module with tons of configuration (I don't like that).

As of today, I think many projects use tools like LangChain. I like to think of these tools as solving the "orchestration" or data-integration need. And if I look at the "vectorizer" from a very high level, it would be a piece of software that can integrate with any database and call any API. Now, where have I heard that before?

Thank you for your thoughts. Do you mean something like Kafka to stream the changes to a tool that generates the embeddings and writes them back?

To be honest, my hope was to get a reply like "yes, sure, we're just working on such a vectorizing index that will be included with the next enterprise version" ;-)

Best,
Reiner

If Kafka is your poison, then yes, I would write my own event processor. If you do things more in batches on Spark, then use that. If you are a big fan of Lambda... There are so many options.

If you want to take shortcuts and accept a strawman solution, there are already options to call OpenAI (or any other API) from APOC or genai.vector.encode (https://neo4j.com/docs/cypher-manual/current/genai-integrations/#single-embedding). But then again, calling these "on write" would probably not be fast enough, so you would still have to come up with something that runs them afterwards as an async task.
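
For reference, a single "on write" embedding call looks roughly like this minimal sketch (assuming the GenAI plugin is installed and an OpenAI token is passed as $token; the Application/JobTitle names are made up for illustration):

```
// embed one property value and store it as a vector property on the node
MATCH (a:Application {id: $id})
WITH a, genai.vector.encode(a.JobTitle, 'OpenAI', { token: $token }) AS vector
CALL db.create.setNodeVectorProperty(a, 'jobTitleEmbedding', vector);
```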

We are entirely based on C# and Azure, so writing our own event processor doesn't seem like an option.

We might call OpenAI within the action of an apoc.trigger, but are triggers reliable? I have never worked with them before. I assume they don't affect write performance?

I would not waste time learning triggers in APOC; I would rather invest the time in creating an Azure Function (or similar). I think you will have more control and a better experience with that. If you are a C# shop, then becoming fluent with the .NET driver probably has more long-term value and short-term success.

My naive solution would be like this:

  • The API/service that inserts/updates data stamps an extra label on the node whenever any of the info to embed is modified
  • The "vectorizer service" looks for nodes with the extra label and does its work (can be a polling function); see the Cypher sketch below
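
A rough Cypher sketch of both sides of that pattern (the NeedsEmbedding label and the Application/JobTitle names are placeholders, and the encode call again assumes the GenAI plugin with an OpenAI token in $token):

```
// writer side: stamp an extra label whenever a property that feeds the embedding changes
MATCH (a:Application {id: $id})
SET a.JobTitle = $jobTitle, a:NeedsEmbedding;

// vectorizer side: poll for stamped nodes, embed, store, and clear the label
MATCH (a:Application:NeedsEmbedding)
WITH a LIMIT 100
WITH a, genai.vector.encode(a.JobTitle, 'OpenAI', { token: $token }) AS vector
CALL db.create.setNodeVectorProperty(a, 'jobTitleEmbedding', vector)
REMOVE a:NeedsEmbedding;
```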

Yes, Azure Functions are easy for us, and DB access is very convenient with the C# Neo4jClient package (thanks to @charlotte.skardon). Embeddings are fetched via the Azure OpenAI API. All fine.

But I still have no idea how to recognize which embeddings to recalculate when the relevant properties are added or changed, without modifying our main application.

The API/service that inserts/updates data stamps an extra label on the node whenever any of the info to embed is modified

This is exactly what I wanted to avoid, as there are so many places in code that might add, change or delete these nodes or properties. Unfortunately it is not an API service.

Sure, if I want to vectorize Application.JobTitle I can add another property called JobTitleEmbedded that is set by the worker together with the embedding itself. Then I could compare JobTitle against JobTitleEmbedded to recognize changes. But this would result in millions of extra properties with duplicate data, and the Azure Function would have to compare them all every x seconds/minutes to find changes. That doesn't sound very elegant and would put quite a load on the database.
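
For illustration, the polling query for that shadow-copy approach would be something like this sketch, which is nothing but a repeated full scan over all Application nodes:

```
// find nodes whose source text drifted from the copy stored at embedding time
MATCH (a:Application)
WHERE a.JobTitleEmbedded IS NULL OR a.JobTitle <> a.JobTitleEmbedded
RETURN a
LIMIT 100;
```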

Thanks for the additional context. It sounds like identifying where changes have been made is the hardest problem to solve.

Maybe CDC is an option (all changes, or with some filter; see "Introduction - Change Data Capture" in the Neo4j docs)?
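
Only as a rough sketch (assuming Neo4j 5.13+ with CDC enabled on the database; the selector and the Application label follow the naming from above):

```
// one-time: enable change data capture for the database
ALTER DATABASE neo4j SET OPTION txLogEnrichment 'DIFF';

// vectorizer service: remember a cursor once ...
CALL db.cdc.current();

// ... then repeatedly poll for changes to Application nodes since that cursor
CALL db.cdc.query($previousChangeId, [
  { select: 'n', labels: ['Application'] }
]);
```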

💯


Thank you very much! Never heard of CDC before, but it seems to be exactly what we need to easily query the changes without any need to touch the code of the main application. :-)

I will give it a try as soon as we have upgraded our application from 4.4 to the needed version. (To get there, "just" 120 lines across several Cypher queries have to be rewritten, as we heavily used collect() within map projections, which is no longer supported in 5.x...)
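
For illustration, this is the kind of rewrite needed (a made-up example; the aggregation moves out of the map projection into a preceding WITH):

```
// 4.4 style: aggregation inside a map projection
MATCH (p:Person)-[:OWNS]->(c:Car)
RETURN p { .name, cars: collect(c.model) };

// 5.x style: aggregate first, then project
MATCH (p:Person)-[:OWNS]->(c:Car)
WITH p, collect(c.model) AS cars
RETURN p { .name, cars: cars };
```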

Best,
Reiner