Odd result when using langchain Neo4jVector.from_existing_graph

I'm seeing an odd result when using Neo4jVector.from_existing_graph that I hope someone can shed some light on.

The short story is that embedding a property with a string value, then doing a similarity search for that exact string value does not return a 100% match.

The attached Python notebook compares two methods of embedding a text property in a single node labelled "EmbeddingTest".

Method 1 creates a vector index manually, then embeds a string value, then saves that vector back to Neo4j. This vector is EmbeddingTest.embedding_text_1.

Method 2 uses Neo4jVector.from_existing_graph to create the index, perform the embedding, and save the vector back to Neo4j in a single step. This vector is EmbeddingTest.embedding_text_2.

A similarity search for the original string is performed against both vectors. Method 1 scores 1.0 as expected, but Method 2 scores 0.973. Why? This should be an exact match.

Attached is a Python notebook with this test scenario and a screenshot showing that the vectors are indeed different even though the embedding settings are the same.

My only hunch is that Method 2 is embedding some metadata in addition to the node property value, but I can't find any evidence that this is the case.

Any ideas or insight would be greatly appreciated.

embedding_test.py.txt (3.9 KB)

Hi Kurt,

Looking at the LangChain code... it appears that the string which is actually embedded is:

\n[PROPERTY NAME]:property value

(without the brackets). So, the embedding is for:

\ntext:This is the sample content that is used for the embedding test.
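A minimal Python sketch of that format, in case it helps to reproduce the string by hand. The helper name `embedded_text` is hypothetical (it is not part of LangChain), and it only covers the single-property case described above:

```python
def embedded_text(prop_name: str, value: str) -> str:
    """Hypothetical helper: rebuild the string described above,
    i.e. a newline, the property name, a colon, then the property value."""
    return f"\n{prop_name}:{value}"


test_text = "This is the sample content that is used for the embedding test."
search_text = embedded_text("text", test_text)

# repr() makes the leading newline visible
print(repr(search_text))
```

Searching with a string built this way, instead of the bare property value, should line up with what was embedded, assuming the format above is what the installed version uses.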

Thanks John! I was guessing it was something like this so thank you for the details.

I've confirmed this by doing a similarity search using

search_text="\ntext:"+test_text

and it indeed comes back with a score of 1.0 for method 2.

It seems like this is somewhat important and should probably be noted clearly somewhere in the LangChain docs. I would imagine there could be significant unexpected effects for short text values, where the shared property-name prefix inflates match scores between otherwise unrelated values. And for empty values, every node would return a near-exact match on the property name alone.
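To illustrate the direction of that effect, here is a toy sketch. It does not use a real embedding model: character-trigram counts stand in for the embedding, so the numbers mean nothing in absolute terms. It only shows that a shared "\ntext:" prefix makes two unrelated short values look more alike than they are. All names here are made up for the example:

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Count overlapping character trigrams (a toy stand-in for an embedding)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


prefix = "\ntext:"  # the property-name prefix discussed above

raw = cosine(trigram_vector("cat"), trigram_vector("dog"))
prefixed = cosine(trigram_vector(prefix + "cat"), trigram_vector(prefix + "dog"))

print(f"raw values:      {raw:.2f}")
print(f"with the prefix: {prefixed:.2f}")
```

A real embedding model will weigh the prefix differently, so the actual size of the effect would need to be measured against the model in use.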

-Kurt
