Odd result when using langchain Neo4jVector.from_existing_graph

kurt_slater · November 13, 2024, 5:17pm

I'm seeing an odd result when using Neo4jVector.from_existing_graph that I hope someone can shed some light on.

The short story is that embedding a property with a string value, then doing a similarity search for that exact string value does not return a 100% match.

The attached Python notebook compares two methods of embedding a text property on a single node labelled "EmbeddingTest".

Method 1 creates a vector index manually, then embeds a string value, then saves that vector back to Neo4j. This vector is EmbeddingTest.embedding_text_1.

Method 2 uses Neo4jVector.from_existing_graph to create the index, perform the embedding and save the vector back to Neo4j as a single step. This vector is EmbeddingTest.embedding_text_2.

A similarity search is performed using both vectors. Method 1 scores 1.0 as expected, but Method 2 scores 0.973. Why? This should be an exact match.

Attached is a Python notebook with this test scenario and a screenshot showing that the vectors are indeed different even though the embedding settings are the same.

My only hunch is that Method 2 is embedding some metadata in addition to the node property value, but I can't find any evidence that this is the case.

Any ideas or insight would be greatly appreciated.

embedding_test.py.txt (3.9 KB)

john.stegeman · November 13, 2024, 6:34pm

Hi Kurt,

Looking at the LangChain code, it appears that the string which is actually embedded is:

\n[PROPERTY NAME]:property value

(without the brackets). So, the embedding is for:

\ntext:This is the sample content that is used for the embedding test.
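As a rough illustration of the format described above, here is a minimal Python sketch that builds the same kind of string. The helper name build_embedded_text is hypothetical and is not part of LangChain, and treating a missing property as an empty string is an assumption on my part; it only mirrors the "\n[PROPERTY NAME]:property value" pattern for each listed property.

```python
from typing import Mapping, Optional, Sequence


def build_embedded_text(
    properties: Mapping[str, Optional[str]],
    text_node_properties: Sequence[str],
) -> str:
    """Join each listed property as a newline, the name, a colon, then the value.

    Hypothetical helper: it imitates the string format described in this thread.
    Missing or None values are treated as empty strings (an assumption).
    """
    parts = []
    for name in text_node_properties:
        value = properties.get(name)
        parts.append(f"\n{name}:{'' if value is None else value}")
    return "".join(parts)


node = {"text": "This is the sample content that is used for the embedding test."}
embedded = build_embedded_text(node, ["text"])

# repr() makes the leading newline visible.
print(repr(embedded))
```

If this matches what the library does, a query has to be built the same way before it can score as an exact match against the stored vector.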

github.com

langchain-ai/langchain/blob/master/libs/community/langchain_community/vectorstores/neo4j_vector.py


kurt_slater · November 13, 2024, 8:19pm

Thanks John! I was guessing it was something like this so thank you for the details.

I've confirmed this by doing a similarity search using

search_text="\ntext:"+test_text

and it indeed comes back with a score of 1.0 for method 2.

It seems like this is somewhat important and probably should be noted clearly somewhere in the langchain docs. I would imagine there could be some significant unexpected effects for small text values where the property name itself results in unexpectedly high match scores. Or in the case of empty values, they will all return a near exact match just for the property name.
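To make that concern concrete, here is a small self-contained sketch. It does not use a real embedding model: toy_embed is a stand-in that counts character trigrams, so the numbers are only illustrative and will not match 0.973 or any real model. The function names and the default property name "text" are assumptions for the example. It only shows the direction of the effect: a shared "\ntext:" prefix pulls short values closer together, and empty values become identical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts of character trigrams."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stored_text(value: str, prop: str = "text") -> str:
    """The string assumed to be embedded for a node, per the format in this thread."""
    return f"\n{prop}:{value}"


sample = "This is the sample content that is used for the embedding test."

# Querying with the raw value vs. the prefixed value against the stored text.
raw_vs_stored = cosine(toy_embed(sample), toy_embed(stored_text(sample)))
stored_vs_stored = cosine(toy_embed(stored_text(sample)), toy_embed(stored_text(sample)))

# Two unrelated one-character values, with and without the shared prefix.
short_raw = cosine(toy_embed("a"), toy_embed("b"))
short_stored = cosine(toy_embed(stored_text("a")), toy_embed(stored_text("b")))

# Two nodes with empty values embed the same string.
empty_stored = cosine(toy_embed(stored_text("")), toy_embed(stored_text("")))

print(f"raw query vs stored text:      {raw_vs_stored:.3f}")
print(f"prefixed query vs stored text: {stored_vs_stored:.3f}")
print(f"'a' vs 'b' without prefix:     {short_raw:.3f}")
print(f"'a' vs 'b' with prefix:        {short_stored:.3f}")
print(f"empty vs empty with prefix:    {empty_stored:.3f}")
```

With a real model the scores would differ, but the same pattern seems likely: the shorter the property value, the larger the share of the embedded string that is just the property name.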

-Kurt
