Neo4j index query returns different nodes when changing numberOfNearestNeighbours

I'm using the CALL db.index.vector.queryNodes procedure to build a RAG system on top of a graph DB. I've noticed that the nodes returned by the index change depending on the numberOfNearestNeighbours parameter. That is, calling the index with the parameter set to n returns a different set of nodes than calling the same index with the parameter set to n+m and then keeping the first n nodes of the returned list (ordered by score). Moreover, across various experiments, the second approach consistently returns more relevant nodes.

I collected some data on my DB. With method 1 I queried the index with the parameter set to 3, extracting the 3 most relevant chunks for the query embedding; with method 2 I set the parameter to 7, ordered the returned list by score, and kept the first 3 nodes.
Note that the query embedding is exactly the same in both cases; I do not recompute it.

Here are the results:

  • Differences detected: 1823 in a 5223 sample (34.9%)
    Out of 5223 queries, the two approaches returned different lists 1823 times.

  • Average difference in cosine similarity: -0.01135
    When the lists differed, the differing nodes from the second approach achieved, on average, a cosine similarity 0.01135 points higher than those retrieved by the first approach, with a variance of 0.000157. This led the second approach to retrieve the chunk that actually contained the answer 56.14% of the time, versus 50.89% for the first approach, so as you can see the difference is not trivial.

I attach the Python code I used to obtain these metrics:

from neo4j import GraphDatabase
from tqdm import tqdm

differences_detected = 0
avg_diff_list = []

for el in tqdm(df):
    # Search for top 3 pages
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        retrieved_chunks1, _, _ = driver.execute_query(
            """CALL db.index.vector.queryNodes("vector", 3, $embedding)
            YIELD node, score
            RETURN ID(node), score""",
            embedding=el["embedding"]
        )
    retrieved_chunks1 = set([x["ID(node)"] for x in retrieved_chunks1])
    # Search for top 7 pages
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        retrieved_chunks2, _, _ = driver.execute_query(
            """CALL db.index.vector.queryNodes("vector", 7, $embedding)
            YIELD node, score
            RETURN ID(node), score""",
            embedding=el["embedding"]
        )
    # Get top 3 pages for method 2
    # Order the list just to make sure
    retrieved_chunks2 = sorted(retrieved_chunks2, key=lambda d: d['score'], reverse=True)
    retrieved_chunks2 = set([x["ID(node)"] for x in retrieved_chunks2[:3]])
    # If the two sets are the same, skip to the next iteration
    if not retrieved_chunks1 - retrieved_chunks2:
        continue
    differences_detected += 1
    # Avg sim for list 1
    cos_sim_list = []
    for node_id in retrieved_chunks1 - retrieved_chunks2:
        with GraphDatabase.driver(URI, auth=AUTH) as driver:
            cos_sim, _, _ = driver.execute_query(
                """MATCH (a:Chunk)
                WHERE ID(a) = $node_id
                RETURN vector.similarity.cosine(a.embedding, $embedding) AS score""",
                embedding = el["embedding"], node_id = node_id
            )
        cos_sim_list.append(cos_sim[0]["score"])
    avg_cos_sim1 = sum(cos_sim_list)/len(cos_sim_list)
    # Avg sim for list 2
    cos_sim_list = []
    for node_id in retrieved_chunks2 - retrieved_chunks1:
        with GraphDatabase.driver(URI, auth=AUTH) as driver:
            cos_sim, _, _ = driver.execute_query(
                """MATCH (a:Chunk)
                WHERE ID(a) = $node_id
                RETURN vector.similarity.cosine(a.embedding, $embedding) AS score""",
                embedding = el["embedding"], node_id = node_id
            )
        cos_sim_list.append(cos_sim[0]["score"])
    avg_cos_sim2 = sum(cos_sim_list)/len(cos_sim_list)
    avg_diff_list.append(avg_cos_sim1 - avg_cos_sim2)

perc = (differences_detected/len(df)) * 100
print(f"Differences Detected: {differences_detected} in a {len(df)} sample ({round(perc,2)}%)")
m = sum(avg_diff_list)/len(avg_diff_list)
print(f"Average difference in cosine similarity: {round(m,6)}")
print(f"Variance of difference in cosine similarity: {round(sum((xi - m) ** 2 for xi in avg_diff_list) / len(avg_diff_list),6)}")

db version: 5.21.0

Why does this happen? Am I doing something wrong?

Davide,

Thanks for your question. The short answer is "no, you're not doing anything wrong." The vector index performs approximate nearest-neighbour search, and requesting more top "k" results widens the search, increasing the chance that the approximation matches the exact nearest neighbours, at the expense of performance.
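To make the effect concrete, here is a minimal, self-contained toy (not the Neo4j internals; all names and numbers are illustrative). It models an "approximate" search whose candidate pool grows with the requested k, so true neighbours can be missed for small k. Fetching 7 and keeping the exact-scored top 3, as in your method 2, recovers the true top 3 more often than asking for 3 directly.

```python
import math
import random

def cosine(a, b):
    # Plain cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(42)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(8)]

def exact_top(k):
    # Exhaustive search over all vectors: the "ground truth" neighbours.
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]

def approx_top(k, trial):
    # Crude stand-in for an ANN index: only a candidate pool whose size
    # grows with k is examined, so true neighbours can fall outside it.
    rng = random.Random(trial)
    pool = rng.sample(range(len(vectors)), min(len(vectors), 20 * k))
    order = sorted(pool, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]

def recall_at_3(k_fetch, trials=300):
    # Fraction of the true top 3 recovered when fetching k_fetch, keeping 3.
    truth = set(exact_top(3))
    hits = sum(len(truth & set(approx_top(k_fetch, t)[:3]))
               for t in range(trials))
    return hits / (3 * trials)

print(f"recall@3 when fetching 3: {recall_at_3(3):.2f}")
print(f"recall@3 when fetching 7: {recall_at_3(7):.2f}")
```

In the real index, the rough analogue of the growing pool is the candidate set explored at query time, which typically scales with the requested number of neighbours; over-fetching and then truncating to the exact-scored top n trades extra query time for recall.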

J
