I'm using the CALL db.index.vector.queryNodes
procedure to build a RAG system on top of a graph database. I've noticed that the nodes returned by the index change depending on the numberOfNearestNeighbours
parameter. That is, calling the index with the parameter set to n returns a different set of nodes than calling the same index with the parameter set to n+m and then taking the first n nodes from the returned list (ordered by score). Furthermore, across several experiments, I noticed that the second approach consistently returns more relevant nodes.
I collected some data on my db. In method 1 I queried the index with the parameter set to 3 in order to extract the 3 most relevant chunks given the query embedding; in method 2 I set the parameter to 7, ordered the returned list by score, and took the first 3 nodes.
Note that the query embedding is exactly the same in both calls; I do not recompute it.
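To rule out an ordering artifact on my side: with exact cosine search, the top 3 is by definition a prefix of the top 7, so the two methods could never disagree. A toy pure-Python check (random vectors, not my real data) illustrating that prefix consistency:

```python
import heapq
import math
import random

random.seed(0)
# Toy corpus of 1000 embeddings plus one query embedding
docs = [[random.gauss(0, 1) for _ in range(64)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(64)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sims = [cosine(d, query) for d in docs]

# "Method 1": exact top 3 by similarity
top3 = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:3]
# "Method 2": exact top 7, then keep the first 3
top7 = heapq.nlargest(7, range(len(docs)), key=lambda i: sims[i])

# Exact search is prefix-consistent, so any disagreement between the two
# index calls implies the index search itself is approximate
print(top3 == top7[:3])  # True
```

So whatever causes the discrepancy below has to come from the index lookup itself, not from how I slice the results.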
Here are the results:

- Differences detected: 1823 out of 5223 samples (34.9%). That is, out of 5223 queries, the lists returned by the two approaches differed 1823 times.
- Average difference in cosine similarity: -0.01135 (negative because I compute method 1 minus method 2). When the lists were different, the non-overlapping nodes retrieved by the second approach achieved, on average, a cosine similarity 0.01135 points higher than those retrieved by the first approach, with a variance of 0.000157. This led the second approach to retrieve the chunks that actually contained the answer 56.14% of the time, while the first approach only achieved 50.89%, so as you can see the difference is not trivial.
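As a quick arithmetic check on the figures above (numbers hard-coded from my run):

```python
# Fraction of queries where the two approaches disagreed
differences_detected, total = 1823, 5223
print(round(differences_detected / total * 100, 2))  # 34.9

# Gap in answer-containing retrieval rate between the two approaches
print(round(56.14 - 50.89, 2))  # 5.25
```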
Here is the Python code I used to obtain these metrics:
from neo4j import GraphDatabase
from tqdm import tqdm

differences_detected = 0
avg_diff_list = []

# Reuse a single driver instead of opening one per query
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    for el in tqdm(df):
        # Method 1: ask the index directly for the top 3 chunks
        retrieved_chunks1, _, _ = driver.execute_query(
            """CALL db.index.vector.queryNodes("vector", 3, $embedding)
            YIELD node, score
            RETURN ID(node), score""",
            embedding=el["embedding"]
        )
        retrieved_chunks1 = set(x["ID(node)"] for x in retrieved_chunks1)
        # Method 2: ask the index for the top 7 chunks
        retrieved_chunks2, _, _ = driver.execute_query(
            """CALL db.index.vector.queryNodes("vector", 7, $embedding)
            YIELD node, score
            RETURN ID(node), score""",
            embedding=el["embedding"]
        )
        # Order the list by score (just to make sure) and keep the first 3
        retrieved_chunks2 = sorted(retrieved_chunks2, key=lambda d: d["score"], reverse=True)
        retrieved_chunks2 = set(x["ID(node)"] for x in retrieved_chunks2[:3])
        # If the two sets are the same, skip to the next iteration
        if not retrieved_chunks1 - retrieved_chunks2:
            continue
        differences_detected += 1
        # Average similarity of the nodes only method 1 returned
        cos_sim_list = []
        for node_id in retrieved_chunks1 - retrieved_chunks2:
            cos_sim, _, _ = driver.execute_query(
                """MATCH (a:Chunk)
                WHERE ID(a) = $node_id
                RETURN vector.similarity.cosine(a.embedding, $embedding) AS score""",
                embedding=el["embedding"], node_id=node_id
            )
            cos_sim_list.append(cos_sim[0]["score"])
        avg_cos_sim1 = sum(cos_sim_list) / len(cos_sim_list)
        # Average similarity of the nodes only method 2 returned
        cos_sim_list = []
        for node_id in retrieved_chunks2 - retrieved_chunks1:
            cos_sim, _, _ = driver.execute_query(
                """MATCH (a:Chunk)
                WHERE ID(a) = $node_id
                RETURN vector.similarity.cosine(a.embedding, $embedding) AS score""",
                embedding=el["embedding"], node_id=node_id
            )
            cos_sim_list.append(cos_sim[0]["score"])
        avg_cos_sim2 = sum(cos_sim_list) / len(cos_sim_list)
        avg_diff_list.append(avg_cos_sim1 - avg_cos_sim2)

perc = (differences_detected / len(df)) * 100
print(f"Differences Detected: {differences_detected} in a {len(df)} sample ({round(perc, 2)}%)")
m = sum(avg_diff_list) / len(avg_diff_list)
print(f"Average difference in cosine similarity: {round(m, 6)}")
v = sum((xi - m) ** 2 for xi in avg_diff_list) / len(avg_diff_list)
print(f"Variance of difference in cosine similarity: {round(v, 6)}")
Neo4j version: 5.21.0
Why does this happen? Am I doing something wrong?