Hello all,
I am running into a strange problem when trying to randomly sample a few neighbors of particular nodes.
- neo4j version, desktop version, browser version
I am using neo4j:5.5.0-community docker instance on linux.
With the following parameters:
- NEO4J_AUTH=none
# - NEO4_PLUGINS='["graph-data-science"]'
- apoc.export.file.enabled=true
- apoc.import.file.enabled=true
- apoc.import.file.use_neo4j_config=true
- NEO4J_PLUGINS=["apoc"]
- NEO4J_dbms_memory_transaction_total_max=0
- NEO4J_dbms_memory_heap_initial__size=32g
- NEO4J_dbms_memory_heap_max__size=32g
- what kind of API / driver do you use
I am using Python with the py2neo package.
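A note on the query text below: the doubled braces (`{{`, `}}`) and the `{direction_arrow}` placeholder are consistent with a Python `str.format` template, where `{{` renders as a literal `{`. Here is a minimal sketch of that rendering step, assuming the query string really is built this way. The helper name `render_query` and the `"-->"` arrow value are hypothetical and only for illustration.

```python
# Template as posted (second query). "{{" / "}}" are escaped braces for
# str.format; "$node_id_list" and "$num_samples" are Cypher parameters and
# are left untouched by Python formatting.
QUERY_TEMPLATE = """
MATCH (cur_node)
WHERE ANY (id IN cur_node.ID WHERE id IN $node_id_list)
CALL {{
    WITH cur_node
    MATCH (cur_node){direction_arrow}(neighbors)
    WITH rand() AS r, neighbors
    ORDER BY r
    LIMIT $num_samples
    RETURN collect(r) AS rand, collect(neighbors.ID) AS nn
}}
RETURN rand, nn AS neighbors
"""


def render_query(direction_arrow: str) -> str:
    """Fill in the relationship pattern (hypothetical helper)."""
    return QUERY_TEMPLATE.format(direction_arrow=direction_arrow)


# "-->" is an assumed example value: any outgoing relationship.
query = render_query("-->")
print(query)
```

The rendered string is what actually reaches the server, together with the `$node_id_list` and `$num_samples` parameters (for example through py2neo's `Graph.run`).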
I tried 2 queries:
First, using APOC:
MATCH (cur_node)
WHERE ANY (id IN cur_node.ID WHERE id IN $node_id_list)
WITH cur_node
CALL apoc.cypher.run(
'MATCH (cur_node){direction_arrow}(neighbors)
WITH rand() as r, neighbors
ORDER BY r
LIMIT $num_samples
RETURN collect(r) AS rand, collect(neighbors.ID) AS nn',
{{cur_node: cur_node, num_samples: $num_samples}}) YIELD value
RETURN value.rand, value.nn AS neighbors
OUTPUT:
first iteration:
{'value.rand':
[0.00015447678200963821, 0.0002513570108183538, 0.00036323837330132225, 0.0004571500939272166, 0.0005692942868501527, 0.0006319216646581971, 0.0009800730498202848, 0.0009975206145961257, 0.0012616359436184998, ...],
'neighbors': [171053, 125765, 89153, 49813, 57168, 140995, 81320, 67133, 216481, ...]}
second iteration:
{'value.rand':
[0.00015406391020234, 0.00022974141129827874, 0.00027559936540422214, 0.00048246754345004916, 0.0005592177219301275, 0.0008769336556695428, 0.0009803530300780405, 0.0010443457475343143, 0.001150494865988283, ...],
'neighbors': [76714, 187927, 213515, 166957, 52992, 182661, 73519, 150725, 127881, ...]}
Second, without APOC, using CALL {}:
MATCH (cur_node)
WHERE ANY (id IN cur_node.ID WHERE id IN $node_id_list)
CALL {{
WITH cur_node
MATCH (cur_node){direction_arrow}(neighbors)
WITH rand() as r, neighbors
ORDER BY r
LIMIT $num_samples
RETURN collect(r) AS rand, collect(neighbors.ID) AS nn
}}
RETURN rand, nn AS neighbors
OUTPUT:
first iteration:
{'rand':
[0.15242892235650374, 0.15242892235650374, 0.15242892235650374, 0.15242892235650374, 0.15242892235650374, 0.15242892235650374, 0.15242892235650374, 0.15242892235650374, 0.15242892235650374, ...],
'neighbors': [3202, 232954, 232774, 231676, 231841, 231775, 231566, 231065, 230755, ...]}
second iteration:
{'rand':
[0.9181689214459184, 0.9181689214459184, 0.9181689214459184, 0.9181689214459184, 0.9181689214459184, 0.9181689214459184, 0.9181689214459184, 0.9181689214459184, 0.9181689214459184, ...],
'neighbors': [3202, 232954, 232774, 231676, 231841, 231775, 231566, 231065, 230755, ...]}
The issue is that with CALL {}, rand() seems to be evaluated only once per subquery, so every row gets the same value and the ORDER BY / LIMIT is meaningless: both iterations return the same neighbors. I could use APOC, but as I understand it, CALL {} performs better (and it is also easier to debug with PROFILE).
The APOC query, on the other hand, gives the correct result: a different random value per row and different neighbors on each iteration.
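To make the symptom concrete, here is a plain-Python analogy (not Neo4j itself) of what the two outputs look like. It assumes the CALL {} variant behaves as if one random value were shared by all rows, which is what the output suggests. The function names are made up for illustration, and the neighbor list is just the nine IDs visible in the truncated output above.

```python
import random

# The nine neighbor IDs visible in the CALL {} output (the rest is elided).
neighbors = [3202, 232954, 232774, 231676, 231841, 231775, 231566, 231065, 230755]
num_samples = 3


def sample_with_shared_key(items, k, rng):
    """One random value for all rows, like the CALL {} output."""
    r = rng.random()  # drawn once, reused for every row
    keyed = [(r, item) for item in items]
    # Python's sort is stable, so equal keys keep the input order.
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:k]]


def sample_with_per_row_key(items, k, rng):
    """A fresh random value per row, like the APOC output."""
    keyed = [(rng.random(), item) for item in items]
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:k]]


rng = random.Random()
shared_first = sample_with_shared_key(neighbors, num_samples, rng)
shared_second = sample_with_shared_key(neighbors, num_samples, rng)
per_row = sample_with_per_row_key(neighbors, num_samples, rng)

print("shared key, run 1:", shared_first)
print("shared key, run 2:", shared_second)
print("per-row key:      ", per_row)
```

With a shared key the sort is a no-op, so the "sample" is always the first `num_samples` rows in whatever order they arrived, which matches the identical neighbor lists in both CALL {} iterations. Python guarantees a stable sort; I do not know whether Neo4j documents any ordering for tied keys, so this only mirrors the behavior I observe.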
Any idea what I may be doing wrong?