Returning a random subset of nodes (ideally repeatable)

oleg_neo4j · June 20, 2019, 6:36pm

Hello,
I'm trying to get a random subset of nodes returned (I'm downsampling my data here) and I would like it to be repeatable. Is there a way to set a seed in the rand() function?
My code is:

MATCH (doc:Document)
RETURN doc.title, rand() AS rand
ORDER BY rand ASC LIMIT 10

Also, is there a more efficient way, with fewer db hits, to get 10 random documents? This version visits every Document node and only picks 10 at the end.

stefan.armbruster · June 20, 2019, 6:57pm

Try this:

MATCH (:Document) WITH count(*) AS docCount
MATCH (doc:Document)
WHERE rand() < 10.0/docCount
RETURN doc.title

Note that this does not always give you exactly 10 docs back, so you might use a larger threshold and add a LIMIT 10 at the end.
That statement does not require large intermediary data structures, but it still iterates over all the documents.
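To illustrate why the count varies and how a larger threshold plus a limit helps, here is a minimal Python sketch of the same idea, run on a plain list rather than in the database. It is an illustration only, not part of the original reply: the function name `bernoulli_sample` and the `oversample` factor of 3 are assumptions chosen for the example.

```python
import random

def bernoulli_sample(items, target, oversample=3.0, rng=None):
    """Keep each item with probability oversample * target / len(items),
    then stop once `target` items are collected (the LIMIT step)."""
    rng = rng or random.Random()
    if not items:
        return []
    # Larger threshold than target/len(items), so that ending up with
    # fewer than `target` items becomes unlikely.
    threshold = min(1.0, oversample * target / len(items))
    picked = []
    for item in items:
        if rng.random() < threshold:
            picked.append(item)
            if len(picked) == target:  # plays the role of LIMIT
                break
    return picked

# Hypothetical data standing in for 38k Document titles.
docs = [f"doc-{i}" for i in range(38000)]
sample = bernoulli_sample(docs, 10)
```

One trade-off to keep in mind: because the loop stops as soon as the limit is reached, items early in the scan order are more likely to be chosen than items near the end, and the bias grows with the oversampling factor.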

Another approach that avoids a full label scan is below. First we need the highest ID in use, then we try to find a random node by ID that has the right label:

CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Primitive count") YIELD attributes
WITH attributes.NumberOfNodeIdsInUse.value AS maxId
UNWIND range(0,100000) AS x
MATCH (d:Document) WHERE id(d) = toInteger(rand()*maxId)
RETURN d LIMIT 10

oleg_neo4j · June 22, 2019, 5:20am

Thank you for the response :) With PROFILE I found that the first two are basically equal at 38k db hits. The last one has fewer db hits only for very low limits (below ~15 in my case, with 38k documents) and gets much more expensive for higher limits.
If I use one of the first two versions, is there a way to make the results repeatable with a seed of some sort, or where is the right place to request that as a feature?

andrew_bowman · June 22, 2019, 11:36pm

I don't think that would be possible, since ids can be reused as nodes are deleted and new nodes added. That alone would defeat any ability to have a seed that can repeat results based on graph id lookup.
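One possible workaround, which was not proposed in this thread, is to avoid rand() and graph ids altogether and rank nodes by a hash of a stable property combined with a seed. The Python sketch below shows the idea client-side on a list of values already fetched from the database. It assumes that doc.title is unique and does not change; the names `seeded_sample` and `seed` are hypothetical.

```python
import hashlib

def seeded_sample(keys, k, seed):
    """Return k keys, ranked by the SHA-256 hash of 'seed:key'.
    The same keys and seed always give the same result,
    regardless of the order in which the keys arrive."""
    def rank(key):
        return hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
    # set() drops duplicates; sorting by hash gives a pseudo-random
    # but fully deterministic order.
    return sorted(set(keys), key=rank)[:k]

# Hypothetical titles standing in for the result of
# MATCH (doc:Document) RETURN doc.title
titles = [f"title-{i}" for i in range(1000)]
subset = seeded_sample(titles, 10, seed=1)
```

This still reads every title once, so it does not reduce db hits. What it buys is repeatability: the subset depends only on the titles and the seed, so reused ids do not affect it, and deleting a document that was not selected leaves the subset unchanged.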

oleg_neo4j · June 24, 2019, 4:08am

Ah, yah, that does make sense, thanks for the reply :)
