I am trying to sample n nodes from a large graph. I am currently using the following query:
MATCH (p:Person)
WHERE rand() < 0.1
WITH p
ORDER BY rand()
LIMIT 1000
RETURN p
This query works, but it is extremely slow: it takes about a minute to sample 1000 nodes, which does not scale for my application, where I need to sample many times.
I think the reason it is slow is that it has to calculate rand() on each node …
If you are looking for “true” randomness … I think this is the only approach.
For pseudorandomness:
// Add a random property to every node (run once)
MATCH (p:Person)
SET p.randomValue = rand()
// Alternatively, calculate rand() once and pass it in as a parameter ("less random"; see the sketch after this query)
MATCH (p:Person)
WHERE p.randomValue >= rand()
RETURN p
LIMIT 1000
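As a minimal sketch of that parameter idea (assuming a hypothetical $threshold value computed client-side; an index on :Person(randomValue) would also let it avoid a full label scan):

// Parameter variant: compute one random threshold in the client and pass it in,
// so rand() is not re-evaluated for every row ("less random", but index-friendly)
MATCH (p:Person)
WHERE p.randomValue >= $threshold
RETURN p
LIMIT 1000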
Wouldn't the solution you suggested defeat the purpose of randomness? That is, if we rely on a fixed property of a node (randomly chosen initially), then this is as good as using the node ID (converted to a float between 0 and 1).
I think the reason it is slow is that it has to calculate rand() on each node
Do you mean it first assigns a random number to every node in the graph database, then chooses nodes? If so, would it not be better to assign a random property on the fly for every node it iterates on?
I would rather have true randomness, but I can also use pseudorandomness for dev purposes where I need to fix the seed (which Neo4j currently does not support). However, given that I have a graph with over 100 billion nodes, setting a property on every node does not seem efficient enough (although it would run only once, and I would have to update all the import methods to add this property to every new node inserted into the graph).
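Even if I did set the property, at that scale the one-time write would have to be batched; a rough sketch of what that would look like, assuming Neo4j 4.4+ batched transactions (older versions would need something like apoc.periodic.iterate):

// Batch the one-time assignment so the update is committed in chunks
// rather than in one huge transaction (run as an implicit/auto-commit transaction)
MATCH (p:Person)
CALL {
  WITH p
  SET p.randomValue = rand()
} IN TRANSACTIONS OF 10000 ROWS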
You could implement algorithms to extract subsets from those billions of nodes to make them more manageable and then use “true” randomness.
For example, using dates as epoch timestamps gives you a number; you can take it mod X to extract a semi-random sample of N nodes, then run your original query on those N:
MATCH (n:Person)
WHERE n.myNumberProperty % $myValue2Mod = 0
SET n:toRandomPerson // copy them to the :toRandomPerson label
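Concretely, that could look something like the following, reusing your original query against the smaller labelled subset:

// "Truly" random sample, drawn only from the pre-filtered subset
MATCH (p:toRandomPerson)
WITH p
ORDER BY rand()
LIMIT 1000
RETURN p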
That is a good pseudorandom solution for getting a subset of nodes and then using a truly random selection. It may still not scale well, though: building that subset may require a decent amount of memory on a single machine, so you would be limited to a very small sample size. And a small sample might not be representative enough of the entire graph, so it is not ideal for my application.
Thanks for your suggestions; it seems my initial approach is the best solution, and I will need to accept the slow initial sampling.
A sample size of 2000 can be appropriate for many statistical analyses, but the ideal sample size depends on several factors including the population size, the desired level of accuracy, and the variability within the population. For large, heterogeneous populations, a sample size of 2000 can provide a reasonable level of precision for many applications.
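As a rough illustration of where a figure like 2000 comes from (these numbers are my own, not from the thread), Cochran's formula for estimating a proportion gives

n_0 = \frac{z^2\, p(1-p)}{e^2} = \frac{1.96^2 \cdot 0.5 \cdot 0.5}{0.022^2} \approx 1985

i.e. roughly 2000 samples correspond to about a ±2.2% margin of error at 95% confidence under the most conservative assumption p = 0.5, and the required size barely grows with the population once the population is large.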