I am trying to sample n nodes from a large graph. I am currently using the following query:
MATCH (p:Person)
WHERE rand() < 0.1
WITH p
ORDER BY rand()
LIMIT 1000
RETURN p
This query works, but it is extremely slow: it takes about a minute to sample 1000 nodes, which does not scale for my application, where I need to sample many times.
I think the reason it is slow is that it has to calculate rand() on each node …
If you are looking for “true” randomness … I think this is the only approach.
For pseudorandomness:
// Add a random property to every node (run once)
MATCH (p:Person)
SET p.randomValue = rand()
// Alternatively, calculate rand() once and pass it in as a parameter ("less random"; see the sketch after this query)
MATCH (p:Person)
WHERE p.randomValue >= rand()
RETURN p
LIMIT 1000
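As a minimal sketch of that parameter idea (assuming a hypothetical $threshold value computed client-side; an index on :Person(randomValue) would also let it avoid a full label scan):

// Parameter variant: compute one random threshold in the client and pass it in,
// so rand() is not re-evaluated for every row ("less random", but index-friendly)
MATCH (p:Person)
WHERE p.randomValue >= $threshold
RETURN p
LIMIT 1000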
Wouldn't the solution you suggested defeat the purpose of randomness? That is, if we rely on a fixed property of a node (randomly chosen initially), then this is as good as using the node ID (converted to a float between 0 and 1).
I think the reason it is slow is that it has to calculate rand() on each node
Do you mean it first assigns a random number to every node in the graph database, then chooses nodes? If so, would it not be better to assign a random property on the fly for every node it iterates on?
I would rather have true randomness, but I can also use pseudorandomness for dev purposes where I need to fix the seed (which Neo4j currently does not support). However, given that I have a graph with over 100 billion nodes, setting a property on every node does not seem efficient enough (although it would run only once, and I would have to update all the import methods to add this property to every new node inserted into the graph).
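Even if I did set the property, at that scale the one-time write would have to be batched; a rough sketch of what that would look like, assuming Neo4j 4.4+ batched transactions (older versions would need something like apoc.periodic.iterate):

// Batch the one-time assignment so the update is committed in chunks
// rather than in one huge transaction (run as an implicit/auto-commit transaction)
MATCH (p:Person)
CALL {
  WITH p
  SET p.randomValue = rand()
} IN TRANSACTIONS OF 10000 ROWS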
You could implement algorithms to extract subsets from those billions of nodes to make them more manageable and then use “true” randomness.
For example, using dates as epoch timestamps gives you a number; you can take it mod X to extract a semi-random sample of N nodes, then run your original query on those N:
MATCH (n:Person)
WHERE n.myNumberProperty % $myValue2Mod = 0
SET n:toRandomPerson // copy them to the :toRandomPerson label
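Concretely, that could look something like the following, reusing your original query against the smaller labelled subset:

// "Truly" random sample, drawn only from the pre-filtered subset
MATCH (p:toRandomPerson)
WITH p
ORDER BY rand()
LIMIT 1000
RETURN p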
That is a good pseudorandom solution for getting a subset of nodes and then using a truly random selection. It may still not scale well, though: building that subset may require a decent amount of memory on a single machine, so you would be limited to a very small sample size. And a small sample might not be representative enough of the entire graph, so it is not ideal for my application.
Thanks for your suggestions; it seems my initial approach is the best solution, and I will need to accept the slow initial sampling.
A sample size of 2000 can be appropriate for many statistical analyses, but the ideal sample size depends on several factors including the population size, the desired level of accuracy, and the variability within the population. For large, heterogeneous populations, a sample size of 2000 can provide a reasonable level of precision for many applications.
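As a rough illustration of where a figure like 2000 comes from (these numbers are my own, not from the thread), Cochran's formula for estimating a proportion gives

n_0 = \frac{z^2\, p(1-p)}{e^2} = \frac{1.96^2 \cdot 0.5 \cdot 0.5}{0.022^2} \approx 1985

i.e. roughly 2000 samples correspond to about a ±2.2% margin of error at 95% confidence under the most conservative assumption p = 0.5, and the required size barely grows with the population once the population is large.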