I have a Neo4j graph of 600,000 nodes, connected to each other in the form (a:item)-[r]-(b:item). How do I get a random sample of the graph with 10,000 nodes and the relationships between them?
(Basically, what I need is a random graph consisting of the relationships between 10k item nodes.)
MATCH (n:item)-[]->(:item)
WITH id(n) AS maxId ORDER BY maxId DESC LIMIT 1 // highest id among nodes with outgoing relationships
UNWIND range(1, 10000*3) AS x // change this multiplier (3) if needed
WITH toInteger(rand() * (maxId + 1)) AS randomId // one random id from 0 to maxId per iteration
MATCH p=(n:item)-[r]->(m:item)
WHERE id(n) = randomId
RETURN DISTINCT p
LIMIT 10000
That is, I loop 10000 multiplied by 3 times and, on each iteration, look for a node whose id is a random number between 0 and maxId.
I multiplied by 3 because I can't be sure that a given random id belongs to a node that matches p=(n:item)-[r]->(m:item); it may belong to something else. For the same reason I added LIMIT 10000, to make sure no more than 10,000 paths are returned.
You can change this multiplier (3) to suit your dataset.
Of course, if the multiplier is too small, fewer than 10,000 paths could be extracted.
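To make the oversampling argument concrete, here is a minimal Python sketch of the same idea on toy data. It is not Cypher and not the query above: `sample_start_ids`, the toy id range, and the rule "only even ids are valid" are all made up for illustration, and it models only the random-id draw for start nodes, not the paths.

```python
import random

def sample_start_ids(valid_ids, max_id, draws, limit, rng):
    """Draw `draws` random ids in [0, max_id]; keep distinct hits, up to `limit`."""
    hits = []
    seen = set()
    for _ in range(draws):
        candidate = rng.randint(0, max_id)  # inclusive on both ends
        if candidate in valid_ids and candidate not in seen:
            seen.add(candidate)
            hits.append(candidate)
            if len(hits) == limit:
                break  # plays the role of LIMIT
    return hits

# Toy data (assumed): ids 0..999, of which only the even ones would match the pattern.
valid_ids = set(range(0, 1000, 2))
hits = sample_start_ids(valid_ids, max_id=999, draws=100 * 3, limit=100,
                        rng=random.Random(42))
print(len(hits))  # never more than 100; lower if too many draws miss or repeat
```

Lowering `draws` (the multiplier) makes it more likely that fewer than `limit` ids come back, which is the trade-off described above.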
@giuseppe_villan
Thanks for the answer, but there is one problem with this solution. What I actually want is all the relationships among 10,000 items (each item has multiple relationships to other items), so here we might miss many. Say we loop n times and I find 10k (n:item) nodes; because of their relationships I might get 30k (m:item) nodes, and then, due to LIMIT 10000, I lose information.
So what I am after is a random sample/sub-network of 10k items together with the relationships between them. Something like getting a list of 10k random item ids and then checking that both (n:item) and (m:item) are in that list would work, I guess, but I'm not sure how to do it.
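The "list of random ids, then check both endpoints" idea can be sketched in plain Python on a toy edge list. This is only an illustration of the logic, not a Neo4j query; `induced_subgraph` and the six-node ring are hypothetical names and data.

```python
import random

def induced_subgraph(edges, node_ids, sample_size, rng):
    """Pick `sample_size` random nodes; keep only edges with both ends in the sample."""
    chosen = set(rng.sample(sorted(node_ids), sample_size))
    kept = [(a, b) for (a, b) in edges if a in chosen and b in chosen]
    return chosen, kept

# Toy graph (assumed): 6 items connected in a ring.
nodes = {1, 2, 3, 4, 5, 6}
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
chosen, kept = induced_subgraph(edges, nodes, sample_size=4, rng=random.Random(7))
print(sorted(chosen), kept)
```

Every edge in `kept` connects two sampled nodes, so nothing is cut off by a limit on paths. In Cypher, the same filter would presumably amount to collecting the sampled nodes into a list and testing both endpoints with IN.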
@cortex3oct
OK, I get it. I thought you wanted to limit the paths, not the nodes.
So I would change the query like this (that is, I limit the number of nodes that have at least one relationship with another :item before matching all paths):
MATCH (n:item)-[]-(:item)
WITH id(n) AS maxId ORDER BY maxId DESC LIMIT 1 // highest id among nodes with at least one relationship
UNWIND range(1, 10000*5) AS x // change this multiplier (5) if needed
WITH toInteger(rand() * (maxId + 1)) AS randomId // one random id from 0 to maxId per iteration
MATCH (n:item)-[]-(:item)
WHERE id(n) = randomId
WITH DISTINCT n
LIMIT 10000 // limit nodes
MATCH p=(n)-[r]-(:item)
RETURN p
Actually, the problem is trickier than it seems. Just to confirm my understanding: you expect every relationship of a node inside the subgraph to lie inside the subgraph, don't you?
In other words, there is no relationship between a node inside the subgraph and one outside?
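The condition being asked about can be checked mechanically. Here is a small Python sketch on a toy edge list (the function name `boundary_edges` and the data are assumptions for illustration): a sample is closed in this sense only if no edge has exactly one endpoint inside it.

```python
def boundary_edges(edges, sample):
    """Return edges with exactly one endpoint inside `sample`."""
    return [(a, b) for (a, b) in edges if (a in sample) != (b in sample)]

# Toy graph (assumed): a chain 1-2-3-4 plus a separate pair 5-6.
edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(boundary_edges(edges, {1, 2}))  # [(2, 3)]: node 2 is related to a node outside
print(boundary_edges(edges, {5, 6}))  # []: no relationship leaves the sample
```

A random pick of nodes from a connected graph will almost always have boundary edges, so requiring none is a much stronger condition than keeping only the relationships among the sampled nodes.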