Getting a subgraph from a big graph

Hi folks,

I am attempting to get a subgraph and its graph data (as '.txt' or other formats) from a big graph.

  • Approach 1:

Randomly sample all node types from the large graph:

MATCH (source: Node)-[r*..]-(target: Node)
WHERE source.name<>target.name
WITH source, target
SKIP 10
LIMIT 1+rand(10)
RETURN *

I couldn't get this to work because the estimated rows are large, and the connection times out frequently while streaming.
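One likely reason the query above is expensive is that `[r*..]` puts no upper bound on the path length, so the number of candidate paths can grow very quickly. Note also that Cypher's `rand()` takes no arguments. A minimal sketch of a cheaper alternative, assuming APOC is installed: pick a few random seed nodes first, then expand only a bounded neighbourhood around each one. The seed count (100) and depth (2) are illustrative assumptions, not recommendations.

```cypher
// Sketch only: seed count and depth are illustrative assumptions.
MATCH (n:Node)
WITH n, rand() AS r        // rand() takes no arguments and returns a float in [0, 1)
ORDER BY r
LIMIT 100                  // keep 100 random seed nodes
CALL apoc.path.subgraphAll(n, {maxLevel: 2})
YIELD nodes, relationships
RETURN n.name AS seed, size(nodes) AS nodeCount, size(relationships) AS relCount
```

Ordering by a random value still scans every `:Node`, so on a very large graph the seed selection itself may need a cheaper filter.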

  • Approach 2:

Get the n-hop relationships between two kinds of nodes, then extract the path data (including the source and target nodes, the relationships, and node data such as the node degree and node type). For path length 1, I have tried:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)<2
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

For path length 2:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>1 AND length(P)<3
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

Then for path length 3:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>2 
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

This yields about a million rows; however, I would like to sample the paths such that, for a three-hop subgraph, I get 3000 rows in total, containing:

  • 1000 rows of 1 hop connections (randomly sampled, or the top or bottom rows)

  • 1000 rows of 2 hop connections

  • 1000 rows of 3 hop connections

    source

    source type

    relationship

    target

    target type

    PathLength

Any help will be greatly appreciated.

Please explain a little more about your data model. Does 'Node' have a 'name' property besides 'type'? Are you expecting thousands of nodes at each level? If so, one source node would be connected to thousands of target nodes at level 1. I am trying to understand your model so that I can offer some solutions.

For instance:

I have a network created with:

CREATE (a:Node {name: 'mola', type: 'Molecule'})
CREATE (g:Node {name: 'molg', type: 'Molecule'})
CREATE (b:Node {name: 'drgb', type: 'Drug'})
CREATE (h:Node {name: 'drgh', type: 'Drug'})
CREATE (c:Node {name: 'mola', type: 'Disease'})
CREATE (i:Node {name: 'disi', type: 'Disease'})
CREATE (j:Node {name: 'disj', type: 'Disease'})
CREATE (m:Node {name: 'dism', type: 'Disease'})
CREATE (d:Node {name: 'chemd', type: 'Chemical'})
CREATE (k:Node {name: 'chemk', type: 'Chemical'})
CREATE (e:Node {name: 'genee', type: 'Gene'})
CREATE (l:Node {name: 'genel', type: 'Gene'})
CREATE (f:Node {name: 'mola', type: 'DNA'})
MERGE (a)-[:REL {r: 'subclass_of'}]->(b)
MERGE (a)-[:REL {r: 'cure'}]->(c)
MERGE (a)-[:REL {r: 'inhibits'}]->(d)
MERGE (b)-[:REL {r: 'heals'}]->(d)
MERGE (c)-[:REL {r: 'causes'}]->(d)
MERGE (c)-[:REL {r: 'expands'}]->(e)
MERGE (d)-[:REL {r: 'kills'}]->(e)
MERGE (d)-[:REL {r: 'involved_in'}]->(f)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (f)-[:REL {r: 'kills'}]->(l)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (l)-[:REL {r: 'kills'}]->(l)
MERGE (m)-[:REL {r: 'heals'}]->(i)
MERGE (a)-[:REL {r: 'part_of'}]->(e)
MERGE (c)-[:REL {r: 'expands'}]->(m)
MERGE (e)-[:REL {r: 'interacts_with'}]->(f)

Using

MATCH (source), (target)
WHERE source <> 'None' AND target <> 'None' AND source < target
CALL apoc.algo.allSimplePaths(source, target, '', 4)
YIELD path AS P
RETURN P, length(P)

I got:

P, length(P)

(mola)-[:REL {r: 'subclass_of'}]->(drgb),1

(mola)-[:REL {r: 'inhibits'}]->(chemd),1

(drgb)-[:REL {r: 'heals'}]->(disi),1

(chemd)-[:REL {r: 'kills'}]->(genee),1

(chemd)-[:REL {r: 'involved_in'}]->(mola),1

(disi)<-[:REL {r: 'heals'}]-(dism),1

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'heals'}]-(drgb),2

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),2

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(disi),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),2

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola)-[:REL {r: 'kills'}]->(genel),3

(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'kills'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'causes'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(disi)<-[:REL {r: 'heals'}]-(dism)<-[:REL {r: 'expands'}]-(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'kills'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'inhibits'}]-(mola)-[:REL {r: 'cure'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),3

(mola)<-[:REL {r: 'cure'}]-(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),4

(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'expands'}]->(dism),4

(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),4

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),4

My Question:

How can I randomly return only a subset of the paths that is representative of all path lengths? E.g.:

(mola)-[:REL {r: 'inhibits'}]->(chemd),1

(drgb)-[:REL {r: 'heals'}]->(disi),1

(chemd)-[:REL {r: 'kills'}]->(genee),1

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),2

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(disi),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'causes'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(disi)<-[:REL {r: 'heals'}]-(dism)<-[:REL {r: 'expands'}]-(mola),3

(mola)<-[:REL {r: 'cure'}]-(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),4

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),4

Thanks for your help.

Try this and check the numbers:

MATCH (source:Node {type: 'typeA'})
CALL apoc.path.spanningTree(source, {maxLevel: 1}) YIELD path
WITH DISTINCT length(path) AS lvl, nodes(path) AS n1, relationships(path) AS rel
UNWIND n1 AS n2
UNWIND rel AS rels
RETURN lvl, count(DISTINCT n2) AS nodeCnt, count(DISTINCT type(rels)) AS relCnt

Thanks for sharing the info. The solution is not straightforward, and I am working on it. Hopefully I can send you the first steps of a solution by this weekend. Note that the path level 2 results contain the nodes from levels 1 and 2, and so on.

I used your sample data and ran this query:

MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 4) YIELD path
WITH relationships(path) AS rel, nodes(path) AS n1, length(path) AS lvl
UNWIND n1 AS n2
UNWIND rel AS rels
WITH lvl, collect(DISTINCT n2.type) AS lbl, collect(DISTINCT id(n2)) AS ids, collect(DISTINCT rels.r) AS r1
RETURN lvl, lbl, ids, r1, size(ids) AS cnt
ORDER BY lvl


Please run the above query in your database. If there is too much data, run it for levels 1 and 2 only and let me know the node counts. Based on the node counts, we can try some methods to extract a subset of nodes from each level. This is not going to be a direct process and may involve several steps.
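If only the volume per level is needed, counting paths by length may be a lighter first check than collecting node ids. A minimal sketch, assuming the same sample labels and properties as above and capping the search at 2 to keep it cheap:

```cypher
// Sketch only: counts how many simple paths exist at each length.
MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 2) YIELD path
RETURN length(path) AS lvl, count(*) AS pathCount
ORDER BY lvl
```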

I deeply appreciate your help, maybe a few more lines here could clarify my issues:

Say I have allsimplepaths(A, B, '', 3) that look like this:

  • [A –>relation1 –>B]
  • [A –>relation2 –>B]
  • [A –>relation2->C->relation 1–>B]
  • [A –>relation5->D->relation 3–>B]
  • [A –>relation2->Y->relation 1–>B]
  • [A –>relation2->E->relation 1–>B]
  • [A –>relation2->D–>relation2->F->relation 1–>B]
  • [A –>relation2->F–>relation4->Y->relation 2–>B]

Desired result: FOREACH pathlength, randomly return 1 row

  • [A –>relation2 –>B]
  • [A –>relation2->Y->relation 1–>B]
  • [A –>relation2->D–>relation2->F->relation 1–>B]

The result is representative of all pathlengths:

The first row:

  • [A –>relation2 –>B] is a sample from path length 1

The second row:

  • [A –>relation2->Y->relation 1–>B] is a sample from path length 2

The third row:

  • [A –>relation2->D–>relation2->F->relation 1–>B] is a sample from path length 3
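The "one random row per path length" idea described above could be sketched as grouping the paths by length and picking one at random from each group. This assumes APOC's `apoc.coll.randomItem` is available and reuses the sample data's 'Molecule' and 'Gene' types as stand-ins for A and B; it is a sketch, not a tuned solution.

```cypher
// Sketch only: one randomly chosen path for each path length.
MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
WITH length(path) AS lvl, collect(path) AS paths   // one group of paths per length
RETURN lvl, apoc.coll.randomItem(paths) AS samplePath
ORDER BY lvl
```

Because `collect(path)` holds every path of a given length in memory, this may not scale to millions of paths without first narrowing the source or target set.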

This code exports the results as JSON. To select random rows for each level, export the data for each level separately, sample the rows you need from each level, and then combine the samples.

MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
WITH relationships(path) AS rels, nodes(path) AS n1, length(path) AS lvl
WITH lvl, collect(DISTINCT n1) AS n2, collect(DISTINCT rels) AS r2
WITH apoc.coll.toSet(apoc.coll.flatten(n2)) AS n12, apoc.coll.toSet(apoc.coll.flatten(r2)) AS r12, lvl
WITH n12 AS nodes, r12 AS relationships, lvl
WITH lvl,
     [node IN nodes | node {.*, label: labels(node)[0], id: toString(id(node))}] AS nodes,
     [rel IN relationships | rel {.*, fromNode: {label: labels(startNode(rel))[0], id: toString(id(startNode(rel)))}, type: type(rel), toNode: {label: labels(endNode(rel))[0], id: toString(id(endNode(rel)))}}] AS rels
WITH lvl, collect(DISTINCT rels) AS Allrels, collect(DISTINCT nodes) AS AllNodes
ORDER BY lvl
WITH {nodes: AllNodes, relationships: Allrels, level: lvl} AS json
RETURN apoc.convert.toJson(json)
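As an alternative to exporting each level and sampling outside the database, the sampling and the tabular layout requested earlier in the thread (source, source type, relationship, target, target type, path length) might be combined in one query. This is a sketch under assumptions: APOC's `apoc.coll.randomItems` is available, one row per path is wanted, and the relationship column is taken to be the list of `r` values along the path. The sample size of 1000 comes from the question.

```cypher
// Sketch only: up to 1000 randomly chosen paths per length, one row per path.
MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
WITH length(path) AS lvl, collect(path) AS paths
UNWIND apoc.coll.randomItems(paths, 1000) AS p     // no repeats by default
RETURN nodes(p)[0].name  AS source,
       nodes(p)[0].type  AS sourceType,
       [rel IN relationships(p) | rel.r] AS relationship,   // r values in path order
       nodes(p)[-1].name AS target,
       nodes(p)[-1].type AS targetType,
       lvl AS pathLength
ORDER BY pathLength
```

If a level has fewer than 1000 paths, the sample for that level is simply smaller. As with any `collect(path)`, all paths of a level are held in memory before sampling, so restricting the source or target set first may be necessary on the full graph.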