Getting a subgraph from a big graph

wumirose · June 29, 2022, 6:19pm

Hi folks,

I am attempting to extract a subgraph and its graph data (as '.txt' or another format) from a big graph.

Approach 1:

Randomly sample all node types from the large graph:

MATCH (source: Node)-[r*..]-(target: Node)
WHERE source.name<>target.name
WITH source, target
SKIP 10
LIMIT 1 + toInteger(10 * rand())
RETURN *

I couldn't get this to work because the estimated rows are large, and the connection times out frequently while streaming.

Approach 2:

Get some n-hop relationships between two kinds of nodes, then extract the path data (including the source and target nodes, the relationships, and node data such as the node degree and node type). I have tried the following, first for path length 1:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)<2
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

For path length 2:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>1 AND length(P)<3
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

Then for path length 3:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>2 
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

This yields about a million rows; however, I would like to sample the paths so that, for a three-hop subgraph, I get 3000 rows in total:

1000 rows of 1-hop connections (randomly sampled, or the top or bottom rows)
1000 rows of 2-hop connections
1000 rows of 3-hop connections

Columns: source | source type | relationship | target | target type | PathLength

Any help will be greatly appreciated.
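One possible way to get that shape of result is to run one bounded query per path length and cap each with its own LIMIT, combined with UNION ALL. This is only a sketch under assumptions: it uses plain variable-length patterns instead of apoc.algo.allSimplePaths (so a path never repeats a relationship but may revisit a node), it ignores relationship direction, it returns the relationship types as a list, and the column aliases are made up for illustration.

```cypher
// 1-hop rows: the first 1000 found between the two node types
MATCH p = (source:Node {type: 'typeA'})-[*1]-(target:Node {type: 'typeB'})
WHERE source.name <> target.name
RETURN source.name AS source, source.type AS sourceType,
       [rel IN relationships(p) | type(rel)] AS relationship,
       target.name AS target, target.type AS targetType,
       length(p) AS pathLength
LIMIT 1000
UNION ALL
// 2-hop rows: same columns, separate cap
MATCH p = (source:Node {type: 'typeA'})-[*2]-(target:Node {type: 'typeB'})
WHERE source.name <> target.name
RETURN source.name AS source, source.type AS sourceType,
       [rel IN relationships(p) | type(rel)] AS relationship,
       target.name AS target, target.type AS targetType,
       length(p) AS pathLength
LIMIT 1000
UNION ALL
// 3-hop rows: same columns, separate cap
MATCH p = (source:Node {type: 'typeA'})-[*3]-(target:Node {type: 'typeB'})
WHERE source.name <> target.name
RETURN source.name AS source, source.type AS sourceType,
       [rel IN relationships(p) | type(rel)] AS relationship,
       target.name AS target, target.type AS targetType,
       length(p) AS pathLength
LIMIT 1000
```

Each LIMIT applies only to its own branch, so the result holds at most 1000 rows per path length. As written, the rows are simply the first ones the database finds; adding ORDER BY rand() before each LIMIT would make the pick random, but it forces every matching path to be produced first, which may be too slow on a graph of this size.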

ameyasoft · June 30, 2022, 12:01am

Please explain a little more about your data model. Does 'Node' have a 'name' property besides 'type'? Are you expecting thousands of nodes at each level? If so, one source node is connected to thousands of target nodes at level 1. I am trying to understand your model so that I can offer some solutions.

wumirose · June 30, 2022, 11:22am

For instance:

I have a network created from:

CREATE (a:Node {name: 'mola', type: 'Molecule'})
CREATE (g:Node {name: 'molg', type: 'Molecule'})
CREATE (b:Node {name: 'drgb', type: 'Drug'})
CREATE (h:Node {name: 'drgh', type: 'Drug'})
CREATE (c:Node {name: 'mola', type: 'Disease'})
CREATE (i:Node {name: 'disi', type: 'Disease'})
CREATE (j:Node {name: 'disj', type: 'Disease'})
CREATE (m:Node {name: 'dism', type: 'Disease'})
CREATE (d:Node {name: 'chemd', type: 'Chemical'})
CREATE (k:Node {name: 'chemk', type: 'Chemical'})
CREATE (e:Node {name: 'genee', type: 'Gene'})
CREATE (l:Node {name: 'genel', type: 'Gene'})
CREATE (f:Node {name: 'mola', type: 'DNA'})
MERGE (a)-[:REL {r: 'subclass_of'}]->(b)
MERGE (a)-[:REL {r: 'cure'}]->(c)
MERGE (a)-[:REL {r: 'inhibits'}]->(d)
MERGE (b)-[:REL {r: 'heals'}]->(d)
MERGE (c)-[:REL {r: 'causes'}]->(d)
MERGE (c)-[:REL {r: 'expands'}]->(e)
MERGE (d)-[:REL {r: 'kills'}]->(e)
MERGE (d)-[:REL {r: 'involved_in'}]->(f)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (f)-[:REL {r: 'kills'}]->(l)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (l)-[:REL {r: 'kills'}]->(l)
MERGE (m)-[:REL {r: 'heals'}]->(i)
MERGE (a)-[:REL {r: 'part_of'}]->(e)
MERGE (c)-[:REL {r: 'expands'}]->(m)
MERGE (e)-[:REL {r: 'interacts_with'}]->(f)

Using

MATCH (source), (target)
WHERE source <> 'None' AND target <> 'None' AND source < target
CALL apoc.algo.allSimplePaths(source, target, '', 4)
YIELD path AS P
RETURN P, length(P)

I got:

P, length(P)

(mola)-[:REL {r: 'subclass_of'}]->(drgb),1

(mola)-[:REL {r: 'inhibits'}]->(chemd),1

(drgb)-[:REL {r: 'heals'}]->(disi),1

(chemd)-[:REL {r: 'kills'}]->(genee),1

(chemd)-[:REL {r: 'involved_in'}]->(mola),1

(disi)<-[:REL {r: 'heals'}]-(dism),1

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'heals'}]-(drgb),2

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),2

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(disi),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),2

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola)-[:REL {r: 'kills'}]->(genel),3

(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'kills'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'causes'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(disi)<-[:REL {r: 'heals'}]-(dism)<-[:REL {r: 'expands'}]-(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'kills'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'inhibits'}]-(mola)-[:REL {r: 'cure'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),3

(mola)<-[:REL {r: 'cure'}]-(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),4

(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'expands'}]->(dism),4

(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),4

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),4

My Question:

How can I return only a random subset of these paths that is representative of all path lengths? E.g.