I am using gds.nodeSimilarity.stream to calculate similarity between my nodes (approximately 3 million). LabelA is the main label; all other node labels connect to it.
I'm using the following query for an anonymous graph:
CALL gds.nodeSimilarity.stream({
  nodeProjection: ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'],
  relationshipProjection: ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'],
  relationshipProperties: 'weight',
  relationshipWeightProperty: 'weight',
  similarityCutoff: 0.80,
  degreeCutoff: 6
})
YIELD node1, node2, similarity
The machine I'm running it on is an octa-core with 16 GB of RAM available. My memory configs are the following:
dbms.memory.heap.initial_size = 5100m
dbms.memory.heap.max_size = 5100m
dbms.memory.pagecache.size = 6900m
Right now the algorithm takes hours to finish. I know the time complexity of this query is absurdly high, but I'm not sure how to make it better.
The simplest things you can do are:
- setting topK and topN to limit the number of similarity relationships being computed and stored
- if you're on GDS EE, setting concurrency as high as you can (a sketch combining these follows this list)
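For example, here is a minimal sketch of the same anonymous-graph call with those settings added; the topK, topN and concurrency values are placeholders to tune, not recommendations:
CALL gds.nodeSimilarity.stream({
  nodeProjection: ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'],
  relationshipProjection: ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'],
  relationshipProperties: 'weight',
  relationshipWeightProperty: 'weight',
  similarityCutoff: 0.80,
  degreeCutoff: 6,
  topK: 10,        // keep only the 10 most similar neighbours per node
  topN: 100000,    // cap the total number of result rows
  concurrency: 8   // GDS EE only; Community Edition is limited to 4
})
YIELD node1, node2, similarity
RETURN node1, node2, similarity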
For a slightly more complicated approach:
- Load a named graph instead of an anonymous graph with
CALL gds.graph.create('my-graph', ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'], ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'], {relationshipProperties:'weight'})
- Run degree centrality in mutate mode:
CALL gds.degree.mutate('my-graph', {mutateProperty: 'degree'})
- Create a filtered subgraph and eliminate the highest-degree nodes:
CALL gds.beta.graph.create.subgraph('new-graph', 'my-graph', 'n.degree < 100', '*')
- Run WCC on your new filtered graph:
CALL gds.wcc.mutate('new-graph', {mutateProperty: 'component', minComponentSize: 5})
- Now split that new graph into one subgraph per community:
CALL gds.graph.streamNodeProperties('new-graph', ['component'])
YIELD propertyValue AS communityId
WITH DISTINCT communityId
WITH communityId, 'community' + communityId AS graphName, 'n.component = ' + communityId AS nodeFilter
CALL gds.beta.graph.create.subgraph(graphName, 'new-graph', nodeFilter, '*');
That will give you one subgraph per community. You can run NodeSimilarity over each of those small, disjoint graphs much more quickly with a simple
CALL gds.nodeSimilarity.write('community1', {writeRelationshipType: 'SIMILAR', writeProperty: 'score'})
and so on for each community graph (a sketch of looping over all of them follows).
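To avoid repeating that call by hand for every community, something like the following should also work; it is only a sketch that assumes every subgraph was created with the 'community' name prefix used above:
CALL gds.graph.list()
YIELD graphName
WITH graphName
WHERE graphName STARTS WITH 'community'   // only the per-community subgraphs
CALL gds.nodeSimilarity.write(graphName, {writeRelationshipType: 'SIMILAR', writeProperty: 'score'})
YIELD nodesCompared, relationshipsWritten
RETURN graphName, nodesCompared, relationshipsWritten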
Thanks for your reply! I've been playing with the queries you gave for the last two days, but I can't figure out an error they're throwing. When I try to split the graph into subgraphs for each community, it tells me that communityId is of type 'any' and can't be concatenated with a string. I've returned the value of communityId and it's just integers, not a single null value. I've tried using toString() but the result is the same.
Also, I can't finish the query with a CALL, so I put a RETURN at the end just to finish it.
CALL gds.graph.streamNodeProperties('graph', ['component'])
YIELD propertyValue as communityId
WITH DISTINCT communityId as communityId
CALL gds.beta.graph.create.subgraph(communityId, 'graph', 'n.communityId = ' + communityId, '*')
RETURN 'ok'
The any type error can be resolved by using toIntegerOrNull(communityId).
If you have a single property, gds.graph.streamNodeProperty is preferred, and you can yield information for each subgraph instead of RETURN 'ok'.
CALL gds.graph.streamNodeProperty('graph', 'component')
YIELD propertyValue AS communityId
WITH DISTINCT communityId
WITH communityId, 'community' + toIntegerOrNull(communityId) as graphName, 'n.component = ' + toIntegerOrNull(communityId) as nodeFilter
CALL gds.beta.graph.create.subgraph(graphName, 'graph', nodeFilter, '*')
YIELD graphName AS subGraphName, nodeCount, relationshipCount
RETURN subGraphName, nodeCount, relationshipCount
Hope that solves your issue