Can this query be optimized? GDS Node similarity

I am using gds.nodeSimilarity.stream to calculate similarity between my nodes (approx. 3 million). Node A is the main one, which all other node labels connect to.
I'm using the following query for an anonymous graph:

CALL gds.nodeSimilarity.stream({nodeProjection: ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'], relationshipProjection: ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'], relationshipProperties: 'weight', relationshipWeightProperty: 'weight', similarityCutoff: 0.80, degreeCutoff: 6})
YIELD node1, node2, similarity

The machine I'm running it on is an octa-core with 16 GB of RAM available. My memory configs are the following:

dbms.memory.heap.initial_size = 5100m
dbms.memory.heap.max_size = 5100m
dbms.memory.pagecache.size = 6900m

Right now the algorithm takes hours to finish. I know the time complexity of this query is huge, but I'm not sure how to improve it.

The simplest things you can do are:

  • setting topK and topN to limit the similarity relationships being stored
  • if you're on GDS EE, set concurrency as high as you can
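As a rough sketch of those two suggestions, the original call could be extended as shown below. The topK, topN and concurrency values are illustrative assumptions, not tuned recommendations: topK caps how many similar nodes are kept per node, and topN caps the total number of pairs returned.

```cypher
// Sketch only: topK, topN and concurrency values are assumptions; tune them for your data.
CALL gds.nodeSimilarity.stream({
  nodeProjection: ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'],
  relationshipProjection: ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'],
  relationshipProperties: 'weight',
  relationshipWeightProperty: 'weight',
  similarityCutoff: 0.80,
  degreeCutoff: 6,
  topK: 5,        // keep at most 5 similar nodes per node
  topN: 100000,   // keep at most 100000 pairs overall
  concurrency: 8  // one thread per core; values above 4 require GDS EE
})
YIELD node1, node2, similarity
RETURN node1, node2, similarity
ORDER BY similarity DESC
```

On the community edition, leave concurrency out (or set it to 4 at most), since higher values are rejected there.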

For a slightly more complicated approach:

  • Load a named graph instead of an anonymous graph with
CALL gds.graph.create('my-graph', ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'],  ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'], {relationshipProperties:'weight'})
  • Run degree centrality in mutate mode: CALL gds.degree.mutate('my-graph', {mutateProperty: 'degree'})

  • Create a filtered subgraph and eliminate the highest degree nodes: CALL gds.beta.graph.create.subgraph('new-graph', 'my-graph', 'n.degree < 100', '*')

  • Run WCC on your new filtered graph: CALL gds.wcc.mutate('new-graph', {mutateProperty: 'component', minComponentSize: 5})

  • Now split that new graph into one subgraph per community:

CALL gds.graph.streamNodeProperties('new-graph', ['component'])
   YIELD propertyValue as communityId
   WITH DISTINCT communityId as communityId
   WITH communityId, 'community' + communityId as graphName, 'n.component = ' + communityId as nodeFilter
   CALL gds.beta.graph.create.subgraph(graphName, 'new-graph', nodeFilter, '*');

That will give you one subgraph per community. You can then run NodeSimilarity over each of those small, disjoint graphs much more quickly, e.g. CALL gds.nodeSimilarity.write('community1', {writeRelationshipType: 'SIMILAR', writeProperty: 'score'}), and so on for each community.
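Rather than issuing one call per community by hand, the per-community runs could be driven from the graph catalog. This is a minimal sketch that assumes every subgraph was named with the 'community' prefix used above and that no unrelated graph in the catalog shares that prefix; the 'SIMILAR' and 'score' names are the ones from the example call.

```cypher
// Sketch only: assumes all per-community subgraphs are named 'community<id>'.
CALL gds.graph.list()
YIELD graphName
WITH graphName
WHERE graphName STARTS WITH 'community'
// Run NodeSimilarity once per matching in-memory graph and write the results back.
CALL gds.nodeSimilarity.write(graphName, {
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodesCompared, relationshipsWritten
RETURN graphName, nodesCompared, relationshipsWritten
ORDER BY relationshipsWritten DESC
```

The returned counts give a quick check on which communities actually produced similarity relationships.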


Thanks for your reply! I've been playing with the queries you gave for the last two days, but I can't figure out an error it is throwing. When I try to split the graph into subgraphs for each community, it tells me that communityId is of type 'any' and can't be concatenated with a string. I've returned the value of communityId and it's just integers, without a single null value. I've tried using toString(), but the result is the same.
Also, I can't end the query with a CALL, so I added a RETURN just to finish it.

CALL gds.graph.streamNodeProperties('graph', ['component'])
   YIELD propertyValue as communityId
   WITH DISTINCT communityId as communityId
   CALL gds.beta.graph.create.subgraph(communityId, 'graph', 'n.communityId = ' + communityId, '*')
   RETURN 'ok'

The 'any' type error can be resolved by using toIntegerOrNull(communityId).
If you have a single property, gds.graph.streamNodeProperty is preferred, and you can yield information for each subgraph instead of RETURN 'ok'.

CALL gds.graph.streamNodeProperty('graph', 'component')
YIELD propertyValue AS communityId
WITH DISTINCT communityId
WITH communityId, 'community' + toIntegerOrNull(communityId) as graphName, 'n.component = ' + toIntegerOrNull(communityId) as nodeFilter
CALL gds.beta.graph.create.subgraph(graphName, 'graph', nodeFilter, '*')
YIELD graphName AS subGraphName, nodeCount, relationshipCount
RETURN subGraphName, nodeCount, relationshipCount
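Before creating the subgraphs, it may be worth checking how many components there are and how large they are, since every distinct component becomes its own in-memory graph. The sketch below reuses the 'graph' and 'component' names from the query above; the minimum size of 5 is an assumption borrowed from the minComponentSize value mentioned earlier in the thread.

```cypher
// Sketch only: the size threshold of 5 is an assumption; adjust as needed.
CALL gds.graph.streamNodeProperty('graph', 'component')
YIELD propertyValue AS communityId
// Group nodes by component and count how many nodes each one contains.
WITH toIntegerOrNull(communityId) AS component, count(*) AS size
WHERE size >= 5
RETURN component, size
ORDER BY size DESC
```

Components below the threshold could then be skipped when building the node filters, which keeps the number of subgraphs manageable.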

Hope that solves your issue.
