Can this query be optimized? GDS Node similarity

andreperez · July 27, 2021, 3:13pm

I am using gds.nodeSimilarity.stream to calculate the similarity between my nodes (approx. 3 million). Node A is the main one, to which all the other node labels connect.
I'm using the following query on an anonymous graph:

CALL gds.nodeSimilarity.stream({nodeProjection: ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'], relationshipProjection: ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'], relationshipProperties: 'weight', relationshipWeightProperty: 'weight', similarityCutoff: 0.80, degreeCutoff: 6})
YIELD node1, node2, similarity

The machine I'm running it on is an octa-core with 16 GB of RAM available. My memory settings are the following:

dbms.memory.heap.initial_size = 5100m
dbms.memory.heap.max_size = 5100m
dbms.memory.pagecache.size = 6900m

Right now the algorithm takes hours to finish. I know the time complexity of this query is very high, but I'm not sure how to improve it.

alicia_frame1 · July 27, 2021, 5:00pm

The simplest things you can do are:

setting topK and topN to limit the similarity relationships being stored
if you're on GDS EE, set concurrency as high as you can
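To make the effect of these knobs concrete, here is a minimal Python sketch of what similarityCutoff, degreeCutoff and topK do to the result set. It is an illustration only, not how GDS implements Node Similarity: it uses plain unweighted Jaccard, compares pairs naively, leaves out topN (which caps the total number of results globally), and the function name and toy data are made up for the example.

```python
from itertools import combinations


def node_similarity(neighbors, degree_cutoff=1, similarity_cutoff=0.0, top_k=10):
    """Unweighted Jaccard similarity with three pruning knobs (illustrative only)."""
    # degreeCutoff: nodes with fewer neighbors than the cutoff are never compared
    candidates = {n: set(ns) for n, ns in neighbors.items() if len(ns) >= degree_cutoff}
    scores = {n: [] for n in candidates}
    for a, b in combinations(sorted(candidates), 2):
        shared = len(candidates[a] & candidates[b])
        if shared == 0:
            continue  # no common neighbor, similarity is 0
        sim = shared / len(candidates[a] | candidates[b])
        # similarityCutoff: drop pairs below the threshold
        if sim >= similarity_cutoff:
            scores[a].append((b, sim))
            scores[b].append((a, sim))
    # topK: keep only the K most similar partners per node
    return {
        n: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_k]
        for n, pairs in scores.items()
    }


# Hypothetical toy data: node -> neighbor ids
toy = {"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 4], "d": [9]}
result = node_similarity(toy, degree_cutoff=2, similarity_cutoff=0.5, top_k=1)
```

With top_k=1 each node keeps a single partner, so fewer relationships have to be held in memory or written back.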

For a slightly more complicated approach:

Load a named graph instead of an anonymous graph with

CALL gds.graph.create('my-graph', ['LabelA', 'LabelB', 'LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG', 'LabelH'],  ['rel-a-b', 'rel-a-c', 'rel-a-d', 'rel-a-e', 'rel-a-f', 'rel-a-g'], {relationshipProperties:'weight'})

Run degree centrality in mutate mode: CALL gds.degree.mutate('my-graph', {mutateProperty: 'degree'})
Create a filtered subgraph that eliminates the highest-degree nodes: CALL gds.beta.graph.create.subgraph('new-graph', 'my-graph', 'n.degree < 100', '*')
Run WCC on the new filtered graph: CALL gds.wcc.mutate('new-graph', {mutateProperty: 'component', minComponentSize: 5})
Now split that new graph into one subgraph per community:

CALL gds.graph.streamNodeProperties('graph', ['communityId'])
   YIELD propertyValue as communityId
   WITH DISTINCT communityId as communityId
   WITH communityId, 'community' + communityId as graphName, 'n.communityId = ' + communityId as nodeFilter
   CALL gds.beta.graph.create.subgraph(graphName, 'graph', nodeFilter, '*');

That will give you one subgraph per community. You can then run NodeSimilarity over each of those small, disjoint graphs much more quickly with a simple CALL gds.nodeSimilarity.write('community1', {writeRelationshipType: 'SIMILAR', writeProperty: 'score'}), and so on for each subgraph.
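A rough back-of-the-envelope calculation shows why splitting helps. The Python sketch below counts candidate node pairs as an upper bound; the real algorithm only compares nodes that share neighbors, so actual numbers will be lower. The component sizes (3,000 components of 1,000 nodes each) are purely hypothetical and only chosen to add up to roughly 3 million nodes.

```python
def pair_count(n):
    """Number of unordered node pairs among n nodes."""
    return n * (n - 1) // 2


# One big graph with ~3 million nodes (figure from the question)
whole_graph = pair_count(3_000_000)

# Hypothetical split: 3,000 disjoint components of 1,000 nodes each
components = [1_000] * 3_000
split = sum(pair_count(size) for size in components)

print(f"whole graph: {whole_graph:,} candidate pairs")
print(f"split graph: {split:,} candidate pairs")
```

Pairs that span two components never need to be considered, since disconnected nodes cannot share a neighbor.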

andreperez · July 29, 2021, 12:22pm

Thanks for your reply! I've been playing with the queries you gave for the last two days, but I can't figure out an error they throw. When I try to split the graph into subgraphs for each community, it tells me that communityId is of type 'any' and can't be concatenated with a string. I've returned the values of communityId and they are all integers, with no null values. I've tried using toString(), but the result is the same.
Also, I can't end the query with a CALL, so I added a RETURN just to finish it.

CALL gds.graph.streamNodeProperties('graph', ['component'])
   YIELD propertyValue as communityId
   WITH DISTINCT communityId as communityId
   CALL gds.beta.graph.create.subgraph(communityId, 'graph', 'n.communityId = ' + communityId, '*')
   RETURN 'ok'

florentin_dorre · August 3, 2021, 11:30am

The 'any' type error can be resolved by using toIntegerOrNull(communityId).
If you have a single property, gds.graph.streamNodeProperty is preferred, and you can yield information for each subgraph instead of RETURN 'ok'.

CALL gds.graph.streamNodeProperty('graph', 'component')
YIELD propertyValue AS communityId
WITH DISTINCT communityId
WITH communityId, 'community' + toIntegerOrNull(communityId) as graphName, 'n.component = ' + toIntegerOrNull(communityId) as nodeFilter
CALL gds.beta.graph.create.subgraph(graphName, 'graph', nodeFilter, '*')
YIELD graphName AS subGraphName, nodeCount, relationshipCount
RETURN subGraphName, nodeCount, relationshipCount

Hope that solves your issue
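Once the subgraphs exist, the per-community Node Similarity runs could be scripted rather than typed by hand. The Python sketch below only builds the Cypher strings; the helper names are hypothetical, the 'community' + id naming follows the queries in this thread, and actually sending each string to the database (for example through a driver session) is left out as an assumption about your setup.

```python
def similarity_write_query(graph_name, relationship_type="SIMILAR", write_property="score"):
    """Build a gds.nodeSimilarity.write call for one named subgraph."""
    return (
        f"CALL gds.nodeSimilarity.write('{graph_name}', "
        f"{{writeRelationshipType: '{relationship_type}', writeProperty: '{write_property}'}})"
    )


def queries_for_communities(community_ids):
    """One query per community; int() guards against non-integer ids in the graph name."""
    return [similarity_write_query(f"community{int(cid)}") for cid in community_ids]


# Hypothetical community ids, as returned by the WCC 'component' property
queries = queries_for_communities([0, 1, 7])
for query in queries:
    print(query)
```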
