I am trying to run closeness centrality on 624,985 nodes and 54,191,395 edges with an algo.closeness query (which worked fine on a smaller instance with 3,566,981 edges, finishing within minutes):
The query has now been running for more than 2 hours and has not written the graph property for any nodes. What next steps could I take to check whether it will complete successfully? Is there a way to make this run with an APOC procedure?
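For reference, a whole-graph closeness call against the Graph Algorithms (algo.*) library generally has the shape below. This is only a sketch; the alias label, through_citations relationship type, and write property are assumptions borrowed from queries later in this thread:

// Sketch of a plain label/relationship-type closeness call over the whole graph.
CALL algo.closeness('alias', 'through_citations',
  {graph:'huge', direction:'BOTH', write:true,
   writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes, loadMillis, computeMillis, writeMillis;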
I also tried calling algo.memrec to see if I could check the memory requirements for running the algorithm, with no luck:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `algo.memrec`: Caused by: java.lang.IllegalArgumentException: The procedure [algo.closeness] does not support memrec or does not exist, the available and supported procedures are {beta.k1coloring, beta.modularityOptimization, beta.wcc, graph.load, labelPropagation, louvain, nodeSimilarity, pageRank, unionFind, wcc}.
This is probably not an answer. I am just sharing my experience with running graph algorithms against large graphs:
algo.memrec is not supported for all algorithms. You can see a list of the supported ones in the error message you got.
Make sure you give Neo4j a large enough heap and page cache (a quick way to check the current settings is sketched after this list).
Use the concurrency parameter if you have the Enterprise edition.
Run the algorithm against a small portion of your graph first to see how it performs.
Use community detection algorithms (or any other algorithm) to break your graph into smaller subgraphs and then run closeness separately on every subgraph.
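For the heap and page cache point above, the sketch below checks the current values from Cypher; the three setting names are the standard Neo4j 3.x memory settings, and the values themselves are changed in neo4j.conf, not via a query:

// Check the current heap and page cache settings; adjust them in neo4j.conf.
CALL dbms.listConfig() YIELD name, value
WHERE name IN ['dbms.memory.heap.initial_size',
               'dbms.memory.heap.max_size',
               'dbms.memory.pagecache.size']
RETURN name, value;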
I have now computed the connected components of the graph with the query below, and I want to try to speed up the closeness computation by running it separately for each individual component:
CALL algo.unionFind('alias', 'through_citations', {graph:'huge', seedProperty:'GraphProperty_wcc_throughCitations', write:true, writeProperty:'GraphProperty_wcc_throughCitations'})
YIELD nodes AS Nodes, setCount AS NbrOfComponents, writeProperty AS PropertyName;
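Once that property is written, looking at the component size distribution can help decide how to split the work. A sketch, grouping on the property written by the query above:

// Sketch: inspect the sizes of the connected components written above.
MATCH (n:alias)
RETURN n.GraphProperty_wcc_throughCitations AS component, count(*) AS size
ORDER BY size DESC
LIMIT 20;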
Is a Cypher projection the only way out? I do not think the query below parallelizes the computation, does it?
CALL algo.closeness('MATCH (n:alias) RETURN id(n) AS id',
'MATCH (n)-[:through_citations]-(m:alias) WHERE n.GraphProperty_wcc_throughCitations = m.GraphProperty_wcc_throughCitations RETURN id(n) AS source, id(m) AS target', {graph:'cypher', direction: 'BOTH', write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes, loadMillis, computeMillis, writeMillis;
Set an additional label on each alias with its community id, and then loop over each community label and run closeness centrality on that subset, or
Use a Cypher statement to identify all the community values, and loop over the communities using a Cypher projection. Your query is almost there, but you'll need to update the node and relationship queries:
CALL algo.closeness('MATCH (n:alias) WHERE n.GraphProperty_wcc_throughCitations = [value] RETURN id(n) AS id',
'MATCH (n)-[:through_citations]-(m:alias) WHERE n.GraphProperty_wcc_throughCitations = [value] RETURN id(n) AS source, id(m) AS target',
{graph:'cypher', direction: 'BOTH', write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
Closeness centrality is parallelized, but if you want to make the loop over the communities itself parallel, you could use something like apoc.mapParallel2.
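A sketch of what that could look like with apoc.mapParallel2, where the fragment sees the current list element as _; the component list, the partition count of 8, and the property and type names are assumptions carried over from the queries above:

// Sketch: one closeness call per component, with the loop over components
// parallelized by apoc.mapParallel2 (the fragment refers to the item as _).
MATCH (n:alias)
WITH collect(DISTINCT n.GraphProperty_wcc_throughCitations) AS components
CALL apoc.mapParallel2(
  "CALL algo.closeness(
     'MATCH (n:alias) WHERE n.GraphProperty_wcc_throughCitations = $wcc RETURN id(n) AS id',
     'MATCH (n:alias)-[:through_citations]-(m:alias)
      WHERE n.GraphProperty_wcc_throughCitations = $wcc
      RETURN id(n) AS source, id(m) AS target',
     {graph:'cypher', params:{wcc:_}, write:true,
      writeProperty:'GraphProperty_closeness_centrality_coauthors'})
   YIELD nodes RETURN nodes",
  {}, components, 8) YIELD value
RETURN count(*) AS componentsProcessed;

Note that each inner algo.closeness call is itself multi-threaded, so the partition count and the algorithm concurrency compete for the same cores.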
I would probably use a threshold and ignore any components with fewer than (for example) 5 members, just to limit the number of communities you inspect (and if they're small, the closeness will be low anyway).
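Collecting only the components above such a threshold could look like the sketch below; the property name follows the earlier unionFind query, and the cutoff of 5 is just the example value mentioned above:

// Sketch: keep only components with at least 5 members.
MATCH (n:alias)
WITH n.GraphProperty_wcc_throughCitations AS component, count(*) AS size
WHERE size >= 5
RETURN collect(component) AS components;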
MATCH (n:alias) WITH DISTINCT n.GraphProperty_wcc_coauthors as value
CALL algo.closeness('MATCH (n:alias) WHERE n.GraphProperty_wcc_coauthors = $value RETURN id(n) AS id',
'MATCH (n)-[:co_authors]-(m:alias) where m.GraphProperty_wcc_coauthors = $value RETURN id(n) AS source, id(m) AS target',
{graph:'cypher', params: {value: value}, write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes,loadMillis, computeMillis, writeMillis
RETURN nodes,loadMillis, computeMillis, writeMillis
Running this fails with the following error:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `algo.closeness`: Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
I then tried wrapping the per-component call in apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"MATCH (comp:GraphProperty_wcc_throughTopic) RETURN comp.GraphProperty_component AS component",
"CALL algo.closeness('MATCH (n:alias {GraphProperty_wcc_throughTopic : $component}) RETURN id(n) AS id',
'MATCH (n)-[r:through_topic]-(m:alias) RETURN id(n) AS source, id(m) AS target, r.weight as weight',
{graph:'cypher', params: {component: component}, write:true, writeProperty:'GraphProperty_closeness_centrality_throughTopic'})
YIELD nodes,loadMillis, computeMillis, writeMillis
RETURN nodes,loadMillis, computeMillis, writeMillis", {batchSize:5000, parallel:true})
YIELD batches, total, errorMessages;
I got the apoc.periodic.iterate query above working for a smaller instance. For my bigger instance with 624,985 nodes and 54,191,395 edges, broken down into 390,639 connected components over which closeness centrality is set to run, the query has now been running for more than 30 minutes. Do I have to switch gears and maybe try apoc.mapParallel2, as @alicia.frame suggested earlier in this thread?
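One thing that might explain the runtime: in the query above, the relationship statement inside the inner algo.closeness call is not filtered by $component, so every one of the 390,639 iterations projects the full edge set. A filtered variant could look like the sketch below (only a sketch reusing the label, type, and property names from the query above, not a tested fix):

CALL apoc.periodic.iterate(
  "MATCH (comp:GraphProperty_wcc_throughTopic) RETURN comp.GraphProperty_component AS component",
  "CALL algo.closeness(
     'MATCH (n:alias {GraphProperty_wcc_throughTopic: $component}) RETURN id(n) AS id',
     'MATCH (n:alias {GraphProperty_wcc_throughTopic: $component})-[r:through_topic]-(m:alias {GraphProperty_wcc_throughTopic: $component}) RETURN id(n) AS source, id(m) AS target, r.weight AS weight',
     {graph:'cypher', params: {component: component}, write:true,
      writeProperty:'GraphProperty_closeness_centrality_throughTopic'})
   YIELD nodes RETURN nodes",
  {batchSize:5000, parallel:true})
YIELD batches, total, errorMessages;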