I am trying to run closeness centrality on 624,985 nodes and 54,191,395 edges with an algo.closeness query (which worked fine on a smaller instance with 3,566,981 edges, finishing within minutes):
The query has now been running for more than 2 hours and has not written the graph property for any nodes. What next steps could I take to check whether it will complete successfully? Is there a way to make this run with an APOC procedure?
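For reference, a whole-graph closeness call against the Graph Algorithms (algo.*) library generally has the shape below. This is only a sketch; the alias label, through_citations relationship type, and write property are assumptions borrowed from queries later in this thread:

// Sketch of a plain label/relationship-type closeness call over the whole graph.
CALL algo.closeness('alias', 'through_citations',
  {graph:'huge', direction:'BOTH', write:true,
   writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes, loadMillis, computeMillis, writeMillis;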
I also tried calling algo.memrec to see if I could check the memory requirements for running the algorithm, with no luck:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `algo.memrec`: Caused by: java.lang.IllegalArgumentException: The procedure [algo.closeness] does not support memrec or does not exist, the available and supported procedures are {beta.k1coloring, beta.modularityOptimization, beta.wcc, graph.load, labelPropagation, louvain, nodeSimilarity, pageRank, unionFind, wcc}.
This is probably not an answer. I am just sharing my experience with running graph algorithms against large graphs:
algo.memrec is not supported for all algorithms. You can see a list of the supported ones in the error message you got.
Make sure you give Neo4j a large enough heap and page cache (a quick way to check the current settings is sketched after this list).
Use the concurrency parameter if you have the Enterprise edition.
Run the algorithm against a small portion of your graph first to see how it performs.
Use community detection algorithms (or any other algorithm) to break your graph into smaller subgraphs and then run closeness separately on every subgraph.
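For the heap and page cache point above, the sketch below checks the current values from Cypher; the three setting names are the standard Neo4j 3.x memory settings, and the values themselves are changed in neo4j.conf, not via a query:

// Check the current heap and page cache settings; adjust them in neo4j.conf.
CALL dbms.listConfig() YIELD name, value
WHERE name IN ['dbms.memory.heap.initial_size',
               'dbms.memory.heap.max_size',
               'dbms.memory.pagecache.size']
RETURN name, value;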
I have now computed the connected components of the graph with the query below, and I want to try to speed up the closeness computation by running it separately for each individual component:
CALL algo.unionFind('alias', 'through_citations', {graph:'huge', seedProperty:'GraphProperty_wcc_throughCitations', write:true, writeProperty:'GraphProperty_wcc_throughCitations'})
YIELD nodes AS Nodes, setCount AS NbrOfComponents, writeProperty AS PropertyName;
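Once that property is written, looking at the component size distribution can help decide how to split the work. A sketch, grouping on the property written by the query above:

// Sketch: inspect the sizes of the connected components written above.
MATCH (n:alias)
RETURN n.GraphProperty_wcc_throughCitations AS component, count(*) AS size
ORDER BY size DESC
LIMIT 20;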
Is a Cypher projection the only way out? I do not think the query below parallelizes the computation, does it?
CALL algo.closeness('MATCH (n:alias) RETURN id(n) AS id',
'MATCH (n)-[:through_citations]-(m:alias) WHERE n.GraphProperty_wcc_throughCitations = m.GraphProperty_wcc_throughCitations RETURN id(n) AS source, id(m) AS target', {graph:'cypher', direction: 'BOTH', write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes, loadMillis, computeMillis, writeMillis;
Set an additional label on each alias with its community id, and then loop over each community label and run closeness centrality on that subset, or
Use a Cypher statement to identify all the community values, and loop over the communities using a Cypher projection. Your query is almost there, but you'll need to update the node and relationship queries:
CALL algo.closeness('MATCH (n:alias) WHERE n.GraphProperty_wcc_throughCitations = [value] RETURN id(n) AS id',
'MATCH (n)-[:through_citations]-(m:alias) WHERE n.GraphProperty_wcc_throughCitations = [value] RETURN id(n) AS source, id(m) AS target',
{graph:'cypher', direction: 'BOTH', write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
Closeness centrality is parallelized, but if you want to make the loop over the communities itself parallel, you could use something like apoc.mapParallel2.
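A sketch of what that could look like with apoc.mapParallel2, where the fragment sees the current list element as _; the component list, the partition count of 8, and the property and type names are assumptions carried over from the queries above:

// Sketch: one closeness call per component, with the loop over components
// parallelized by apoc.mapParallel2 (the fragment refers to the item as _).
MATCH (n:alias)
WITH collect(DISTINCT n.GraphProperty_wcc_throughCitations) AS components
CALL apoc.mapParallel2(
  "CALL algo.closeness(
     'MATCH (n:alias) WHERE n.GraphProperty_wcc_throughCitations = $wcc RETURN id(n) AS id',
     'MATCH (n:alias)-[:through_citations]-(m:alias)
      WHERE n.GraphProperty_wcc_throughCitations = $wcc
      RETURN id(n) AS source, id(m) AS target',
     {graph:'cypher', params:{wcc:_}, write:true,
      writeProperty:'GraphProperty_closeness_centrality_coauthors'})
   YIELD nodes RETURN nodes",
  {}, components, 8) YIELD value
RETURN count(*) AS componentsProcessed;

Note that each inner algo.closeness call is itself multi-threaded, so the partition count and the algorithm concurrency compete for the same cores.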
I would probably use a threshold and ignore any components with fewer than (for example) 5 members, just to limit the number of communities you inspect (and if they're small, the closeness will be low anyway).
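Collecting only the components above such a threshold could look like the sketch below; the property name follows the earlier unionFind query, and the cutoff of 5 is just the example value mentioned above:

// Sketch: keep only components with at least 5 members.
MATCH (n:alias)
WITH n.GraphProperty_wcc_throughCitations AS component, count(*) AS size
WHERE size >= 5
RETURN collect(component) AS components;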
MATCH (n:alias) WITH DISTINCT n.GraphProperty_wcc_coauthors as value
CALL algo.closeness('MATCH (n:alias) WHERE n.GraphProperty_wcc_coauthors = $value RETURN id(n) AS id',
'MATCH (n)-[:co_authors]-(m:alias) where m.GraphProperty_wcc_coauthors = $value RETURN id(n) AS source, id(m) AS target',
{graph:'cypher', params: {value: value}, write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes,loadMillis, computeMillis, writeMillis
RETURN nodes,loadMillis, computeMillis, writeMillis
Running this fails with the following error:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `algo.closeness`: Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
I then tried wrapping the per-component call in apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"MATCH (comp:GraphProperty_wcc_throughTopic) RETURN comp.GraphProperty_component AS component",
"CALL algo.closeness('MATCH (n:alias {GraphProperty_wcc_throughTopic : $component}) RETURN id(n) AS id',
'MATCH (n)-[r:through_topic]-(m:alias) RETURN id(n) AS source, id(m) AS target, r.weight as weight',
{graph:'cypher', params: {component: component}, write:true, writeProperty:'GraphProperty_closeness_centrality_throughTopic'})
YIELD nodes,loadMillis, computeMillis, writeMillis
RETURN nodes,loadMillis, computeMillis, writeMillis", {batchSize:5000, parallel:true})
YIELD batches, total, errorMessages;
I got the apoc.periodic.iterate query above working for a smaller instance. For my bigger instance with 624,985 nodes and 54,191,395 edges, broken down into 390,639 connected components over which closeness centrality is set to run, the query has now been running for more than 30 minutes. Do I have to switch gears and maybe try apoc.mapParallel2, as @alicia.frame suggested earlier in this thread?
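One thing that might explain the runtime: in the query above, the relationship statement inside the inner algo.closeness call is not filtered by $component, so every one of the 390,639 iterations projects the full edge set. A filtered variant could look like the sketch below (only a sketch reusing the label, type, and property names from the query above, not a tested fix):

CALL apoc.periodic.iterate(
  "MATCH (comp:GraphProperty_wcc_throughTopic) RETURN comp.GraphProperty_component AS component",
  "CALL algo.closeness(
     'MATCH (n:alias {GraphProperty_wcc_throughTopic: $component}) RETURN id(n) AS id',
     'MATCH (n:alias {GraphProperty_wcc_throughTopic: $component})-[r:through_topic]-(m:alias {GraphProperty_wcc_throughTopic: $component}) RETURN id(n) AS source, id(m) AS target, r.weight AS weight',
     {graph:'cypher', params: {component: component}, write:true,
      writeProperty:'GraphProperty_closeness_centrality_throughTopic'})
   YIELD nodes RETURN nodes",
  {batchSize:5000, parallel:true})
YIELD batches, total, errorMessages;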