Best Way to Parallelize Graph Clustering Algorithms

Hello,

I'm trying to speed up a large number of Leiden clustering queries that I want to run on a knowledge graph.

Basically, I want to be able to project a subgraph and stream the cluster assignments in parallel. I'm not writing to the graph, so it seems like this should be possible.

CALL gds.leiden.stream()
YIELD nodeId, communityId
RETURN nodeId, communityId

I tried CYPHER runtime = parallel and it appears to be faster, but I'm not seeing full utilization of the available CPUs. I've also seen IN CONCURRENT TRANSACTIONS:

WITH %s AS names
UNWIND names AS name
CALL (name) {
CALL gds.leiden.stream(name)
YIELD nodeId, communityId
RETURN nodeId, communityId
} IN 5 CONCURRENT TRANSACTIONS
RETURN *

But this fails with an error saying that IN TRANSACTIONS does not work in explicit transactions. Is there something I am missing, or is it not possible to run Leiden with IN CONCURRENT TRANSACTIONS?

What is the recommended way to run Leiden clustering in parallel?

Thanks!

neo4j 5.28.1
neo4j-rust-ext 5.28.1.0

Neo4j Enterprise, version 5.25.1

Hi @jacob.pfeil,

How many cores does your system have available? If you don't specify concurrency when calling an algorithm, it defaults to 4. You could try increasing it, e.g., CALL gds.leiden.stream(name, {concurrency: 10})

Note also that Leiden is not fully parallelized yet; some steps run single-threaded. As a test, you could swap in Louvain to see whether that leads to higher core utilization.
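Since some Leiden steps run single-threaded, one way to think about this is as a core budget: several calls at modest per-call concurrency may keep more cores busy than one call at high concurrency. The sketch below is only an illustration of that arithmetic, not a GDS recommendation. The helper names (plan_workers, leiden_call) and the idea of dividing cores evenly are assumptions; total_cores means the cores on the database server, not on the client.

```python
def plan_workers(total_cores: int, per_call_concurrency: int) -> int:
    """Hypothetical helper: how many Leiden calls to run side by side so that
    workers * per_call_concurrency does not exceed the server's core count."""
    if total_cores < 1 or per_call_concurrency < 1:
        raise ValueError("core counts must be positive")
    return max(1, total_cores // per_call_concurrency)


def leiden_call(graph_name: str, per_call_concurrency: int):
    """Build the query text and parameters for one streamed Leiden call.
    The concurrency value is passed through the algorithm's config map."""
    query = (
        "CALL gds.leiden.stream($graph_name, {concurrency: $concurrency}) "
        "YIELD nodeId, communityId "
        "RETURN nodeId, communityId"
    )
    params = {"graph_name": graph_name, "concurrency": per_call_concurrency}
    return query, params
```

For example, with 32 server cores and the default per-call concurrency of 4, plan_workers(32, 4) suggests 8 calls in flight at once. Whether that split is actually faster than fewer, wider calls is something to measure on your own data.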

This is all from a GDS perspective. Perhaps you could try asking in the Cypher subsection, where people should know more about optimizing parallel tasks.

Best regards,
Ioannis.

Is there a reason that you cannot issue multiple transactions concurrently?

e.g., issue the query below multiple times, concurrently?

CALL gds.leiden.stream()
YIELD nodeId, communityId
RETURN nodeId, communityId

With such a simple query I'm surprised parallel runtime gets you any speedup at all, to be honest, because even in parallel runtime the stream of results is serialized.
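Issuing the same query many times concurrently could look roughly like the sketch below, using a client-side thread pool with the Python driver. It assumes the graphs are already projected and addressed by name, as in the original post. The function names (cluster_one, cluster_many) and the default worker count are made up for illustration. Each task opens its own session, because the driver object can be shared across threads but sessions cannot.

```python
from concurrent.futures import ThreadPoolExecutor

LEIDEN_QUERY = """
CALL gds.leiden.stream($graph_name, {concurrency: $concurrency})
YIELD nodeId, communityId
RETURN nodeId, communityId
"""


def cluster_one(driver, graph_name, concurrency=1):
    """Run one Leiden call in its own session and collect the assignments."""
    with driver.session() as session:
        # session.run() executes the query as an auto-commit transaction.
        result = session.run(
            LEIDEN_QUERY, graph_name=graph_name, concurrency=concurrency
        )
        # Consume the result while the session is still open.
        return [(rec["nodeId"], rec["communityId"]) for rec in result]


def cluster_many(driver, graph_names, max_workers=8, concurrency=1):
    """Run cluster_one for every graph name, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(cluster_one, driver, name, concurrency)
            for name in graph_names
        }
        # Collect results in submission order; errors are re-raised here.
        return {name: fut.result() for name, fut in futures.items()}
```

Here driver would be created with neo4j.GraphDatabase.driver(uri, auth=...). With millions of graph names it is probably better to feed cluster_many in batches rather than submitting everything at once, since every result is held in memory until the function returns.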

Thanks @alex.averbuch. I'm trying to run ~4 million clustering calls. Running them serially would take around 10 days on my server, so parallelization would be really helpful here. When I run the concurrent transactions call I get this error:

neo4j.exceptions.DatabaseError: {code: Neo.DatabaseError.Transaction.TransactionStartFailed} {message: A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.}
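The error message itself points at the likely cause: CALL { ... } IN TRANSACTIONS must run in an implicit (auto-commit) transaction. In the Python driver, session.run() uses an auto-commit transaction, whereas session.execute_read(), session.execute_write(), session.begin_transaction() and driver.execute_query() all run the query inside an explicit transaction. A minimal sketch of the auto-commit variant follows. The function name stream_leiden_batch is made up, the names are passed as a $names parameter instead of string formatting, and whether GDS procedures behave well inside concurrent subquery transactions is an assumption that has not been verified here.

```python
BATCH_QUERY = """
UNWIND $names AS name
CALL (name) {
  CALL gds.leiden.stream(name)
  YIELD nodeId, communityId
  RETURN nodeId, communityId
} IN 5 CONCURRENT TRANSACTIONS
RETURN name, nodeId, communityId
"""


def stream_leiden_batch(driver, names):
    """Run the batched query through session.run(), i.e. as an auto-commit
    transaction, which CALL { ... } IN TRANSACTIONS requires."""
    with driver.session() as session:
        result = session.run(BATCH_QUERY, names=list(names))
        # Read all rows before the session closes.
        return [
            (rec["name"], rec["nodeId"], rec["communityId"]) for rec in result
        ]
```

The query returns name alongside each row so that assignments from different graphs can be told apart once they are merged into a single result stream.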