I found a solution for now...
Problem Description
In my graph, some Person nodes are already assigned to clusters. These clusters are represented by dedicated Cluster nodes that connect to all members of that group.
Now I want to take the unclustered persons from a given list of IDs and group them into new clusters based on their graph connectivity, i.e., whether they're connected through any path (via relationships like HAS_FRIEND, LIVES_IN, etc.).
The tricky part is: if you naively process all persons in the list, each node could end up creating its own cluster with all the others, even if they're already part of the same connected subgraph. For example, if there are 100 connected nodes in the list, Person 1 clusters with the other 99, Person 2 does the same, and so on, which obviously defeats the purpose.
Solution Strategy
To prevent that, I do the following:
- I start with a list of unclustered person IDs.
- For each ID, I retrieve the full connected subgraph of Person nodes (using apoc.path.subgraphNodes).
- I then sort the found nodes by their Id and select the one with the lowest Id as the designated "cluster starter".
- Only those lowest-ID nodes are used for further cluster creation logic. This ensures each cluster is only created once per connected group.
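For a single start ID, the steps above might look like the following minimal sketch. It assumes that only HAS_FRIEND and LIVES_IN matter (the real filter would list every relevant relationship type), uses 101 as an example ID, and uses min() instead of sorting, which should pick the same lowest Id.

```cypher
// Sketch for one start ID; the two-type relationship filter is an assumption
MATCH (startNode:Person {Id: 101})
CALL apoc.path.subgraphNodes(startNode, {
    minLevel: 0,
    relationshipFilter: "HAS_FRIEND|LIVES_IN"
}) YIELD node
WHERE node:Person
// Lowest Id among the reachable Person nodes, plus the size of the group
RETURN min(node.Id) AS clusterStarterId, count(node) AS groupSize
```

Running this for two IDs from the same connected group should give the same clusterStarterId, which is what makes the deduplication work.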
Query
// Example list of unclustered person IDs
WITH [101,108,104,106,107,110,112] AS personIds
UNWIND personIds AS id
MATCH (startNode:Person {Id: id})
// Collect every node reachable from startNode over the listed relationship types
CALL apoc.path.subgraphNodes(startNode, {
  minLevel: 0,
  uniqueness: "NODE_GLOBAL",
  relationshipFilter: "HAS_FRIEND|LIVES_IN|..." // Include all relevant relationship types
}) YIELD node
WHERE node:Person
// Per start ID, keep the reachable Person with the lowest Id
WITH node, id
ORDER BY node.Id
WITH head(collect(node)) AS starterNode, id
// Start IDs in the same connected group share a starter, so DISTINCT leaves one per group
WITH collect(DISTINCT starterNode) AS clusterStarters
RETURN clusterStarters
What this does
This returns a list of the lowest-ID Person nodes from each connected group. These can then be used as the entry points for actual cluster generation, avoiding duplication and inefficiency.
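As a rough illustration of that next step, a starter could seed a new cluster along these lines. This is only a sketch under assumptions: the HAS_MEMBER relationship type and its direction are hypothetical (the actual relationship between Cluster and Person nodes isn't named here), the starter IDs are placeholders, and the relationship filter again lists just two types.

```cypher
// Placeholder starter IDs, standing in for the Ids of the clusterStarters result
WITH [101, 104] AS starterIds
UNWIND starterIds AS starterId
MATCH (starter:Person {Id: starterId})
// Re-expand the connected group once per starter
CALL apoc.path.subgraphNodes(starter, {
    minLevel: 0,
    relationshipFilter: "HAS_FRIEND|LIVES_IN"
}) YIELD node
WHERE node:Person
WITH starter, collect(node) AS members
// One new Cluster node per starter, linked to every member of its group
CREATE (c:Cluster)
FOREACH (member IN members | MERGE (c)-[:HAS_MEMBER]->(member))  // HAS_MEMBER is hypothetical
RETURN starter.Id AS starterId, size(members) AS memberCount
```

Because each connected group has exactly one starter, this would create one Cluster node per group rather than one per listed person. Note that it traverses each group a second time, so on large graphs it may be worth carrying the member list over from the first query instead.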
Let me know if you see any optimization potential, especially for large graphs, or if you know of a more elegant way to do the work.
Cheers!