Clustering of nodes. Combining nodes based on commonality with other nodes

Hello community!

I have User and Group nodes. A user can be a member of any number of groups (or not a member of any) with a directed relationship IN_GROUP.

I want to find all users who are members of the same set of groups, create a separate Cluster node for them, create an IN_CLUSTER relationship between them and this Cluster node, and also create a RELATED relationship between the cluster and the groups of these users.

Below are some screenshots of what I need:
I have users, each of whom is in a specific set of groups:

As you can see, User_1, User_2 and User_3 belong to the same set of groups (Group_1, Group_2 and Group_3): this is the first cluster. User_4 belongs to all five groups: this is the second cluster. And User_5 belongs to only one group, Group_5: this is the third cluster.
Here's what we get:

Now we connect the clusters to their users' groups:

This is what I want to end up with:

I have a query that does the job, but its run time is unacceptable.

MATCH (u:User)
WITH [(u)-[:IN_GROUP]->(g:Group) | g] as groups, u
WITH apoc.coll.sortNodes(groups, "name") as groups, u
WITH apoc.util.md5(groups) as cluster_hash, groups, u
MERGE (c: Cluster {hash: cluster_hash})
CREATE (u)-[:IN_CLUSTER]->(c)
FOREACH (group IN groups |
MERGE (c)-[:RELATED]->(group))

On my dataset (several hundred thousand users and the same number of groups), this takes about 30 minutes to complete. I need a result in 5 seconds.

I'm able to use the apoc library.

Here's a Cypher query that creates a test dataset from the above example:

CREATE (u1:User {name:"User_1"})
CREATE (u2:User {name:"User_2"})
CREATE (u3:User {name:"User_3"})
CREATE (u4:User {name:"User_4"})
CREATE (u5:User {name:"User_5"})

CREATE (g1:Group {name:"Group_1"})
CREATE (g2:Group {name:"Group_2"})
CREATE (g3:Group {name:"Group_3"})
CREATE (g4:Group {name:"Group_4"})
CREATE (g5:Group {name:"Group_5"})

MERGE (u1)-[:IN_GROUP]->(g1)
MERGE (u1)-[:IN_GROUP]->(g2)
MERGE (u1)-[:IN_GROUP]->(g3)

MERGE (u2)-[:IN_GROUP]->(g1)
MERGE (u2)-[:IN_GROUP]->(g2)
MERGE (u2)-[:IN_GROUP]->(g3)

MERGE (u3)-[:IN_GROUP]->(g1)
MERGE (u3)-[:IN_GROUP]->(g2)
MERGE (u3)-[:IN_GROUP]->(g3)

MERGE (u4)-[:IN_GROUP]->(g1)
MERGE (u4)-[:IN_GROUP]->(g2)
MERGE (u4)-[:IN_GROUP]->(g3)
MERGE (u4)-[:IN_GROUP]->(g4)
MERGE (u4)-[:IN_GROUP]->(g5)

MERGE (u5)-[:IN_GROUP]->(g5)

RETURN u1, u2, u3, u4, u5, g1, g2, g3, g4, g5

Neo4j version: 4.3.3
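
For reference, the clustering expected on the test dataset can be sketched outside the database. This is only an in-memory illustration in Python, not a replacement for the Cypher query: the `memberships` dictionary and the `cluster_by_group_set` helper are hypothetical names, and the data is copied by hand from the test dataset above.

```python
from collections import defaultdict

# Membership data copied from the test dataset in the question.
memberships = {
    "User_1": {"Group_1", "Group_2", "Group_3"},
    "User_2": {"Group_1", "Group_2", "Group_3"},
    "User_3": {"Group_1", "Group_2", "Group_3"},
    "User_4": {"Group_1", "Group_2", "Group_3", "Group_4", "Group_5"},
    "User_5": {"Group_5"},
}


def cluster_by_group_set(memberships):
    """Map each distinct group set to the sorted users that have exactly that set."""
    clusters = defaultdict(list)
    for user, groups in memberships.items():
        # frozenset makes the key independent of the order of the groups.
        clusters[frozenset(groups)].append(user)
    return {key: sorted(users) for key, users in clusters.items()}


clusters = cluster_by_group_set(memberships)
for groups, users in clusters.items():
    print(sorted(groups), "->", users)
```

Each key of `clusters` corresponds to one Cluster node, its users to the IN_CLUSTER relationships, and the groups in the key to the RELATED relationships.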

Hi @baturin.egor !

Can you try this small modification? It's hard to measure how much it helps on your complete db.

MATCH (u:User)-[:IN_GROUP]->(g:Group)
WITH collect(g) as groups, u
WITH distinct apoc.coll.sortNodes(groups, "name") as groups, collect(u) as users
WITH apoc.util.md5(groups) as cluster_hash, groups, users
MERGE (c: Cluster {hash: cluster_hash})
FOREACH (group IN groups |
MERGE (c)-[:RELATED]->(group))
FOREACH (u IN users |
CREATE (u)-[:IN_CLUSTER]->(c))

Lemme know if it helps a bit. Btw, not sure if you can turn those MERGEs into CREATEs as well.

Bennu
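
The idea behind this modification, as far as it can be read from the query, is to group users by their sorted group list first, so that the Cluster node and its RELATED relationships are handled once per cluster rather than once per user. A rough Python sketch of that effect on the test dataset follows. The helpers `cluster_key` and `plan_writes` are hypothetical, and `hashlib.md5` over the joined names is only a stand-in for `apoc.util.md5`, not a reproduction of its output. The `"|"` separator assumes group names never contain that character.

```python
import hashlib
from collections import defaultdict


def cluster_key(group_names):
    """Order-independent key: sort the names, join them, hash the result."""
    joined = "|".join(sorted(group_names))  # assumes "|" never occurs in a name
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def plan_writes(memberships):
    """Group users by cluster key so each cluster is processed once."""
    users_by_key = defaultdict(list)
    groups_by_key = {}
    for user, groups in memberships.items():
        key = cluster_key(groups)
        users_by_key[key].append(user)
        groups_by_key[key] = sorted(groups)
    return users_by_key, groups_by_key


# Membership data copied from the test dataset in the question.
memberships = {
    "User_1": ["Group_1", "Group_2", "Group_3"],
    "User_2": ["Group_3", "Group_1", "Group_2"],  # different order, same set
    "User_3": ["Group_1", "Group_2", "Group_3"],
    "User_4": ["Group_1", "Group_2", "Group_3", "Group_4", "Group_5"],
    "User_5": ["Group_5"],
}

users_by_key, groups_by_key = plan_writes(memberships)

# RELATED merges attempted when iterating per user vs. per cluster.
per_user_related = sum(len(groups) for groups in memberships.values())
per_cluster_related = sum(len(groups) for groups in groups_by_key.values())
print(per_user_related, per_cluster_related)  # 15 9
```

On five users the saving is small, but it grows with the number of users that share a cluster. How much it helps on the real database still has to be measured.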

Try this:
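// Collect each user's group ids, then keep the distinct id lists
match (a:User)-[:IN_GROUP]->(b:Group)
with distinct id(a) as ID, collect(distinct id(b)) as grps
with distinct grps as n1, size(grps) as cnt order by cnt desc
match (d:Group) where id(d) in n1
with d, n1, cnt
// Note: the cluster is keyed by group count, so different group sets of the same size share one Cluster node
merge (c:Cluster {name: ("Cluster" + " " + cnt)})
merge (c)-[:RELATED]->(d)
return c, d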

Result: