WCC / Union.find algorithm - Help Requested (Large Database)

32una8524 · January 23, 2021, 6:50pm

Consider the following graph which I created using the following:

#Create Nodes for testing
create (n1:person {name:'bob'})-[:SOME_RELATIONSHIP]->(c1:ARBITRARY_LABEL {id:'1'})
create (n11:person {name:'jane'})-[:SOME_RELATIONSHIP]->(c1)
create (n2:person {name:'jack'})-[:SOME_RELATIONSHIP]->(c2:ARBITRARY_LABEL {id:'2'})
create (n22:person {name:'steve'})-[:SOME_RELATIONSHIP]->(c2)
create (n23:person {name:'jane'})-[:SOME_RELATIONSHIP]->(c2)
create (n3:person {name:'joseph'})-[:SOME_RELATIONSHIP]->(c3:ARBITRARY_LABEL {id:'3'})
create (n4:person {name:'jack'})-[:SOME_RELATIONSHIP]->(c3)
return *

Graph Overview:

People connect to a random green node via some arbitrary relationship.
There are thousands of these connections within the current graph.
It is apparent that there are duplicate people nodes with the same name connected throughout various parts of the graph.
For example, Jane is connected to 1 and also to 2.
Jack is connected to 2 and also to 3.
The others however only have a single connection to the arbitrary green nodes.

Objective

The objective is to perform a query with what I think is called (WCC - formerly union.find) or some other similar query that will dynamically search throughout the entire database to find other occurrences of each of the names and produce a single cluster for only those relationships where there is overlap.
To better describe what the final outcome should look like let's try walking through a quick example. The algorithm will look at Bob and discover that there are no other occurrences of Bob in the graph and will then move on to Jane. Because Jane is connected to 1 and 2 it merges everyone connected to 1 and 2 together and points them all to a new UNIQUELY generated node using a property as a constraint with some randomly generated id . 1 and 2 are discarded after that point as there is no use for them in the graph anymore. The duplicate node for Jane is dropped as there should only be a single occurrence of her in the graph.
At this point, bob, jane, steve, jack and jane all point to a new node. The algorithm then continues and finds that jack is connected to two separate nodes. The cycle repeats and the smaller set of names is merged into the one with the larger set of names. The final output should display a graph similar to the following:

Final Notes

I currently use apoc.merge but find that a database consisting of millions of these, the process is quite tedious and slow. There are several presentations on union.find and WCC which I think might be the better way to go. Any help would be greatly appreciated.

bratanic_tomaz · January 31, 2021, 10:08am

The WCC algorithm won't help you merge nodes. It might help you find the candidates for merging, but it seems you don't have a problem with that?

Topic		Replies	Views
After using unionFind I am getting a cluster with 10k node Neo4j Graph Platform apoc , cypher	0	808	December 5, 2018
unionFind doesn't work! Procedures & APOC	2	993	November 20, 2018
algo.unionFind() not giving me any clustering on subgraph Graph Algorithms/Graph Data Science	0	591	August 8, 2019
I am getting all nodes even with setting a value for "threshold" [unionFind] Neo4j Graph Platform	0	736	November 21, 2018
The problem with "apoc.coll.union" Procedures & APOC apoc	2	429	May 30, 2020

WCC / Union.find algorithm - Help Requested (Large Database)

Related topics