Hello
I'm currently looking for a way to split up a large detected community, or a way to identify nodes that on their own connect large sub-clusters.
For some context:
The graph contains two node labels: Case and Person.
- A Case changes a Person (or several Persons)
- A Person relates to another Person
There are around 30 million Cases and around 10 million Persons.
I've read through the available algorithms (Community detection - Neo4j Graph Data Science) and settled on "Weakly Connected Components" and "Label Propagation" for now. The WCC algorithm splits the Person clusters apart the way I need it to. Only the Person nodes are relevant for building the clusters. Currently I have a small app that copies the essential data into Neo4j to run these experiments.
In the end, the Cases within each cluster need to be sorted by their date, so that the data can be transferred in parallel but still in the correct order. Business rules on Person would fail if nodes within the same cluster were transferred in parallel.
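To make the intended transfer scheme concrete, here is a minimal sketch in plain Python. It assumes every Case can be assigned to exactly one cluster; the field names (`id`, `cluster`, `date`) and the `transfer_cluster` function are made up for illustration and stand in for the real transfer step.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import date

def transfer_cluster(cases):
    """Handle one cluster sequentially, oldest Case first."""
    ordered = sorted(cases, key=lambda c: c["date"])
    # Placeholder for the real transfer: only the resulting order is returned.
    return [c["id"] for c in ordered]

# Hypothetical sample data; "cluster" would be the component id of the Persons involved.
cases = [
    {"id": "case-3", "cluster": 1, "date": date(2021, 5, 1)},
    {"id": "case-1", "cluster": 1, "date": date(2020, 1, 15)},
    {"id": "case-2", "cluster": 2, "date": date(2020, 3, 9)},
]

# Group the Cases by cluster so that each cluster becomes one unit of work.
by_cluster = defaultdict(list)
for case in cases:
    by_cluster[case["cluster"]].append(case)

# Clusters run in parallel; inside a cluster the order stays strictly by date.
with ThreadPoolExecutor(max_workers=2) as pool:
    orders = pool.map(transfer_cluster, by_cluster.values())
    results = dict(zip(by_cluster, orders))
```

Parallelism only happens across clusters here, which is why one oversized cluster limits how much can run concurrently.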
In an abstract view:
(properties and relations are all the same but some left out for a cleaner diagram)
As expected, there are several thousand of those Person clusters.
But to parallelise the transfer, it would have been nice to have clusters, or groups of clusters, of similar size.
Sadly the largest cluster contains around 70% of all Persons.
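For the grouping itself, what I have in mind is roughly a greedy "largest first" assignment of clusters to workers. The sketch below is only an illustration: the cluster ids and sizes are invented, and `balance_clusters` is a hypothetical helper, not part of any library.

```python
import heapq

def balance_clusters(cluster_sizes, n_workers):
    """Greedily assign clusters to workers, largest cluster first."""
    # Min-heap of (current load, worker index): the least loaded worker pops first.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for cluster_id, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(cluster_id)
        heapq.heappush(heap, (load + size, worker))
    return assignment

# Invented sizes (cluster id -> number of Persons), with one dominant cluster.
sizes = {"c1": 700, "c2": 90, "c3": 80, "c4": 70, "c5": 60}
groups = balance_clusters(sizes, 3)
```

With one cluster this dominant, the worker that receives it carries most of the load no matter how the remaining clusters are distributed, which is exactly the problem.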
I now wonder if there is a way to identify nodes in that huge cluster that on their own connect large sub-clusters, so that I might be able to sort those sub-clusters for parallel transfer.
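In graph-theory terms, the nodes I mean seem to be articulation points (cut vertices): nodes whose removal splits a connected component. To show what I am after, here is a small pure-Python sketch using an iterative depth-first search on an undirected view of the Person-to-Person edges. The sample edges are invented, and whether this scales to around 10 million Persons is untested.

```python
from collections import defaultdict

def articulation_points(edges):
    """Return the nodes whose removal would disconnect their component."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    disc, low, parent = {}, {}, {}   # discovery time, low-link value, DFS parent
    cut_nodes = set()
    counter = 0

    for root in list(adj):
        if root in disc:
            continue
        parent[root] = None
        disc[root] = low[root] = counter
        counter += 1
        root_children = 0
        stack = [(root, iter(adj[root]))]
        while stack:
            node, neighbours = stack[-1]
            descended = False
            for nxt in neighbours:
                if nxt not in disc:
                    # Tree edge: descend into the unvisited neighbour.
                    parent[nxt] = node
                    disc[nxt] = low[nxt] = counter
                    counter += 1
                    if node == root:
                        root_children += 1
                    stack.append((nxt, iter(adj[nxt])))
                    descended = True
                    break
                elif nxt != parent[node]:
                    # Back edge: an already visited node is reachable from here.
                    low[node] = min(low[node], disc[nxt])
            if not descended:
                stack.pop()
                p = parent[node]
                if p is not None:
                    low[p] = min(low[p], low[node])
                    # The subtree under node cannot reach above p without p.
                    if p != root and low[node] >= disc[p]:
                        cut_nodes.add(p)
        # A DFS root is a cut node only if it has more than one DFS child.
        if root_children > 1:
            cut_nodes.add(root)
    return cut_nodes

# Invented example: two triangles of Persons that share only "p3".
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"),
         ("p3", "p4"), ("p4", "p5"), ("p5", "p3")]
connectors = articulation_points(edges)
```

The returned nodes are only candidates: the sizes of the sub-clusters left after removing each one would still have to be checked to find those that separate large parts.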
The Label Propagation algorithm splits the clusters up far more, but I find the result hard to analyse.
I need to make sure that all Cases are transferred in the right order. If I transferred clusters in parallel that are connected by "bad" edges, a business rule that relies on the order or existence of Persons would fail the whole process. So none of the Person->Person edges can really be ignored.
Having identified that large cluster, is there a way to identify such Person nodes?
Or am I using the wrong algorithms anyway?
Any hint is appreciated.