Entity resolution at scale

Node Link


I am a newbie at Neo4j.  My team and I have a graph more than 118 million person and address entities. We would like to run similarity graph algorithms and functions to determine which entities are the same (entity resolution).  Node Similarity has time complexity O(n3) and space complexity O(n2) to quote from the documentation. 

How do we run these algorithms at scale? Do we find communities first and them run the algorithms against a subsection of the graph? Would we loose precision? 

Does Neo4j scale horizontally -can we distribute the work? 

Any assistance would be greatly appreciated. 




Hi @rc !! Welcome to Neo4j!

There's different ways to do this, but it really depends on the use case. You can definitely generate communities firstly to separate your data based on specific relationships to specific nodes. This would mean that you're only interested on similarities inside the communities; you wouldn't get loss of precision as the algorithm uses the graph structure and not some ML training-like pipeline (although, you can do ML pipelines with GDS). I'd recommend to generate these communities to understand a bit more of your graph structure and then run similarities in the grreatest community. But, again it depends on the use case.

So, maybe if you share a bit more of the use case we can get to a better solution.

Currently GDS only works on single instances, you can have it in a cluster, but the GDS workload would need to be done in a single machine.


Thanks for your response. We need to find which people are 'the same'. We need to compare whether the person has the same surname, forename, dob and postcode and if they are an exact match we can say they are the same.  We have a model where each of these are a node i.e. they are not properties of a person or address.

We were thinking of using a weighted similarity algorithm on  the surname, forename, dob and postcode nodes i.e. a match on forename and dob can have a lower weight than surname and dob etc.

If would like to optimise it by running the algorithm by community/subgraphs


Graph Maven

Select one Address node and compare its 'address' property value with all other Address node 'address' values using apoc.text.jaroWinklerDistance algorithm. From this you can get similarity strength values. Then you can set your limits on the similarity strengths to classify the Address node similarity.

thank you! I will try this. How would we scale this as comparing each node to all other nodes has a time complexity of O(n2)?