Entity resolution at scale

rc1 · July 6, 2022, 11:30am

hi,

I am a newbie at Neo4j. My team and I have a graph more than 118 million person and address entities. We would like to run similarity graph algorithms and functions to determine which entities are the same (entity resolution). Node Similarity has time complexity O(n3) and space complexity O(n2) to quote from the documentation.

How do we run these algorithms at scale? Do we find communities first and them run the algorithms against a subsection of the graph? Would we loose precision?

Does Neo4j scale horizontally -can we distribute the work?

Any assistance would be greatly appreciated.

Thanks

ameyasoft · July 7, 2022, 7:46am

Select one Address node and compare its 'address' property value with all other Address node 'address' values using apoc.text.jaroWinklerDistance algorithm. From this you can get similarity strength values. Then you can set your limits on the similarity strengths to classify the Address node similarity.

luiseduardo · July 6, 2022, 11:13pm

Hi @rc !! Welcome to Neo4j!

There's different ways to do this, but it really depends on the use case. You can definitely generate communities firstly to separate your data based on specific relationships to specific nodes. This would mean that you're only interested on similarities inside the communities; you wouldn't get loss of precision as the algorithm uses the graph structure and not some ML training-like pipeline (although, you can do ML pipelines with GDS). I'd recommend to generate these communities to understand a bit more of your graph structure and then run similarities in the grreatest community. But, again it depends on the use case.

So, maybe if you share a bit more of the use case we can get to a better solution.

Currently GDS only works on single instances, you can have it in a cluster, but the GDS workload would need to be done in a single machine.

Cheers!

rc1 · July 7, 2022, 7:59am

thank you! I will try this. How would we scale this as comparing each node to all other nodes has a time complexity of O(n2)?

rc1 · July 7, 2022, 7:56am

Thanks for your response. We need to find which people are 'the same'. We need to compare whether the person has the same surname, forename, dob and postcode and if they are an exact match we can say they are the same. We have a model where each of these are a node i.e. they are not properties of a person or address.

We were thinking of using a weighted similarity algorithm on the surname, forename, dob and postcode nodes i.e. a match on forename and dob can have a lower weight than surname and dob etc.

If would like to optimise it by running the algorithm by community/subgraphs

Topic		Replies	Views
Cosine similarity on 1M person nodes Neo4j Graph Platform migrated	5	1013	August 22, 2023
Need idea about Cypher query writing Operations operations	1	429	July 27, 2020
(no changes, no records) for similarity score Neo4j Graph Platform	4	207	February 25, 2023
Can this query be optimized? GDS Node similarity Graph Algorithms/Graph Data Science performance , cypher , operations	3	742	August 3, 2021
Find similarity of given node with entire graph Neo4j Graph Platform migrated	9	273	December 8, 2022

Entity resolution at scale

Related topics