Hello everyone,
I'm trying to build a device graph where I have lots of different user identifiers connected together to a user node.
My goal is to group the user nodes ("Row" nodes) that share at least one identifier under a common "family" node.
Example of raw data:
As you can see, rows 1 and 2 share an identifier, rows 2 and 4 share an identifier, and rows 3 and 4 share an identifier.
My ultimate goal is to create a single family node for all of these rows, with the same index as the minimum row index in the group, hence "1".
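To make the target concrete, here is a minimal Python sketch of the grouping described above, done outside Neo4j with a union-find structure. The identifier values are made up and only reproduce the sharing pattern (1-2, 2-4, 3-4); `assign_families` is a hypothetical helper name, not part of any library.

```python
# Hypothetical sample data: row index -> identifiers used by that row.
# The identifier names are invented; only the sharing pattern matters
# (rows 1-2 share one, rows 2-4 share one, rows 3-4 share one).
ROWS = {
    1: {"cookie_a"},
    2: {"cookie_a", "email_b"},
    3: {"device_c"},
    4: {"email_b", "device_c"},
}


def assign_families(rows):
    """Return {row index: family index}, where the family index is the
    smallest row index in the row's connected component."""
    parent = {row: row for row in rows}

    def find(row):
        # Walk up to the root, halving the path as we go.
        while parent[row] != row:
            parent[row] = parent[parent[row]]
            row = parent[row]
        return row

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the root, so every root is the
            # minimum row index of its component.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    # Link every row to the first row seen for each identifier.
    first_row_for = {}
    for row, identifiers in rows.items():
        for identifier in identifiers:
            if identifier in first_row_for:
                union(row, first_row_for[identifier])
            else:
                first_row_for[identifier] = row

    return {row: find(row) for row in rows}


families = assign_families(ROWS)
print(families)
```

With this sample data all four rows end up in family 1, which is the result I would like to see in the graph.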
The scale is approximately 36M nodes (11.5M row nodes, rest are unique identifiers), and 46M connections.
I know the initial load is significant, but the incremental uploads would be much smaller.
After running into multiple memory issues, the best I was able to achieve is:
CALL apoc.periodic.iterate(
  "UNWIND range(1,11500000) AS id RETURN id",
  'MATCH (a:Row {index: id})-[:USES]->(c)<-[:USES]-(b:Row)
   WITH a, collect(DISTINCT b) AS familyMembers,
        CASE WHEN a.index < min(b.index) THEN a.index ELSE min(b.index) END AS min_index_final
   MERGE (f:Family {index: min_index_final})
   MERGE (a)-[:BELONGS_TO]->(f)
   WITH min_index_final, familyMembers, f
   UNWIND familyMembers AS member
   MERGE (member)-[r:BELONGS_TO]->(f)',
  {batchSize: 5000})
Basically, it iterates through all the row nodes, finds the rows connected in the first degree (through a shared identifier), and creates a family node carrying the lowest index in that group (to keep it deterministic).
The result is in the comment, since I can't post 2 pictures.
As you can see, there are two issues here:
- Redundant families were created. Iterating over rows 1 and 2 goes smoothly, and nodes 1, 2 and 4 get connected to family 1. But when iterating over rows 3 and 4, row 1 is not a first-degree relation, so they can't pick up that index, which results in additional families. I could potentially clean that up later, but row 3 still wouldn't connect to family 1 without extending the relationship degree.
- When applied to millions of rows, it takes forever. It took 4 hours to get through 600k rows out of 11.5M.
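The first issue can be reproduced on paper. Below is a minimal Python sketch that mimics the per-row logic of the query (first-degree neighbours only, family index = smallest index among the row and its neighbours). The identifier values and the helper name `first_degree_families` are made up for illustration, and the model ignores batching and execution order.

```python
# Hypothetical sample data: row index -> identifiers used by that row.
# Identifier names are invented; they only reproduce the sharing pattern
# (rows 1-2 share one, rows 2-4 share one, rows 3-4 share one).
ROWS = {
    1: {"cookie_a"},
    2: {"cookie_a", "email_b"},
    3: {"device_c"},
    4: {"email_b", "device_c"},
}


def first_degree_families(rows):
    """Mimic the batched query: for each row, look only at rows that
    directly share an identifier, then attach the row and those neighbours
    to the family indexed by the smallest index among them."""
    families = {}
    for a, a_ids in sorted(rows.items()):
        neighbours = {
            b for b, b_ids in rows.items() if b != a and a_ids & b_ids
        }
        if not neighbours:
            continue  # the MATCH finds nothing, so no family is created
        min_index = min(a, min(neighbours))
        # MERGE semantics: reuse the family if it exists, then add members.
        families.setdefault(min_index, set()).update({a} | neighbours)
    return families


result = first_degree_families(ROWS)
for family_index, members in sorted(result.items()):
    print(family_index, sorted(members))
```

Under this simplified model the four rows end up spread over families 1, 2 and 3 instead of a single family 1, because row 3 never sees row 1 and row 4 only sees rows 2 and 3.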
I would love to hear if there's anything I'm doing wrong, or anything that could make this smarter or faster, as I'm running out of ideas.
Thanks