Creating a family node for connected nodes

dors · May 29, 2023, 10:17pm

Hello everyone,
I'm trying to build a device graph in which many different user identifiers are connected to a user node.
My goal is to group the user nodes ("Row" nodes) that share at least one identifier into a "family" node.

Example of raw data:

As you can see, rows 1 and 2 share an identifier, rows 2 and 4 share an identifier, and rows 3 and 4 share an identifier.
My ultimate goal is to create a family node whose index is the minimum row index in the group, in this case "1".
The scale is approximately 36M nodes (11.5M row nodes, rest are unique identifiers), and 46M connections.
I know the initial load is significant, but the incremental uploads would be much smaller.
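
For reference, the grouping described above is a connected-components problem: rows linked through any chain of shared identifiers belong to one family. Below is a minimal Python sketch of that logic using a union-find structure. The identifiers "a", "b", "c" and the function name assign_families are hypothetical, chosen only to reproduce the sharing pattern described above (rows 1-2, 2-4, and 3-4); this is an illustration of the target result, not the query I ran.

```python
def assign_families(row_identifiers):
    """Map each row index to the smallest row index in its connected group."""
    # Every row starts as the root of its own group.
    parent = {row: row for row in row_identifiers}

    def find(row):
        # Walk up to the root, shortening the path along the way.
        while parent[row] != row:
            parent[row] = parent[parent[row]]
            row = parent[row]
        return row

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so the root
            # of a group is always its minimum row index.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    first_row_for = {}  # identifier -> first row seen using it
    for row, identifiers in row_identifiers.items():
        for identifier in identifiers:
            if identifier in first_row_for:
                union(row, first_row_for[identifier])
            else:
                first_row_for[identifier] = row

    return {row: find(row) for row in row_identifiers}


# Hypothetical data: rows 1-2 share "a", rows 2-4 share "b", rows 3-4 share "c".
rows = {1: {"a"}, 2: {"a", "b"}, 3: {"c"}, 4: {"b", "c"}}
families = assign_families(rows)
print(families)
```

With this sample, all four rows map to family 1, which is the result I'm after.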

After running into multiple memory issues, the best I was able to achieve is:

CALL apoc.periodic.iterate(
  "UNWIND range(1,11500000) AS id RETURN id",
  'MATCH (a:Row {index: id})-[:USES]->(c)<-[:USES]-(b:Row)
   WITH a, collect(DISTINCT b) AS familyMembers,
        CASE WHEN a.index < min(b.index) THEN a.index ELSE min(b.index) END AS min_index_final
   MERGE (f:Family {index: min_index_final})
   MERGE (a)-[:BELONGS_TO]->(f)
   WITH min_index_final, familyMembers, f
   UNWIND familyMembers AS member
   MERGE (member)-[r:BELONGS_TO]->(f)',
  {batchSize: 5000})

Basically, it iterates over all the Row nodes, finds the rows connected in the first degree (through a shared identifier), and creates a master node whose index is the lowest row index in the cluster (to keep it deterministic).

The result (in the comment below, since I can't post two pictures):

As you can see, there are two issues here:

1. Redundant families were created. When iterating over rows 1 and 2, everything goes smoothly and rows 1, 2, and 4 get connected to family 1. When iterating over rows 3 and 4, row 1 isn't an immediate relation, so the query can't pick up that index, which results in additional families. I could potentially clean this up later, but row 3 still wouldn't connect to family 1 without extending the relationship degree.
2. At the scale of millions of rows, it takes forever. It took me 4 hours to get through 600k of the 11.5M rows.
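
To make the first issue concrete, here is a small Python simulation of what the one-hop query does per row. It is only a sketch: the identifiers "a", "b", "c" and the function name one_hop_families are hypothetical, and it mimics the MATCH/MERGE logic in plain dictionaries rather than running against Neo4j.

```python
from collections import defaultdict


def one_hop_families(row_identifiers):
    """Mimic the per-row query: family index = min over the row and its direct neighbours."""
    # Invert the data: identifier -> rows that use it.
    rows_by_identifier = defaultdict(set)
    for row, identifiers in row_identifiers.items():
        for identifier in identifiers:
            rows_by_identifier[identifier].add(row)

    families = defaultdict(set)  # family index -> member rows
    for row, identifiers in row_identifiers.items():
        # Rows that share at least one identifier with this row (first degree only).
        neighbours = set()
        for identifier in identifiers:
            neighbours |= rows_by_identifier[identifier]
        neighbours.discard(row)
        if not neighbours:
            continue  # the MATCH would return nothing for an isolated row
        family_index = min(row, min(neighbours))
        families[family_index] |= {row} | neighbours
    return dict(families)


# Hypothetical data: rows 1-2 share "a", rows 2-4 share "b", rows 3-4 share "c".
rows = {1: {"a"}, 2: {"a", "b"}, 3: {"c"}, 4: {"b", "c"}}
result = one_hop_families(rows)
print(sorted(result))
```

On this sample the simulation creates families 1, 2, and 3 instead of a single family 1, because rows 3 and 4 never see row 1 among their direct neighbours.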

I would love to hear whether I'm doing anything wrong, or whether anything could make this smarter or faster, as I'm running out of ideas.
Thanks

dors · May 29, 2023, 10:18pm

dors · May 30, 2023, 11:24am

I was able to achieve the end result by first creating direct "RELATED" connections between the Row nodes, and then looking for connections up to the second degree.
This works on a small subset of the data, but it does not perform well when I apply it to the whole dataset.
How can this be made more efficient? Maybe by deleting the "RELATED" connections while creating the families?

Query:

CALL apoc.periodic.iterate(
  "UNWIND range(1,4) AS id RETURN id",
  'MATCH (a:Row {index: id})-[r:RELATED*..2]->(b:Row)
   WHERE b.index < id
   WITH a, collect(DISTINCT b) AS familyMembers,
        CASE WHEN a.index < min(b.index) THEN a.index ELSE min(b.index) END AS min_index_final
   MERGE (f:Family {index: min_index_final})
   MERGE (a)-[:BELONGS_TO]->(f)
   WITH min_index_final, familyMembers, f
   UNWIND familyMembers AS member
   MERGE (member)-[r:BELONGS_TO]->(f)',
  {batchSize: 5000})
