Creation of relationship in bulk using apoc.periodic.iterate

jishadp8606 · August 10, 2023, 2:24am

I'm trying to create relationships between nodes of the same label under some condition. I have tried all the methods listed in this community, but none of them works for my use case. The code below runs without error, but no relationships are created. I also need to add one more condition: the relationship should be created only if the sum of the Levenshtein distances between the properties of the two nodes is less than a specific value. I'm not sure where in the query that condition should go.

CALL apoc.periodic.iterate("
MATCH (n:sampl2000)
WITH collect(n) as users
WITH users
UNWIND users as u
RETURN u, users
",
"
WITH u, users
FOREACH(user in users|
FOREACH( n in CASE WHEN id(u)<id(user) THEN [1] ELSE [] END|
MERGE (u)-[r:SAME_USER2K]-(user)
))
",
{batchSize:100,parallel:true, retries: 3 }) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages

glilienfield · August 10, 2023, 2:57pm

I got this to create the relationships. Note that I removed the node label so it would work with my data.

CALL apoc.periodic.iterate("
MATCH (n)
WITH collect(n) as users
WITH users
UNWIND users as u
RETURN u, users
",
"
with u, users
FOREACH(user in users | 
    FOREACH( n in CASE WHEN id(u)<id(user) THEN [1] ELSE [] END | 
        MERGE (u)-[r:SAME_USER2K]-(user)))
",
{batchSize:100,parallel:true, retries: 3 }) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages

What I gather from the query is that you want to create a relationship between each pair of nodes. I think the following is easier to understand, less complex, and more efficient. It also lets you easily add further 'where' predicates, as you want to do.

CALL apoc.periodic.iterate("
MATCH (n)
WITH collect(n) as users
UNWIND users as a
UNWIND users as b
WITH a, b
WHERE id(a) < id(b)
//add additional predicates to filter nodes out. 
RETURN a, b
",
"
MERGE (a)-[r:SAME_USER2K]-(b)
",
{batchSize:100,parallel:false, retries: 3 }) 
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages

Note, I was experiencing locks when running it in parallel. I switched parallel execution off and it ran. It ran significantly faster than your original query, even without parallel execution.
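
For anyone wondering what the `id(a) < id(b)` predicate is doing: the double UNWIND yields every ordered pair, including (a, a) and both (a, b) and (b, a), and the filter keeps each unordered pair exactly once. Here is a minimal Python sketch of the same idea. The integer ids are made up for illustration and stand in for Neo4j node ids.

```python
from itertools import combinations, product

# Hypothetical node ids standing in for id(n); not real data.
node_ids = [10, 11, 12, 13]

# Double UNWIND: every ordered pair, self-pairs included.
all_pairs = list(product(node_ids, repeat=2))

# WHERE id(a) < id(b): one row per unordered pair, no self-pairs.
filtered = [(a, b) for a, b in all_pairs if a < b]

# Same rows as choosing 2 ids out of n (ids are in ascending order here).
assert filtered == list(combinations(node_ids, 2))
print(len(all_pairs), len(filtered))
```

So for n nodes the filter cuts n*n candidate rows down to n*(n-1)/2, which is also the number of MERGE calls the second statement has to process.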

jishadp8606 · August 11, 2023, 8:50am

But when I tried it, it was slower than the following:
from neo4j import GraphDatabase

# uri, username, and password are assumed to be defined earlier
driver = GraphDatabase.driver(uri, auth=(username, password))
with driver.session() as session:
    query = """
        MATCH (p1:sample26K) WITH p1
        MATCH (p2:sample26K)
        WHERE id(p1) < id(p2) AND
          (apoc.text.levenshteinDistance(p1.email, p2.email) +
           apoc.text.levenshteinDistance(p1.phone, p2.phone) +
           apoc.text.levenshteinDistance(p1.mobilephone, p2.mobilephone) +
           apoc.text.levenshteinDistance(p1.street, p2.street)) < 40
        MERGE (p1)-[:SAME_USER26K]->(p2)
    """
    session.run(query)
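
To make the filter in the query above concrete, here is a rough pure-Python sketch of the same predicate: sum the edit distances over the four properties and compare against a threshold. The `levenshtein` helper is a textbook dynamic-programming implementation written for this example, not APOC's code, and the two records are invented.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance using the classic two-row dynamic-programming table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Property names taken from the query above.
FIELDS = ("email", "phone", "mobilephone", "street")


def same_user(p1: dict, p2: dict, threshold: int = 40) -> bool:
    """True when the summed distance over FIELDS is below the threshold."""
    total = sum(levenshtein(p1[f], p2[f]) for f in FIELDS)
    return total < threshold


# Hypothetical records, for illustration only.
a = {"email": "ann@example.com", "phone": "5550100",
     "mobilephone": "5550101", "street": "1 Main St"}
b = {"email": "bob@example.org", "phone": "5550199",
     "mobilephone": "5550188", "street": "9 Oak Ave"}

print(same_user(a, b))               # threshold from the query above
print(same_user(a, b, threshold=5))  # a much tighter threshold
```

One thing the sketch makes visible: an edit distance can never exceed the length of the longer string, so when the four fields are short (here their lengths add up to 38), even two unrelated records stay under 40 and get linked. Whether that applies to the real data depends on how long the actual property values are.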

valerio.malenchino · August 11, 2023, 3:17pm

Hi,
I don't know which version of Neo4j you are using, but if it supports CALL{} IN TRANSACTIONS (>=4.4?) there is a simpler and faster way of writing that query.

CALL {
    MATCH (n)
    MATCH (m WHERE id(n) < id(m))
    MERGE (m)-[r:SAME_USER2K]-(n)
} IN TRANSACTIONS OF 100 ROWS

(if you are trying it out using Browser, you need to prefix the query with :auto)

I used Neo4j 5.10, generated the data from :play movies, and on my laptop I get:
apoc query parallel: 295ms
apoc query serial: 844ms
pure Cypher: 143ms.

All query versions generate 14535 relationships, so I hope they do the same thing, didn't spend much time checking.
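
A quick sanity check on that count, as a sketch: if each unordered pair of nodes gets exactly one relationship, the total must equal n*(n-1)/2 for some whole number of nodes n. Solving for n from the reported figure:

```python
from math import comb, isqrt

relationships = 14535  # figure reported above

# Solve n*(n-1)/2 = relationships for n using the quadratic formula.
n = (1 + isqrt(1 + 8 * relationships)) // 2

# The count is consistent only if n nodes give exactly that many pairs.
assert comb(n, 2) == relationships
print(n)
```

This works out to 171 nodes, so the figure is at least consistent with one relationship per pair and no duplicates.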

In my opinion, the pure Cypher query is much simpler.

valerio.malenchino · August 11, 2023, 3:27pm

The straightforward Cypher query:

MATCH (n)
MATCH (m WHERE id(n) < id(m))
MERGE (m)-[r:SAME_USER2K]-(n)

finishes in 140 ms, so are you sure you need to split it into batches? Simple is beautiful.

glilienfield · August 11, 2023, 3:27pm

I agree, but some people like using apoc. I avoid it unless necessary. The call subquery in transactions is much cleaner.

jishadp8606 · August 22, 2023, 4:38am

I've tried both with and without CALL. Neither scales to large numbers of nodes: for 24K nodes, it takes around 20 minutes.

finbar.good · September 13, 2023, 2:19pm

I tried out the apoc.periodic.iterate function versus the CALL { } IN TRANSACTIONS OF N ROWS on a dataset with 24k nodes, and the latter was consistently faster.

Other options for improving performance: (1) narrow the Levenshtein distance threshold for treating a pair of nodes as the same user; (2) replace MERGE with CREATE.

But the real issue is the sheer number of pairwise comparisons, which grows as the square of the number of nodes returned by MATCH (n:sampl2000). For a match count on the order of 10^4, there are on the order of 10^8 ordered pairs at the Levenshtein calculation step (about half that once the id(a) < id(b) filter removes mirrored pairs). Even if only 1% of pairs pass that filter, there will still be roughly 10^6 relationships to create.
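
To put rough numbers on that, here is a small sketch that counts unordered pairs for the dataset sizes mentioned in this thread (2,000, 24,000 and 26,000 nodes). The factor of four assumes one Levenshtein call per compared property, as in the four-property query posted earlier.

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered node pairs, i.e. rows surviving id(a) < id(b)."""
    return comb(n, 2)


# Assumption: four properties compared per pair (email, phone, mobilephone, street).
CALLS_PER_PAIR = 4

for n in (2_000, 24_000, 26_000):
    pairs = pair_count(n)
    print(f"{n:>6} nodes -> {pairs:>12,} pairs, "
          f"{CALLS_PER_PAIR * pairs:>13,} Levenshtein calls")
```

Going from 2,000 to 24,000 nodes multiplies the node count by 12 but the pair count by roughly 144, which is why a query that feels instant on a sample can take many minutes on the full set regardless of how it is batched.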
