I'm trying to create relationships between nodes of the same label, subject to a condition. I have tried all the methods listed in this community, but none of them works for my use case. The code below runs without error, but no relationship is created. I also need to add one more condition: the relationship should be created only if the sum of the Levenshtein distances between the properties of the two nodes is less than a specific value. I'm not sure where in the query that condition should go.
CALL apoc.periodic.iterate("
MATCH (n:sampl2000)
WITH collect(n) as users
WITH users
UNWIND users as u
RETURN u, users
",
"
WITH u, users
FOREACH(user in users|
FOREACH( n in CASE WHEN id(u)<id(user) THEN [1] ELSE END|
MERGE (u)-[r:SAME_USER2K]-(user)
))
",
{batchSize:100,parallel:true, retries: 3 }) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
I got this to create the relationships. Note, I had removed the node label so it worked with my data.
CALL apoc.periodic.iterate("
MATCH (n)
WITH collect(n) as users
WITH users
UNWIND users as u
RETURN u, users
",
"
with u, users
FOREACH(user in users |
FOREACH( n in CASE WHEN id(u)<id(user) THEN [1] ELSE [] END |
MERGE (u)-[r:SAME_USER2K]-(user)))
",
{batchSize:100,parallel:true, retries: 3 }) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
What I gather from the query is that you want to create a relationship between each pair of nodes. I think the following is easier to understand, less complex, and more efficient. It also lets you add further WHERE predicates easily, which is what you want to do.
CALL apoc.periodic.iterate("
MATCH (n)
WITH collect(n) as users
UNWIND users as a
UNWIND users as b
WITH a, b
WHERE id(a) < id(b)
//add additional predicates to filter nodes out.
RETURN a, b
",
"
MERGE (a)-[r:SAME_USER2K]-(b)
",
{batchSize:100,parallel:false, retries: 3 })
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Note, I was hitting lock contention when running it in parallel. I switched parallel execution off and it ran. It ran significantly faster than your original query, even without parallel execution.
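As a rough illustration of why the id(a) < id(b) predicate is enough, here is a minimal Python sketch (the function name and sample ids are made up for illustration). It mimics the double UNWIND followed by the filter and checks that every unordered pair comes out exactly once, with no self-pairs.

```python
from itertools import combinations


def filtered_pairs(ids):
    """Mimic UNWIND users AS a, UNWIND users AS b, WHERE id(a) < id(b)."""
    return [(a, b) for a in ids for b in ids if a < b]


# Hypothetical node ids, in no particular order.
sample_ids = [3, 1, 2, 5]
pairs = filtered_pairs(sample_ids)

# Every unordered pair appears once, and no node is paired with itself.
expected = {tuple(sorted(p)) for p in combinations(sample_ids, 2)}
assert set(pairs) == expected
assert len(pairs) == len(expected)
```

The same reasoning is why the undirected MERGE only has to be attempted once per pair.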
When I tried it, though, the apoc.periodic.iterate version was slower than the following plain query run from the Python driver:
from neo4j import GraphDatabase

# uri, username and password are assumed to be defined elsewhere.
driver = GraphDatabase.driver(uri, auth=(username, password))
with driver.session() as session:
    # Link each pair once (id(p1) < id(p2)) when the summed edit distance is under 40.
    query = """
    MATCH (p1:sample26K) WITH p1
    MATCH (p2:sample26K)
    WHERE id(p1) < id(p2) AND
      (apoc.text.levenshteinDistance(p1.email, p2.email) +
       apoc.text.levenshteinDistance(p1.phone, p2.phone) +
       apoc.text.levenshteinDistance(p1.mobilephone, p2.mobilephone) +
       apoc.text.levenshteinDistance(p1.street, p2.street)) < 40
    MERGE (p1)-[:SAME_USER26K]->(p2)
    """
    session.run(query)
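To make the WHERE clause in the query above concrete, here is a minimal pure-Python sketch of the same predicate. It is only an illustration, not how APOC implements it: the names levenshtein, FIELDS and same_user are made up here, and treating a missing property as a non-match is an assumption of this sketch.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete a character from s
                current[j - 1] + 1,      # insert a character into s
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


# Property names taken from the query above.
FIELDS = ("email", "phone", "mobilephone", "street")


def same_user(p1, p2, threshold=40):
    """Return True when the summed edit distance over FIELDS is below threshold.

    Assumption for this sketch: a missing property means "no match".
    """
    total = 0
    for field in FIELDS:
        a, b = p1.get(field), p2.get(field)
        if a is None or b is None:
            return False
        total += levenshtein(a, b)
    return total < threshold
```

Running this on a handful of records exported from the database is a cheap way to sanity-check the threshold before running the full pairwise query.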
Hi,
I don't know which version of Neo4j you are using, but if it supports CALL{} IN TRANSACTIONS (>=4.4?) there is a simpler and faster way of writing that query.
CALL {
MATCH (n)
MATCH (m WHERE id(n) < id(m))
MERGE (m)-[r:SAME_USER2K]-(n)
} IN TRANSACTIONS OF 100 ROWS
(if you are trying it out using Browser, you need to prefix the query with :auto)
I used Neo4j 5.10, generated the data from :play movies, and on my laptop I get:
apoc query parallel: 295ms
apoc query serial: 844ms
pure Cypher: 143ms.
All query versions generate 14535 relationships, so I assume they do the same thing, but I didn't spend much time checking.
In my opinion, the pure Cypher query is much simpler.
The straightforward Cypher query:
MATCH (n)
MATCH (m WHERE id(n) < id(m))
MERGE (m)-[r:SAME_USER2K]-(n)
finishes in 140ms, so are you sure you need to split it into batches? Simple is beautiful.
I agree, but some people like using apoc. I avoid it unless necessary. The call subquery in transactions is much cleaner.
I've tried both with and without the CALL option. Neither scales to large numbers of nodes. For 24K nodes, it's taking around 20 minutes.
I tried out the apoc.periodic.iterate function versus CALL { } IN TRANSACTIONS OF N ROWS on a dataset with 24k nodes, and the latter was consistently faster.
Other performance improvement options: (1) lower the Levenshtein distance threshold for considering a pair of nodes the same user; (2) replace MERGE with CREATE.
But the real issue is the sheer number of pairwise comparisons. They scale as the square of the number of results from MATCH (n:sampl2000). For matches on the order of 10^4, there will be on the order of 10^8 pairwise comparisons at the Levenshtein calculation step. Even if only 1% of pairs pass that filter, there will still be 10^6 relationships to create.
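To put numbers on that, here is a small Python sketch of the pair count (the function name is made up for illustration). With the id(a) < id(b) filter each unordered pair is compared once, so the exact count is n(n-1)/2, which still grows quadratically.

```python
def pair_count(n):
    """Number of unordered pairs among n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Node counts mentioned in this thread: 2K, 24K and 26K.
for n in (2_000, 24_000, 26_000):
    print(f"{n:>6} nodes -> {pair_count(n):>12,} pairs")
```

Each of those pairs costs four Levenshtein calls in the query with the distance filter, so cutting the number of candidate pairs matters more than how the writes are batched.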