Creating edges using apoc.periodic.iterate suffers from Cartesian product

Peter_Lian · January 5, 2023, 10:21am

The following is the Cypher that I run:

CALL apoc.periodic.iterate(
  "MATCH (e:User), (f:User) WHERE e.buyid = f.sellid RETURN e, f",
  "CREATE (e)-[r:sell_prodcut_to]->(f)",
  {batchSize: 10000, parallel: true}) YIELD batch

It seems to suffer from something like a Cartesian product: with 10 billion nodes it runs for a very long time without showing any result, although with 5 billion nodes it still finishes. How can I change the Cypher so that the problem is solved?

Remark: the nodes must be labeled "User" for both seller and buyer.

Thanks.

glilienfield · January 5, 2023, 3:02pm

Do you have indexes created for these two properties?

CREATE INDEX user_sellid IF NOT EXISTS FOR (n:User) ON (n.sellid);
CREATE INDEX user_buyid IF NOT EXISTS FOR (n:User) ON (n.buyid);

Peter_Lian · January 6, 2023, 12:23am

@glilienfield

No, but excuse me, should I create the indexes before or after creating the nodes?

glilienfield · January 6, 2023, 1:54am

You should create them as early as possible, so they are leveraged as needed. Anyway, the indexes will be built in the background and will come online when finished. The above query should run faster, as well as other queries that use these properties.
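
Whether the indexes have come online can be checked with something like the following. This is a minimal sketch, assuming a Neo4j version that supports `SHOW INDEXES` (4.2 or later, as far as I know); the 300-second timeout is an arbitrary value chosen for illustration.

```cypher
// List all indexes; the state and populationPercent columns show
// whether an index is still populating or already ONLINE.
SHOW INDEXES;

// Block until all indexes are online, waiting at most 300 seconds.
CALL db.awaitIndexes(300);
```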

Peter_Lian · January 6, 2023, 3:33am

Although the query returns its result successfully in 20 min with the indexes, creating the indexes takes too much time, both before and after creating the nodes. My data needs to be deleted and added dynamically every day, so this is not what I want.

I tried the following Cypher and it worked, so I am sharing it with you:

CALL apoc.periodic.iterate(
  "MATCH (e:User) MATCH (f:User {sellid: e.buyid}) RETURN e, f",
  "CREATE (e)-[r:sell_prodcut_to]->(f)",
  {batchSize: 10000, parallel: true}) YIELD batch

It takes only 10 min without the help of an index. Creating an index would surely improve the speed, but I don't use one because creating it takes too much time.

However, is there any method by which I can avoid matching twice? For example,

CALL apoc.periodic.iterate(
  "MATCH (e:User) WHERE e.buyid = e.sellid RETURN e",
  "CREATE (e)-[r:sell_prodcut_to]->(e)",
  {batchSize: 10000, parallel: true}) YIELD batch

But then the resulting edges (sell_prodcut_to) would be duplicated...

Thanks.

glilienfield · January 6, 2023, 3:50am

I figured it would take a while to create the initial index because you have a lot of nodes. Did it really take long to save a new node once the indexes were online?

Your issue is that you are asking to find all pairs of nodes that match your criteria, then you create a relationship between each pair of nodes. Can you create the relationship when you add the nodes, instead of after all the nodes are entered?
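
A minimal sketch of that idea, assuming each new user arrives as the parameters `$sellid`, `$buyid` and `$name` (hypothetical names, not from the posts above) and a Neo4j version that supports `CALL { ... }` subqueries without `RETURN`. Each lookup scans every `User` node unless the indexes on `sellid` and `buyid` exist.

```cypher
// Create the new node, then link it to the users that are already stored.
CREATE (u:User {sellid: $sellid, buyid: $buyid, name: $name})
WITH u
CALL {
  WITH u
  // u plays the role of e in the batch query: e.buyid = f.sellid.
  MATCH (f:User {sellid: u.buyid})
  CREATE (u)-[:sell_prodcut_to]->(f)
}
CALL {
  WITH u
  // u plays the role of f; skip u itself, because the first
  // subquery already covers the case where u matches itself.
  MATCH (e:User {buyid: u.sellid})
  WHERE e <> u
  CREATE (e)-[:sell_prodcut_to]->(u)
}
RETURN count(*) AS done;
```

This only links nodes that are visible to the running transaction, so it should not be combined with `parallel: true`, where concurrent batches cannot see each other's uncommitted nodes.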

Peter_Lian · January 6, 2023, 6:07am

For the first question, I will run some tests and show you the results.

For the second, how can I create the relationship when adding the node?

The Cypher that I use to add the nodes is the following:

CALL{

CALL apoc.periodic.iterate('
CALL apoc.load.csv("user.csv") YIELD map RETURN map AS value','
WITH value
CREATE (u:User {sellid: value.sellid, buyid: value.buyid, name: value.name})',
{batchSize:10000, iterateList:true, parallel:true}) YIELD batches

}

Peter_Lian · January 6, 2023, 8:00am

@glilienfield It's strange that the index does not improve the performance but even worsens it.

I tried the following.

Case 1

CALL apoc.periodic.iterate(
  "MATCH (e:User) MATCH (f:User {sellid: e.buyid}) RETURN e, f",
  "CREATE (e)-[r:sell_prodcut_to]->(f)",
  {batchSize: 10000, parallel: true}) YIELD batch

Case 2

CALL apoc.periodic.iterate(
  "MATCH (e:User) MATCH (f:User {sellid: e.buyid}) USING INDEX f:User(sellid) RETURN e, f",
  "CREATE (e)-[r:sell_prodcut_to]->(f)",
  {batchSize: 10000, parallel: true}) YIELD batch

Case 1: 10 min

Case 2: 12 min

Why? (In both cases there are 0.1 billion nodes in total.)

By the way, case 3

CALL apoc.periodic.iterate(
  "MATCH (e:User) MATCH (f:User {sellid: e.buyid}) USING INDEX e:User(buyid) RETURN e, f",
  "CREATE (e)-[r:sell_prodcut_to]->(f)",
  {batchSize: 10000, parallel: true}) YIELD batch

fails and shows:

Failed to invoke procedure `apoc.periodic.iterate`: Caused by: org.neo4j.exceptions.SyntaxException: Cannot use index hint `USING INDEX e:User(buyid)` in this context: Must use label `User`, that the hint is referring to, on the node `e` either in the pattern or in supported predicates in `WHERE` (either directly or as part of a top-level `AND` or `OR`), but no label was found. Predicates must include the label literal `User`. That is, the function `labels()` is not compatible with indexes. Note that label `User` must be specified on a non-optional node

glilienfield · January 6, 2023, 10:15am

Did you run each case multiple times? What if you go back to your original query:

CALL apoc.periodic.iterate(
  "MATCH (e:User) MATCH (f:User) WHERE e.buyid = f.sellid RETURN e, f",
  "CREATE (e)-[r:sell_prodcut_to]->(f)",
  {batchSize: 10000, parallel: true}) YIELD batch

Is there any difference? It may be the same. I would have to look at the query plan, which I can't do on my phone.

The index hint error must be due to not having a predicate based on the index you hinted to use.
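
The query plan can be inspected by running the inner `MATCH` on its own with `EXPLAIN`, which shows the plan without executing the query. A minimal sketch, using the query from the first post:

```cypher
// EXPLAIN only plans the query; nothing is read or written.
EXPLAIN
MATCH (e:User), (f:User)
WHERE e.buyid = f.sellid
RETURN e, f;
```

Operators such as CartesianProduct, ValueHashJoin, or NodeIndexSeek in the resulting plan indicate how the pairs are being found, and comparing the plans of the three cases may explain the timing difference.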
