Hi everyone,
Here is my query for extracting the products for each customer:

match (p1:Customer)
with COLLECT(distinct p1.CustomerId) as ws
unwind ws as ws1
match (p:Customer {CustomerId: ws1})-[x:InteractsWith]->(pr:Product)<-[:HasProduct]-(c:Category)
with c.Category as Category, x.Date as date, p.CustomerId as Id
order by date desc
with collect(distinct Category)[..5] as category, Id
unwind category as w2
match (c:Category {Category: w2})-[:HasProduct]->(pr:Product)
return Id, w2, collect(pr.ProductId)[..3] as pid
There are 38,000 Customer nodes, 87,000 Product nodes, and 219 Category nodes.
The database is running with a max heap size of 15 GB and a page cache of about 4 GB.
When I change the second line of the query from "with COLLECT(distinct p1.CustomerId) as ws" to "with COLLECT(distinct p1.CustomerId)[..2500] as ws" (basically limiting it to 2,500 customers), the query takes around 4 minutes to execute. When I run it for all customers, I get a memory error.
I hope you can help me optimize the query.
Thank you
You're doing an unnecessary re-match to the :Customer nodes you already obtained. Just use p1, no need to rematch to p. This assumes CustomerId is unique per :Customer node (and if so you should have a uniqueness constraint on this).
Likewise, can we assume Category is unique per :Category node? I'm going with that assumption for my suggestion.
Also no need to collect and unwind there.
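In case those constraints aren't in place yet, here's a rough sketch of how they could be created (this uses the pre-4.4 constraint syntax and assumes the property names from your query):

CREATE CONSTRAINT ON (c:Customer) ASSERT c.CustomerId IS UNIQUE;
CREATE CONSTRAINT ON (c:Category) ASSERT c.Category IS UNIQUE;

Uniqueness constraints also create backing indexes, so lookups by those properties become index seeks rather than label scans.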
MATCH (p1:Customer)
WITH p1
LIMIT 2500
MATCH (p1)-[x:InteractsWith]->(pr:Product)<-[:HasProduct]-(c:Category)
WITH c, x.Date as date, p1
ORDER BY date DESC
WITH collect(distinct c)[..5] as categories, p1
WITH categories, p1.CustomerId as Id
UNWIND categories as c
MATCH (c)-[:HasProduct]->(pr:Product)
WITH Id, c, collect(pr)[..3] as products
RETURN Id, c.Category as w2, [product in products | product.ProductId] as pid
Thanks for sharing the query.
I tried executing the query you shared and the query I had shared, both with a limit of 2500. Here are my time logs.
Time logs for your query:
Start Execution: Sat May 16 02:10:30 2020
Stop Execution: Sat May 16 02:11:23 2020
Time taken = 53 seconds

Time logs for the query that I shared:
Start Execution: Sat May 16 02:13:11 2020
Stop Execution: Sat May 16 02:14:10 2020
Time taken = 61 seconds
Now if I set the limit to 5000, the elapsed time for the query was around 350 seconds, and then a connection error appeared and the database stopped.
I want to run the query for all 38,000 customers. Is that possible, or do I need to do it in a loop?
Hey @danny.oberoi
I have already set the initial and maximum heap size to 20 GB and 24 GB respectively, but it still doesn't run the query in one go for 38,000 customers. As of now I have executed it for 25,000 customers in one go, and it took ~7 minutes to complete.
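If you'd rather not keep pushing the heap higher, one option is to page through the Customer nodes from the client and run the same query once per batch. A rough sketch, assuming CustomerId works as a stable ordering key and that $offset and $batchSize are parameters the calling application supplies (those parameter names are just placeholders):

MATCH (p1:Customer)
WITH p1
ORDER BY p1.CustomerId
SKIP $offset LIMIT $batchSize
MATCH (p1)-[x:InteractsWith]->(pr:Product)<-[:HasProduct]-(c:Category)
WITH c, x.Date as date, p1
ORDER BY date DESC
WITH collect(distinct c)[..5] as categories, p1
WITH categories, p1.CustomerId as Id
UNWIND categories as c
MATCH (c)-[:HasProduct]->(pr:Product)
WITH Id, c, collect(pr)[..3] as products
RETURN Id, c.Category as w2, [product in products | product.ProductId] as pid

The calling code would loop, starting $offset at 0 and adding $batchSize (e.g. 2500) each iteration until no rows come back, so no single transaction has to hold results for all 38,000 customers at once.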