Query takes too long and fails after 2 hours of running

So I'm running the following query on 500k nodes:

load csv with headers from 'file:///df_keyword_urls.csv' as row
with collect({keyword: row.keyword, urls: row.urls}) as rows
unwind rows as k1
unwind rows as k2
with k1, k2
where not k1 = k2
with k1.keyword as key1, apoc.convert.fromJsonList(k1.urls) as k1urls,
     k2.keyword as key2, apoc.convert.fromJsonList(k2.urls) as k2urls
with key1, key2, apoc.coll.intersection(k1urls, k2urls) as urls
return key1 as keyword1, key2 as keyword2, urls as urls_intersections

and even with my 64 gigabytes of RAM I still can't complete the task; it fails after 3 hours of running. Any help optimizing this query would be appreciated. I'm still new to Cypher, so I'd love instructions to follow, or pointers on where to look and what to read to learn how to optimize it.

Hi @wadie.almouhtadi

You can add USING PERIODIC COMMIT to the first line, like this.
Adding it will speed up the process.
I think it will work with about 2 GB of heap memory, not 64 GB.

Neo4j Browser

:auto USING PERIODIC COMMIT
load csv with headers from 'file:///df_keyword_urls.csv' as row

Cypher Shell

USING PERIODIC COMMIT
load csv with headers from 'file:///df_keyword_urls.csv' as row

It doesn't work unless we modify our graph model; I tried it and it didn't work.

I don't think this is a good fit for a query.

Performing collect() here means you are materializing all rows of your CSV into heap memory at once. This means you cannot use periodic commit here, or rather that it won't do you any good because of how you're handling the entirety of your data all at once.

You're also doing an UNWIND twice, meaning that you are getting the cross product of every row of your CSV against every row (with N rows that's N² pairs; at ~500k rows that's on the order of 2.5 × 10^11), and this is also happening all at once in your heap memory.

A better, less memory-intensive approach would be to process the CSV and save the data to your graph (using something like :Keyword and :URL nodes, and ensuring you have indexes, or unique constraints as needed, on both). Avoid doing any collect() or UNWIND operations here (except maybe an UNWIND of your url lists so you can MERGE on the :URL nodes).
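For reference, the unique constraints (which also back the MERGE lookups with an index) might look like this; the constraint names are made up, and the exact syntax varies a little by Neo4j version (this is the 4.x form):

```cypher
// Unique constraint on Keyword.keyword (also creates an index)
CREATE CONSTRAINT keyword_unique IF NOT EXISTS
ON (k:Keyword) ASSERT k.keyword IS UNIQUE;

// Unique constraint on URL.url (also creates an index)
CREATE CONSTRAINT url_unique IF NOT EXISTS
ON (u:URL) ASSERT u.url IS UNIQUE;
```

Without these, every MERGE in the import below has to scan all existing nodes, which gets slower as the graph grows.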

That may look something like this (after you've created your indexes or unique constraints):

:auto using periodic commit
load csv with headers from 'file:///df_keyword_urls.csv' as row
MERGE (k:Keyword {keyword: row.keyword})
UNWIND apoc.convert.fromJsonList(row.urls) as url
MERGE (u:URL {url: url})
MERGE (k)-[:HAS_URL]->(u) // or use CREATE if no keyword is repeated, and if urls are unique per keyword

Once your data is saved into the graph, you can run a separate query that uses graph traversal to find related keywords and intersected urls, using subqueries to scope the work needed per keyword:

MATCH (k1:Keyword)
CALL {
  WITH k1
  MATCH (k1)-[:HAS_URL]->(url)<-[:HAS_URL]-(k2:Keyword)
  WHERE id(k1) < id(k2) // this prevents symmetric results
  WITH k2, collect(url) as urls
  RETURN k2.keyword as key2, [u IN urls | u.url] as urls_intersection
}
RETURN k1.keyword as key1, key2, urls_intersection
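If even the pair-at-a-time traversal produces too many rows for a single transaction, one option is to batch it with apoc.periodic.iterate and persist the overlap onto a relationship instead of returning it. This is only a sketch under assumptions: SHARES_URL is a made-up relationship type, and the batch size is a guess to tune:

```cypher
CALL apoc.periodic.iterate(
  // Outer statement: stream keywords one at a time
  "MATCH (k1:Keyword) RETURN k1",
  // Inner statement: find overlapping keywords for this k1 and store the shared urls
  "MATCH (k1)-[:HAS_URL]->(url)<-[:HAS_URL]-(k2:Keyword)
   WHERE id(k1) < id(k2)
   WITH k1, k2, collect(url.url) AS urls
   MERGE (k1)-[s:SHARES_URL]->(k2)
   SET s.urls = urls",
  {batchSize: 100, parallel: false}
)
```

Each batch commits separately, so the heap only ever holds the work for 100 keywords at a time, and you can query the SHARES_URL relationships afterwards at your leisure.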
match (k:Keyword)-[a:APPEARS_IN]-(u:Url) // MATCH nodes
with collect(distinct u.id) as urls, k.id as k
with collect({keyword: k, urls: urls}) as all_
unwind all_ as k1
unwind all_ as k2
with k1, k2
where not k1 = k2
with k1, k2, apoc.coll.intersection(k1.urls, k2.urls) as intersection
with k1, k2, apoc.coll.combinations(intersection, 3, size(intersection)) as url_combinations
unwind url_combinations as comb
with comb, size(comb) as size_urls, apoc.coll.toSet(collect(k1.keyword) + collect(k2.keyword)) as keywords
with comb, size_urls, keywords, size(keywords) as size_keywords
return comb as urls, size_urls, size_keywords, keywords
order by size_keywords desc

First, thank you so much for giving me your time and answering me, I really appreciate it, and thank you for your explanation.
So let me explain.

I'm trying to get all the keywords' urls.
After that I get the intersections between all the urls, to find all the pairs of keywords that share one or more urls.
Then I make all combinations of these intersections, because if k1 and k2 intersect in 5 urls, for example, they also intersect in every subset of 4, 3 and 2 of those common urls.
Then I group by those different combinations to make keyword clusters.
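As an illustration of that combinations step (a sketch with toy data, not your real urls): apoc.coll.combinations(coll, minSelect, maxSelect) enumerates all subsets between the two sizes, and the number of subsets grows roughly as 2^n in the list size, which is part of why this step gets expensive.

```cypher
// All subsets of size 2 up to size 3 of a 3-element list;
// yields the three 2-element pairs plus the full 3-element list
RETURN apoc.coll.combinations(['a', 'b', 'c'], 2, 3) AS combos
```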

I know my code is slow, and I know I'm not using the graph properties I should; I should use the relationships instead of hard-coding it this way. That's why I break it into smaller chunks, generating just the intersections and after that the combinations, but it's still so slow.

The problem is that your query isn't breaking down the work into smaller chunks. Your collect() operations require materializing all of that in memory at once, and the UNWINDs after don't help.

Too much is going on in memory, and the graphy approach is going to be more efficient both in performance and memory usage.

In my previous suggestion, the MATCH I performed effectively finds intersections one keyword at a time, finding all other keywords that have relationships to the same urls, and the subqueries break down the work per keyword.
