I would like to count, for each pair of websites, how many distinct users commented on both sites. I tried this:
Match (w1:Website)<-[:POSTED_IN]-(:Comment)<-[:POSTED]-(u:User)-[:POSTED]->(:Comment)-[:POSTED_IN]->(w2:Website)
RETURN w1.name, w2.name, Count(Distinct u) as weight
There are around 6000 Website nodes in my db, so around 36 million rows need to be returned, and this seems to cause the db to crash.
One workaround I tried was to do a single query for each Website in the db:
for w in websites:
q = '''Match (w1:Website {name:$w})<-[:POSTED_IN]-(:Comment)<-[:POSTED]-(u:User)-[:POSTED]->(:Comment)-[:POSTED_IN]->(w2:Website)
RETURN w1.name, w2.name, Count(Distinct u) as weight'''
tx.run(q, w=w)
This gets the job done, (I think), but is till very slow.
About how many comments does a user typically make per website? Unless the typical user only leaves a single comment per website, you may need to remodel this and your query for better performance.
If a user posts 20 comments on a website, in your query this will multiply out 20 times however many comments they posted on each other site, in terms of paths explored and rows generated that need to be processed with DISTINCT.
If you created and maintained these distinct paths: (:Website)<-[:POSTED_AT]-(:User), such that only one such relationship can exist between a user and a website, that should prevent those multiplicative path explorations and may increase your query efficiency.