How to find all combinations of two nodes that both have a relation to more than a certain amount of nodes

Thomas · March 27, 2019, 11:50am

Hi everyone,

I'm currently starting to learn how Cypher works and I got stuck trying to figure out a (for a beginner) more complex Cypher.

The database has nodes for Customers, Purchases, Products and Stores, with the following relations:
graph

There are a few millions purchases made by a few thousand people with about 20 thousand different products.

What I'm trying to do is find all combinations of two customers that have purchased at least 3 of the same products.
I have googled around a lot, but since I'm new to Cypher I can't find the right terms to search for to find similar queries that could help me out.

I did find this other topic that deals with the same kind of query, but the amount of nodes is a lot smaller and it has one less relationship. That query is about "Person -> Movie <- Person" with pairs, so the equivalent would be something like "Person -> Movie -> Genre <- Movie <- Person" and then finding the people that acted in at least 3 of the same genres.

I started with trying the following:

MATCH (c1:Customer)<-[:MADE_BY]-(p1:Purchase)-[:CONSISTS_OFF]->(product:Product)<-[:CONSISTS_OFF]-(p2:Purchase)-[:MADE_BY]->(c2:Customer)
WHERE id(c1) < id(c2)
WITH c1, c2, collect(product) as commonProducts
WHERE size(commonProducts) >= 3
RETURN c1, c2, commonProducts

but prepending it with EXPLAIN gives the following:

Which are a lot of rows (200 million+). I'm not seeing how I can optimize it and I couldn't find the right info to do so myself.

Could anyone help out or point me in the right direction so I can learn how to do this?

Thank you very much.

jennifer_reif · May 21, 2019, 4:26pm

Hi, Thomas! Sorry this reply is so late.

Do you have any indexes on node properties? If you have indexes on properties that you know you will query often, then it can return results faster. For instance, if you will query productName frequently, then it would help performance to index that node property.
Also, you don't need to compare the ids of the c1 node with the c2 node. The reason for this is that Cypher will not hit the same node twice within the same query pattern. To clarify, since you specified your entire pattern in your first MATCH (didn't break it into 2 matches), then Cypher will not loop over c1 after it has already touched that node in this pattern.
Finally, if you can filter down what you are passing and returning, then that will help the return. Part of the problem is that your query is returning entire nodes (visual representation), and JavaScript will only render so much data in the browser. Since you have a lot of data, the actual limitation of JS rendering might be the problem. Could you try returning property values instead? I've put a query below that has some dummy properties, since I'm not sure of your properties in your specific data model.

MATCH (c1:Customer)<-[:MADE_BY]-(p1:Purchase)-[:CONSISTS_OFF]->(product:Product)<-[:CONSISTS_OFF]-(p2:Purchase)-[:MADE_BY]->(c2:Customer)
WITH c1.customerName as c1Name, c2.customerName as c2Name, collect(product.productName) as commonProducts
WHERE size(commonProducts) >= 3
RETURN c1Name, c2Name, commonProducts

So, I think adding indexes, removing the id comparison, and reducing what JS has to render should help. Could you let me know if you see good improvement? If not, I can continue to research ways to further optimize it.

Cheers,
Jennifer

andrew_bowman · May 21, 2019, 8:42pm

This is a fairly tough query. For this one, you'll want to aggregate based on nodes, not their properties, save property projection until later.

Also you'll want to collect(DISTINCT product) to avoid counting the same product multiple times, for products that occurred in multiple purchases by the same person.

MATCH (c1:Customer)<-[:MADE_BY]-()-[:CONSISTS_OFF]->(product:Product)<-[:CONSISTS_OFF]-()-[:MADE_BY]->(c2:Customer)
WITH c1, c2, collect(DISTINCT product) as commonProducts
WHERE size(commonProducts) >= 3
RETURN c1.customerName as c1Name, c2.customerName as c2Name, [product in commonProducts | product.productName] as commonProducts

An alternate approach is to aggregate distinct products per customer, collect then unwind the results (such that we have a cross product of every customer with every other customer) then do an intersection of products bought and only keep those with 3 or more. I can't tell off hand whether this is more or less efficient, you'll have to try it out:

MATCH (c1:Customer)<-[:MADE_BY]-()-[:CONSISTS_OFF]->(product:Product)
WITH c1, collect(DISTINCT product) as products
WITH collect({cust:c1, products:products}) as all
UNWIND all as c1Data
UNWIND all as c2Data
WITH c1Data, c2Data
WHERE c1Data.cust <> c2Data.cust
WITH c1Data.cust as c1, c2Data.cust as c2, apoc.coll.intersection(c1Data.products, c2Data.products) as common
WHERE size(common) >= 3
RETURN c1.customerName as c1Name, c2.customerName as c2Name, common

mojojo7002 · June 23, 2020, 6:22pm

Hello All,
I am working on kind of similar database structure with Customers interacting with products. I used the same query as above and for 10 customers making all the combinations and getting the results of common product took me 40 sec.
The problem is I have 38000 distinct Customers and doing it for all will take me hours to generate results. Is it normal? Should it not take this much time. If so how can I optimize it? Note that I have already created index on CustomerID and ProductID.

Topic		Replies	Views
Cypher: Match nodes that share the exact same relationships Neo4j Graph Platform migrated	2	134	January 5, 2023
Query Slows down when nodes at double depth relationship are accessed Operations performance , cypher	9	1145	November 28, 2019
Cypher Query taking time Cypher	5	504	May 13, 2020
“Query Optimization” : Neo4j Query builds a cartesian product between disconnected patterns - Cypher performance , cypher , relationship	3	1835	October 5, 2020
Cypher pattern for getting self related nodes Cypher cypher	0	821	April 15, 2020

July Summer Fun!

How to find all combinations of two nodes that both have a relation to more than a certain amount of nodes

Related topics