How to find all combinations of two nodes that both have a relation to more than a certain amount of nodes

Hi everyone,

I'm currently starting to learn how Cypher works and I got stuck trying to figure out a (for a beginner) more complex Cypher.

The database has nodes for Customers, Purchases, Products and Stores, with the following relations:
graph

There are a few millions purchases made by a few thousand people with about 20 thousand different products.

What I'm trying to do is find all combinations of two customers that have purchased at least 3 of the same products.
I have googled around a lot, but since I'm new to Cypher I can't find the right terms to search for to find similar queries that could help me out.

I did find this other topic that deals with the same kind of query, but the amount of nodes is a lot smaller and it has one less relationship. That query is about "Person -> Movie <- Person" with pairs, so the equivalent would be something like "Person -> Movie -> Genre <- Movie <- Person" and then finding the people that acted in at least 3 of the same genres.

I started with trying the following:

MATCH (c1:Customer)<-[:MADE_BY]-(p1:Purchase)-[:CONSISTS_OFF]->(product:Product)<-[:CONSISTS_OFF]-(p2:Purchase)-[:MADE_BY]->(c2:Customer)
WHERE id(c1) < id(c2)
WITH c1, c2, collect(product) as commonProducts
WHERE size(commonProducts) >= 3
RETURN c1, c2, commonProducts

but prepending it with EXPLAIN gives the following:

Which are a lot of rows (200 million+). I'm not seeing how I can optimize it and I couldn't find the right info to do so myself.

Could anyone help out or point me in the right direction so I can learn how to do this?

Thank you very much.

Hi, Thomas! Sorry this reply is so late.

  1. Do you have any indexes on node properties? If you have indexes on properties that you know you will query often, then it can return results faster. For instance, if you will query productName frequently, then it would help performance to index that node property.

  2. Also, you don't need to compare the ids of the c1 node with the c2 node. The reason for this is that Cypher will not hit the same node twice within the same query pattern. To clarify, since you specified your entire pattern in your first MATCH (didn't break it into 2 matches), then Cypher will not loop over c1 after it has already touched that node in this pattern.

  3. Finally, if you can filter down what you are passing and returning, then that will help the return. Part of the problem is that your query is returning entire nodes (visual representation), and JavaScript will only render so much data in the browser. Since you have a lot of data, the actual limitation of JS rendering might be the problem. Could you try returning property values instead? I've put a query below that has some dummy properties, since I'm not sure of your properties in your specific data model.

MATCH (c1:Customer)<-[:MADE_BY]-(p1:Purchase)-[:CONSISTS_OFF]->(product:Product)<-[:CONSISTS_OFF]-(p2:Purchase)-[:MADE_BY]->(c2:Customer)
WITH c1.customerName as c1Name, c2.customerName as c2Name, collect(product.productName) as commonProducts
WHERE size(commonProducts) >= 3
RETURN c1Name, c2Name, commonProducts

So, I think adding indexes, removing the id comparison, and reducing what JS has to render should help. Could you let me know if you see good improvement? If not, I can continue to research ways to further optimize it. :slight_smile:

Cheers,
Jennifer

This is a fairly tough query. For this one, you'll want to aggregate based on nodes, not their properties, save property projection until later.

Also you'll want to collect(DISTINCT product) to avoid counting the same product multiple times, for products that occurred in multiple purchases by the same person.

MATCH (c1:Customer)<-[:MADE_BY]-()-[:CONSISTS_OFF]->(product:Product)<-[:CONSISTS_OFF]-()-[:MADE_BY]->(c2:Customer)
WITH c1, c2, collect(DISTINCT product) as commonProducts
WHERE size(commonProducts) >= 3
RETURN c1.customerName as c1Name, c2.customerName as c2Name, [product in commonProducts | product.productName] as commonProducts

An alternate approach is to aggregate distinct products per customer, collect then unwind the results (such that we have a cross product of every customer with every other customer) then do an intersection of products bought and only keep those with 3 or more. I can't tell off hand whether this is more or less efficient, you'll have to try it out:

MATCH (c1:Customer)<-[:MADE_BY]-()-[:CONSISTS_OFF]->(product:Product)
WITH c1, collect(DISTINCT product) as products
WITH collect({cust:c1, products:products}) as all
UNWIND all as c1Data
UNWIND all as c2Data
WITH c1Data, c2Data
WHERE c1Data.cust <> c2Data.cust
WITH c1Data.cust as c1, c2Data.cust as c2, apoc.coll.intersection(c1Data.products, c2Data.products) as common
WHERE size(common) >= 3
RETURN c1.customerName as c1Name, c2.customerName as c2Name, common
1 Like

Hello All,
I am working on kind of similar database structure with Customers interacting with products. I used the same query as above and for 10 customers making all the combinations and getting the results of common product took me 40 sec.
The problem is I have 38000 distinct Customers and doing it for all will take me hours to generate results. Is it normal? Should it not take this much time. If so how can I optimize it? Note that I have already created index on CustomerID and ProductID.