cancel
Showing results for 
Search instead for 
Did you mean: 

Join the community at Nodes 2022, our free virtual event on November 16 - 17.

Data model for creating product nodes linked by customers?

trent_fowler
Node Clone

I am trying to create a graph where nodes are products and they're linked by the ids of customer who've bought both products. The basic idea is contained in "Graph-Based Structures for the Market Baskets Analysis", and the key figure is here:

Eventually I want to assign a weight to the relationships based on how many customers bought two products, probably with a Jaccard algorithm in the data science library.

It wasn't hard to create a radial structure:

LOAD CSV WITH HEADERS FROM 'file:///BV-1350-Market_Basket_Input_Data.csv' AS line
WITH
	line.HUB_PRODUCT_KEY AS HUB_PRODUCT_KEY,
	line.HUB_CUSTOMER_KEY AS HUB_CUSTOMER_KEY

MERGE (p:Product {hub_product_key: HUB_PRODUCT_KEY})
MERGE (c:Customer {hub_product_key: HUB_PRODUCT_KEY,
                  hub_customer_key: HUB_CUSTOMER_KEY})

WITH p,c

MATCH (p),(c)
WHERE
    p.hub_product_key = c.hub_product_key

MERGE (p)-[:BOUGHT_BY]->(c)
RETURN p,c

And my hope was that I'd be able to create a subgraph consisting of products linked by customers who'd bought more than one product:

MATCH path=(p1:Product)-[:BOUGHT_BY]->(c1:Customer)<-[:BOUGHT_BY]-(p2:Product)
RETURN path

This returned no changes, no records, even though I've verified in Snowflake that there are customers who have bought more than one product.

I don't think this is the best approach anymore, but I include it for completeness in case it's helpful.

Then I thought I might create product nodes with a 'customer id' property containing a list of all the customers who've bought that product:

LOAD CSV WITH HEADERS FROM 'file:///BV-1350-Market_Basket_Input_Data.csv' AS line

MERGE (p:Product {hub_product_key: line.HUB_PRODUCT_KEY})
ON CREATE SET p.hub_customer_key = [line.hub_customer_key]
ON MATCH SET p.hub_customer_key = p.hub_customer_key + line.hub_customer_key

With this I would create some sort of relationship which looks at the list of customer ids for products and links them that way.

I've since discovered that Neo4j hates you trying to put more than one value in a property, and at any rate I keep getting Property values can only be of primitive types or arrays thereof errors.

This still seems like the best approach, but I could be wrong.

Finally, I tried loading product nodes in with the customer ids as properties, which I'd use to make links between products by matching on the customer ids, and this tanked the database so hard I have to completely reset it:

LOAD CSV WITH HEADERS FROM 'file:///BV-1350-Market_Basket_Input_Data.csv' AS line
WITH
	line.HUB_PRODUCT_KEY AS HUB_PRODUCT_KEY,
	line.HUB_CUSTOMER_KEY AS HUB_CUSTOMER_KEY

MERGE (p:Product {hub_product_key: HUB_PRODUCT_KEY,
                  hub_customer_key: HUB_CUSTOMER_KEY})

WITH p

MATCH (p1:Product), (p2:Product)
WHERE
    p1.hub_customer_key = p2.hub_customer_key AND
    NOT p1.hub_product_key = p2.hub_product_key

MERGE (p1)-[:CUSTOMER {customer_key:p1.hub_customer_key}]->(p2)
RETURN p1, p2

I'm not sure why, because when I inspect the product nodes they each contain one customer id each, so I don't know why there'd be so much draw it'd crash my local instance.

So I guess my question is: given what I've tried so far, how can I set up a graph of products that are connected by customers who've purchased both products such that I can then assign weights between the products.

2 REPLIES 2

trent_fowler
Node Clone

@alicia.frame1 Given your excellent webinars on graph algorithms this seems like something you'd know how to solve.

You might want to borrow the architecture I use for my retail demos with neo4j: GitHub - AliciaFrame/GDS_Retail_Demo: This repo contains the data and queries from the Neo4j Connect...



(you don't necessarily need the BOUGHT relationship, but I added it in to make running certain algorithms easier).

The general schema is: one node per customer, one node per product, and you can either have a transaction node in the middle ((:Customer)-[:MADE_TRANSACTION]->(:transaction)-[:CONTAINS]->(:Item)) or if you want a simpler schema, just (:Customer)-[:BOUGHT]->(:Item))

Once you've got your data in Neo4j, you can either modify the co-purchasing graph with cypher (adding a PURCHASED_SAME_ITEM relationship) or you can use the collapsePath algorithm to just modify the in-memory graph.

The cypher query is basically:

MATCH (c1:Customer)-[:BOUGHT]->(i:Item)<-[:BOUGHT]-(c2:Customer)
WITH c1, c2, count(i) as weight
MERGE (c1)-[:BOUGHT_SAME_ITEM{weight:weight}]->(c2)


Another approach you could take (there are a lot!) is to run node similarity to create SIMILAR_TO relationships (using Jaccard) between customers based on the number of items they purchased in common.