Finding similarity based on relationships and properties

Hello.
My goal:
Build a product recommendation system in which the choice of product to be offered to the user will be based on the user’s similarity with other users and the products they have.

To do this, I would like to (please correct me if there is a more sensible way):

  1. Check the user’s similarity with other users based on their properties (age, income, etc.)
  2. Based on the relationships between similar users and products, find products that may be of interest to the user for whom we are making the calculation.

I built a simple graph to explore how such a process could be built.
It consists of:

  • Users
  • Several cards, modeled as products
  • Several insurance policies, modeled as products

Data structure:

CREATE 
(maxDoe:AccountHolder { 
	name: "Max", 
	ekbId: "1001", 
	phone: "", 
	address: "Lisabon", 
	gender: "male", 
	creditLimit: 100000, 
	age: 30, 
	monthIncome: 20000 
}),
(yevheniiDoe:AccountHolder { 
	name: "Yevhenii", 
	ekbId: "1002", 
	phone: "", 
	address: "Lviv", 
	gender: "male", 
	creditLimit: 90000, 
	age: 31, 
	monthIncome: 5000 
}),
(vasiliyDoe:AccountHolder { 
	name: "Vasiliy", 
	ekbId: "1003", 
	phone: "", 
	address: "Lviv", 
	gender: "male", 
	creditLimit: 50000, 
	age: 31, 
	monthIncome: 3000 
}),
(ahmedDoe:AccountHolder { 
	name: "Ahmed", 
	ekbId: "1004", 
	phone: "", 
	address: "Dubai", 
	gender: "male", 
	creditLimit: 0, 
	age: 50, 
	monthIncome: 50000 
}),
(fredDoe:AccountHolder { 
	name: "Fred", 
	ekbId: "1005", 
	phone: "", 
	address: "Pokhara", 
	gender: "female", 
	creditLimit: 0, 
	age: 45, 
	monthIncome: 500 
}),
(insuranceProducts:InsuranceProduct {
	name: "Insurance Products", 
	activeOcagoAmount: 10, 
	activeKaskoAmount: 30, 
	activeHealth: 0, 
	totalActive: 40 
}),
(carInsurance:InsuranceCategory { 
	name: "Car Insurance", 
	categoryName: "carInsurance", 
	categoryId: 1, 
	data: "{}" 
}),
(osagoInsurance:InsuranceSubcategory{
	name: "OSAGO",
	categoryName: "carInsurance",
	categoryId: 1,
	subcategoryName: "osago",
	subcategoryId: 1,
	data: "{}"
}),
(greenCardInsurance:InsuranceSubcategory{
	name: "Green Card",
	categoryName: "carInsurance",
	categoryId: 1,
	subcategoryName: "greenCard",
	subcategoryId: 2,
	data: "{}"
}),
(kaskoInsurance:InsuranceSubcategory{
	name: "KASKO",
	categoryName: "carInsurance",
	categoryId: 1,
	subcategoryName: "kasko",
	subcategoryId: 3,
	data: "{}"
}),
(liveInsurance:InsuranceCategory{
	name: "Live Insurance",
	categoryName: "liveInsurance",
	categoryId: 1,
	data: "{}"
}),
(cardProducts:CardProduct{
	name: "Card Products",
	creditCardAmount: 0,
	debitCardAmount: 0
}),
(creditCard:Card{
	name: "Credit card",
	type: "credit",
	data: "{}"
}),
(debitCard:Card{
	name: "Debit card",
	type: "debit",
	data: "{}"
})


CREATE
	(osagoInsurance)-[:SUBCATEGORY_OF]->(carInsurance),
	(greenCardInsurance)-[:SUBCATEGORY_OF]->(carInsurance),	
	(kaskoInsurance)-[:SUBCATEGORY_OF]->(carInsurance),
	(carInsurance)-[:TYPE_OF]->(insuranceProducts),
	(liveInsurance)-[:TYPE_OF]->(insuranceProducts),
	(creditCard)-[:TYPE_OF]->(cardProducts),
	(debitCard)-[:TYPE_OF]->(cardProducts),
	(maxDoe)-[:OWNS{amount: 1}]->(kaskoInsurance),
	(maxDoe)-[:OWNS{amount: 1}]->(greenCardInsurance),
	(yevheniiDoe)-[:OWNS{amount: 1}]->(kaskoInsurance),
	(yevheniiDoe)-[:OWNS{amount: 1}]->(greenCardInsurance),
	(vasiliyDoe)-[:OWNS{amount: 1}]->(kaskoInsurance),
	(ahmedDoe)-[:OWNS{amount: 1}]->(osagoInsurance),
	(ahmedDoe)-[:OWNS{amount: 1}]->(greenCardInsurance),
	(fredDoe)-[:OWNS{amount: 1}]->(osagoInsurance),
	(maxDoe)-[:OWNS{amount: 1}]->(creditCard),
	(maxDoe)-[:OWNS{amount: 1}]->(debitCard),
	(yevheniiDoe)-[:OWNS{amount: 1}]->(debitCard),
	(vasiliyDoe)-[:OWNS{amount: 1}]->(creditCard),
	(ahmedDoe)-[:OWNS{amount: 1}]->(creditCard),
	(ahmedDoe)-[:OWNS{amount: 1}]->(debitCard),
	(fredDoe)-[:OWNS{amount: 1}]->(debitCard)

I started studying the "Node Similarity" algorithm, but so far I haven't been able to adapt its examples to my graph.

Perhaps you can recommend good articles or examples that implement a similar idea.

I would be grateful for any feedback.
Thank you.

The article defines a pairwise similarity metric (the Jaccard metric): the number of items two entities have in common divided by the total number of distinct items either entity is related to. This is a nice metric because it can be calculated with plain Cypher, so you don't need the GDS library. The benefit is that you can calculate the metric directly on your graph instead of on the projection the GDS library requires.

You can apply this metric to estimate the similarity between each pair of account holders based on the items they own. You can expand the metric to include other things the account holders have in common if you believe they are relevant to their similarity.

The following calculates the Jaccard metric between each pair of users as defined in the article:

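// Count the products shared by each pair of account holders
MATCH (n:AccountHolder)-[:OWNS]->(p)<-[:OWNS]-(m:AccountHolder)
WITH n, m, count(p) AS commonCount
// p is no longer in scope here, so each COUNT subquery counts every product the holder owns
WITH n, m, commonCount, COUNT { (n)-[:OWNS]->(p) } AS nCount, COUNT { (m)-[:OWNS]->(p) } AS mCount
// Jaccard similarity = shared / (nCount + mCount - shared)
RETURN n.name, m.name, commonCount, nCount, mCount, toFloat(commonCount) / toFloat(nCount + mCount - commonCount) AS similarity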
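Note that the query above returns every pair twice, once as (n, m) and once as (m, n). If one row per pair is enough, a minimal sketch like the following should work. It assumes that ekbId is unique per account holder, which the sample data suggests but does not guarantee, and uses it only to put each pair in a fixed order.

```cypher
// Assumption: ekbId is unique per account holder, so it can be used to order a pair.
MATCH (n:AccountHolder)-[:OWNS]->(p)<-[:OWNS]-(m:AccountHolder)
WHERE n.ekbId < m.ekbId   // keeps (Max, Yevhenii), drops (Yevhenii, Max)
WITH n, m, count(p) AS commonCount
WITH n, m, commonCount,
     COUNT { (n)-[:OWNS]->() } AS nCount,
     COUNT { (m)-[:OWNS]->() } AS mCount
RETURN n.name AS holder, m.name AS other,
       toFloat(commonCount) / (nCount + mCount - commonCount) AS similarity
ORDER BY similarity DESC
```

As a check by hand on the sample data: Max owns 4 products, Yevhenii owns 3, and they share 3, so their similarity is 3 / (4 + 3 - 3) = 0.75.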

As a very simple recommendation engine using this data, you could recommend products that similar account holders own and that the account holder does not already own. The following query does this for each account holder: it filters out the other account holders whose similarity score is 0.5 or less (you can use any threshold you find accurate) and then collects the products the similar account holders own that aren't owned by the account holder.

// Count the products shared by each pair of account holders
MATCH (n:AccountHolder)-[:OWNS]->(p)<-[:OWNS]-(m:AccountHolder)
WITH n, m, count(p) AS commonCount
// Collect the full product list of each account holder in the pair
WITH n, m, commonCount,
  COLLECT { MATCH (n)-[:OWNS]->(p) RETURN p } AS np,
  COLLECT { MATCH (m)-[:OWNS]->(p) RETURN p } AS mp
WITH n, m, commonCount, np, size(np) AS nCount, mp, size(mp) AS mCount
// Jaccard similarity = shared / (nCount + mCount - shared)
WITH n, m, np, mp, toFloat(commonCount) / toFloat(nCount + mCount - commonCount) AS similarity
// Products that m owns and n does not
WITH n, m, similarity, [i IN mp WHERE NOT i IN np] AS otherProducts
WHERE similarity > 0.5
UNWIND otherProducts AS recommendedProduct
WITH n, collect(DISTINCT recommendedProduct.name) AS recommendations
RETURN n.name AS name, recommendations

With the sample data, only three account holders have a similarity greater than 0.5 with another account holder. Each is returned with the similar account holder's products that they do not already own: Max (OSAGO), Yevhenii (Credit card), and Ahmed (KASKO).

This is an example approach. You may have something more complicated in mind. I did this mainly to demonstrate the techniques in Cypher.

Using properties to estimate similarity may be harder and less performant. You will need to define what it means for two account holders to be similar by age, e.g. they are similar if the difference in their ages is at most X years. You can do the same for all numeric properties, such as income. You could add these similarity estimates as constraints to what I did above. There are lots of possibilities here.
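As a rough sketch of what such constraints could look like, the pair-matching step can be restricted to account holders whose age and income are close. The tolerances below (5 years, 20000 in monthly income) are arbitrary values chosen for illustration, not recommendations.

```cypher
// Assumed tolerances for illustration only: 5 years of age, 20000 of monthly income.
MATCH (n:AccountHolder)-[:OWNS]->(p)<-[:OWNS]-(m:AccountHolder)
WHERE abs(n.age - m.age) <= 5
  AND abs(n.monthIncome - m.monthIncome) <= 20000
WITH n, m, count(p) AS commonCount
WITH n, m, commonCount,
     COUNT { (n)-[:OWNS]->() } AS nCount,
     COUNT { (m)-[:OWNS]->() } AS mCount
RETURN n.name AS holder, m.name AS other,
       toFloat(commonCount) / (nCount + mCount - commonCount) AS similarity
ORDER BY similarity DESC
```

Only pairs that pass both property checks get a product-based similarity score, so the properties act as a hard filter rather than as part of the score itself.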


Thanks, that works as expected.

I'm curious whether, instead of comparing node properties, I could create cohorts of account holders as separate nodes:

  • Cohort 1: Income from X to Y
  • Cohort 2: Income from Y to Z
  • Cohort 3: Age from N to M
  • Etc.

Then I would create relationships between the cohorts and the account holders and apply the same Jaccard calculation.

It seems to me that this would give me the similarity of account holders both in terms of the products they own and in terms of their properties, represented as cohorts.

That is exactly correct. The cohorts classify the user, and the cohorts can then be used in the Jaccard metric just as the products were in my example. You can combine the two by adding the cohort relationship type to the match statement, or by using a UNION query.
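A minimal sketch of the combined version might look like the following. The Cohort label, the IN_COHORT relationship type, the kind and lowerBound properties, and the 10-year age bands are all hypothetical names and values chosen for illustration. Run the two statements separately if your client does not accept multiple statements.

```cypher
// Step 1 (hypothetical model): put each account holder into a 10-year age band.
MATCH (a:AccountHolder)
WITH a, toInteger(a.age / 10) * 10 AS ageBand
MERGE (c:Cohort {kind: "age", lowerBound: ageBand})
MERGE (a)-[:IN_COHORT]->(c);

// Step 2: Jaccard similarity over owned products and cohorts together.
MATCH (n:AccountHolder)-[:OWNS|IN_COHORT]->(x)<-[:OWNS|IN_COHORT]-(m:AccountHolder)
WITH n, m, count(x) AS commonCount
WITH n, m, commonCount,
     COUNT { (n)-[:OWNS|IN_COHORT]->() } AS nCount,
     COUNT { (m)-[:OWNS|IN_COHORT]->() } AS mCount
RETURN n.name AS holder, m.name AS other,
       toFloat(commonCount) / (nCount + mCount - commonCount) AS similarity
ORDER BY similarity DESC;
```

Income cohorts can be added the same way with a different kind value. Each shared cohort then counts toward the similarity exactly like a shared product.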

The cohort approach will be much more performant than using the properties directly as I suggested.

Great idea.
