I am building a social network in which each user has a rating. I need to compare each user's rating with the ratings of all other users on the network. The rating is a vector of numbers, for example the array (1,3,2,5,6). Each user has such a vector of 5 values, and each value is from 0 to 9. If two users have completely identical vectors, the connection between them should be 100%. If only 4 values match, the connection is 80%; if nothing matches, it is 0%. In this way, every user has a connection with every other user.
Is Neo4j the database I need, or should I use another one? Please help me understand how I should store these relationships. The challenge is that I will need to filter by geolocation and select all users whose connection is 80% or 60% ...
Let's assume that the number of users is 10 million. Each user must have 10 million relationships.
Hi 89,
I am assuming that the "no match" case covers about 90% of pairs. You won't create a relationship for those, so each user would have only about 1 million relationships (does that feel better? ;-)
You don't need that many relationships: for each of the 5 dimensions you need 10 rating nodes, D1_0 to D1_9, ..., D5_0 to D5_9.
Each user is connected to 5 nodes (1 per dimension). Then you just need to calculate the Jaccard similarity between 2 users:
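MATCH (user1), (user2)
WHERE user1.geo = ... AND user2.geo = ...
// The Jaccard function compares two lists of numbers, so pass the ids of
// the rating nodes each user is connected to, not the user nodes themselves.
WITH user1, user2,
     gds.alpha.similarity.jaccard([(user1)-->(r) | id(r)], [(user2)-->(r) | id(r)]) AS similarity
RETURN user1.name, user2.name, similarity
ORDER BY similarity DESC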
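Note that Jaccard over the rating nodes is not the same number as the percentage in the question. With n dimensions and m matching values, two users share m of the 2n - m distinct rating nodes they touch, so the score is m / (2n - m). The ordering of pairs is the same, and the percentage can be recovered. A minimal Python sketch of that arithmetic (the function names are made up for illustration and are not part of Neo4j):

```python
def rating_nodes(vector):
    """Model a rating vector as the set of (dimension, value) nodes it links to."""
    return {(dim, value) for dim, value in enumerate(vector, start=1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def match_fraction_from_jaccard(j):
    """Invert J = m / (2n - m) to recover the share of matching values m / n."""
    return 2 * j / (1 + j)


u1 = [1, 3, 2, 5, 6]
u2 = [1, 3, 2, 5, 9]  # 4 of the 5 values match

j = jaccard(rating_nodes(u1), rating_nodes(u2))
print(j)                               # 4 shared nodes out of 6 distinct ones
print(match_fraction_from_jaccard(j))  # about 0.8, i.e. the 80% from the question
```

So a filter for "80% connection" would translate to a Jaccard threshold of 4/6 under this model.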
It's pretty funny to ask a forum of Neo4j lovers whether Neo4j is suitable: they'll tell you Neo4j is suitable even for opening a beer! :-)
But for a network of relationships, you will find your way in a graph database.
And for going "from no idea about graph databases to some usable results in a few hours", yes, Neo4j is suitable!
There will always be matches. I used a vector of size 5 as an example, but in fact the vector will have 200 values. The probability that at least one value matches is very high. Even if there is no match, I need to store the value 0.
I have two options: either do the vector calculations at query time, every time, or save the calculated values as relationships between users. The first option is not suitable for a large network.
Scaling should not be the problem, if you have enough resources.
The point is "design follows function", which in this case means:
will you really need to get every User-to-User score?
which of those scores will you query more than once?
You might need to evaluate the cost of "on demand" vs. "in stock" for both calculation and storage. Query time should stay at about the same level in both cases.
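To put a number on the "in stock" side, here is a rough Python sketch that counts how many scores would have to be stored if every pair of users is kept once (an assumption; storing both directions doubles it). The function name is hypothetical.

```python
def stored_pairs(users):
    """Number of distinct user-to-user pairs, each stored once."""
    return users * (users - 1) // 2


users = 10_000_000
print(f"{stored_pairs(users):.3e} pairwise scores")  # roughly 5e13
```

At that size, storing everything is mostly a resource question, which is why it helps to know which scores are actually queried more than once.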
For speed, you may want to use KNN instead of plain Jaccard similarity: see "K-Nearest Neighbors" in the Neo4j Graph Data Science documentation.
It's much faster (and parallelized), and uses an approximation technique to speed up the comparisons. It uses the Jaccard similarity score, but it's implemented in a way that should scale to large datasets.
The example deals with scalar values (age). Can this method be applied to vectors? I will have vectors of size 200, where each value is 0 or 1. For example, a vector of size 5 will look like (10010). Each user will have one vector of size 200, and I need to find the similarity of these vectors.
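For 0/1 vectors, the share of matching positions can also be computed cheaply outside the database, which matters if the scores are calculated on demand. The sketch below is one possible approach, not something from the Neo4j docs: it packs each vector into a single Python integer and counts differing bits with XOR. Function names are illustrative.

```python
def pack_bits(vector):
    """Pack a list of 0/1 values into one integer (first element = highest bit)."""
    packed = 0
    for bit in vector:
        if bit not in (0, 1):
            raise ValueError("vector must contain only 0 and 1")
        packed = (packed << 1) | bit
    return packed


def match_fraction(a, b, size):
    """Share of positions where two packed vectors of length `size` agree."""
    differing = bin(a ^ b).count("1")  # Hamming distance between the vectors
    return (size - differing) / size


# Two vectors of size 200 that differ in exactly 20 positions.
u1 = pack_bits([1, 0] * 100)
u2 = pack_bits([1, 0] * 90 + [0, 1] * 10)
print(match_fraction(u1, u2, 200))  # 0.9
```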
Yes - KNN will work with vectors.
I took an example from the documentation and inserted a vector instead of age.
CREATE (alice:Person {name: 'Alice', age: [1,1,0,0,1]})
CREATE (bob:Person {name: 'Bob', age: [1,0,0,1,1]})
CREATE (carol:Person {name: 'Carol', age: [1,0,1,0,1]})
CREATE (dave:Person {name: 'Dave', age: [0,1,0,1,0]})
CREATE (eve:Person {name: 'Eve', age: [1,1,1,1,1]});
I got the following results ...
Person1   Person2   similarity
"Alice"   "Carol"   1.0
"Bob"     "Alice"   1.0
"Carol"   "Bob"     1.0
"Dave"    "Carol"   0.5
"Eve"     "Bob"     0.3333333333333333
These results do not suit me. For example, take the first pair, Alice [1,1,0,0,1] and Carol [1,0,1,0,1]. I should have gotten a value of 0.6, since only three positions match (1-1, 1-0, 0-1, 0-0, 1-1). Perhaps this algorithm is not suitable for me? Can you suggest something for my case?
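The expected 0.6 comes from comparing the vectors position by position. A set-style Jaccard that only looks at which values occur cannot tell Alice and Carol apart, because both vectors contain just the values 0 and 1. The Python sketch below contrasts the two; whether this is exactly what the KNN run did is an assumption, and it does not reproduce every row of the table above.

```python
def positional_match(a, b):
    """Share of positions holding the same value in both vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def value_set_jaccard(a, b):
    """Jaccard over the sets of values, ignoring where each value sits."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


alice = [1, 1, 0, 0, 1]
carol = [1, 0, 1, 0, 1]
print(positional_match(alice, carol))   # 0.6
print(value_set_jaccard(alice, carol))  # 1.0
```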
Instead of 0, you can use -1, then the step will be 0.4
I checked out Cosine Similarity. It gives the correct result. Can this algorithm be used with KNN?
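The -1 suggestion and cosine similarity fit together: if every 0 is replaced by -1, the cosine of two vectors of length n equals (matches - mismatches) / n, so each additional match moves the score by 2/n (0.4 for n = 5), and (cosine + 1) / 2 gives the share of matching positions. A small Python sketch of that relationship, with illustrative function names:

```python
import math


def to_signed(vector):
    """Re-encode a 0/1 vector as -1/+1."""
    return [1 if bit else -1 for bit in vector]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


alice = to_signed([1, 1, 0, 0, 1])
carol = to_signed([1, 0, 1, 0, 1])

cos = cosine(alice, carol)  # (3 matches - 2 mismatches) / 5
print(cos)                  # about 0.2
print((cos + 1) / 2)        # about 0.6, the share of matching positions
```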