We are trying to implement a very simple content-based recommendation query for our database.
The database for now only includes the following nodes and relationships:
Nodes & properties:
(Course {course_id})
(Game {game_id})
(Coach {coach_id})
Relationships:
(:Course)-[:BELONGSTO]->(:Game)
(:Coach)-[:PUBLISHED]->(:Course)
Given a course_id, we want to traverse the graph and return 20 other courses to recommend to the user.
After researching, testing and evaluating the query profile (db hits, etc.) with different approaches, we arrived at the following query:
Query:
MATCH (c:Course {course_id: $course._id})
CALL apoc.path.subgraphNodes(c, {limit: 50, minLevel: 1, maxLevel: 2})
YIELD node
WITH node WHERE labels(node) = ["Course"]
RETURN node LIMIT 20
The limit: 50 passed to the APOC path expander is a hack to avoid traversing every possibility. We cannot set it to 20 at that point, because the limit counts every node type involved (Game, Coach and Course).
So the "problem" for us is that, done this way, subgraphNodes seems to try one path first, which in our case is always the one relating the course to its coach, (:Coach)-[:PUBLISHED]->(:Course), and therefore it returns only courses from that specific coach.
We never get to the point where it recommends courses from the same game but a different coach.
Why does subgraphNodes always use that path first? Is it related to the order of the ids?
We would also like to get roughly the same number of courses from each possible path, but we couldn't find a way to do that.
If anyone knows a better approach, we would much appreciate it. Thank you so much for your time.
Since the focus is on a particular game and associated courses and coaches:
MATCH (g:Game)
CALL apoc.path.subgraphNodes(g, {})
YIELD node
RETURN node
Result: as in the picture above.
In apoc.path.subgraphNodes you can add labelFilter and relationshipFilter:
MATCH (g:Game)
CALL apoc.path.subgraphNodes(g, {labelFilter:'Course'})
YIELD node
RETURN node
Our start point would have to be a specific course rather than the game, matched by its course_id property. I think this would be the best query for paths of at most two hops, considering that there will be more games, coaches and courses.
Try this:
MATCH (g:Course {course_id: "5ece57514bafe30004054b70"})
CALL apoc.path.subgraphNodes(g, {labelFilter:'Coach|Game|Course', maxLevel: 2, limit:20})
YIELD node
RETURN node
This should show all the courses connected to the Game node.
You can get the same results with this Cypher:
MATCH (c:Coach)-[]-(b:Course)-[]-(g:Game)
WHERE b.course_id = 40
MATCH (d:Coach)-[]-(e:Course)-[]-(g)
WHERE d.coach_id <> c.coach_id AND e.course_id <> b.course_id
RETURN c, b, g, d, e
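If the goal is for limit to count only courses rather than every node visited, one option is an end-node label filter. The sketch below relies on the '>' prefix in labelFilter, which marks Course as an end node: only Course nodes are returned, while the traversal still passes through Coach and Game nodes. The $course_id parameter is a placeholder name, and the query has not been run against this dataset.

```cypher
MATCH (c:Course {course_id: $course_id})
CALL apoc.path.subgraphNodes(c, {
  labelFilter: '>Course',  // return only Course nodes, but keep expanding through others
  minLevel: 1,             // exclude the starting course itself
  maxLevel: 2,
  limit: 20                // with an end-node filter, counts returned Course nodes
})
YIELD node
RETURN node
```

This removes the need for the limit: 50 workaround and the labels(node) check, but it still returns whichever courses the traversal happens to reach first.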
subgraphNodes() uses NODE_GLOBAL uniqueness during traversal, meaning that once a node has been visited, it will never be visited or returned via any other path.
So once courses have been visited via one path (from a specific coach), they won't ever be returned by any other path. The first one wins.
It also uses breadth-first expansion by default, so the closest nodes will be found before those at a longer distance away. You can use bfs:false in the config map to change that to a depth-first expansion, which may find nodes at a longer distance away (though this is not guaranteed).
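To get roughly the same number of courses from each path, one option is to skip the path expander and query each path separately, capping each one on its own. This is a minimal sketch, assuming Neo4j 4.1 or later (for CALL subqueries that import variables) and a placeholder $course_id parameter. The 10/10 split is an arbitrary choice.

```cypher
MATCH (c:Course {course_id: $course_id})
CALL {
  // Up to 10 other courses published by the same coach
  WITH c
  MATCH (c)<-[:PUBLISHED]-(:Coach)-[:PUBLISHED]->(rec:Course)
  WHERE rec <> c
  RETURN rec LIMIT 10
  UNION
  // Up to 10 other courses that belong to the same game
  WITH c
  MATCH (c)-[:BELONGSTO]->(:Game)<-[:BELONGSTO]-(rec:Course)
  WHERE rec <> c
  RETURN rec LIMIT 10
}
RETURN rec
LIMIT 20
```

Because UNION removes duplicates, a course that matches both paths appears only once, so the result can contain fewer than 20 courses.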