Hello,
I want to use Neo4j for user path analysis. I have events sent by users and want to find the users who have sent certain event paths. For example, I want to get the users who have sent Event1 -> Event2 -> Event4, in that order. Intermediate events do not matter: a user who has sent Event1 -> Event2 -> Event3 -> Event4 should also be matched by that query.
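To make the matching rule concrete, here is a minimal Python sketch of what I mean by "in order, gaps allowed". It is only an illustration of the semantics, not part of my Neo4j setup; the helper name contains_path is made up for this example, and it assumes a user's events are already sorted by time.

```python
from typing import Iterable, Sequence


def contains_path(events: Iterable[str], path: Sequence[str]) -> bool:
    """Return True if `path` occurs in `events` in order, with gaps allowed."""
    it = iter(events)
    # `step in it` advances the iterator until it finds `step`,
    # so each later step is searched only in the events that remain.
    return all(step in it for step in path)


# One user's events, already sorted by time (hypothetical data).
wanted = ["Event1", "Event2", "Event4"]
print(contains_path(["Event1", "Event2", "Event3", "Event4"], wanted))  # True
print(contains_path(["Event1", "Event4", "Event2"], wanted))  # False
```

The first call matches even though Event3 sits in between; the second does not, because Event4 comes before Event2.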
I have tried this model:
(User1)-[:SESSION {sessionId: 1, time: 100}]->(Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})
(User1)-[:SESSION {sessionId: 1, time: 200}]->(Event {name: 'B', deviceBrand: 'BrandA', osVersion: '1.0'})
(User1)-[:SESSION {sessionId: 1, time: 300}]->(Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})
(User1)-[:SESSION {sessionId: 1, time: 400}]->(Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})
(User2)-[:SESSION {sessionId: 2, time: 110}]->(Event {name: 'A', deviceBrand: 'BrandB', osVersion: '2.0'})
(User2)-[:SESSION {sessionId: 2, time: 210}]->(Event {name: 'E', deviceBrand: 'BrandB', osVersion: '2.0'})
(User2)-[:SESSION {sessionId: 2, time: 310}]->(Event {name: 'F', deviceBrand: 'BrandB', osVersion: '2.0'})
(User2)-[:SESSION {sessionId: 2, time: 410}]->(Event {name: 'D', deviceBrand: 'BrandB', osVersion: '2.0'})
(User3)-[:SESSION {sessionId: 3, time: 120}]->(Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})
(User3)-[:SESSION {sessionId: 3, time: 220}]->(Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})
(User3)-[:SESSION {sessionId: 3, time: 320}]->(Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})
And I queried it like this:
MATCH (u:User)-[s1:SESSION]->(e1:Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})
MATCH (u)-[s2:SESSION]->(e2:Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})
MATCH (u)-[s3:SESSION]->(e3:Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})
WHERE s1.sessionId = s2.sessionId AND s2.sessionId = s3.sessionId
AND s1.time < s2.time AND s2.time < s3.time
RETURN DISTINCT u.id
It works, but I am not sure it will perform well with millions of nodes. Is there a better way to model this, or is there any way to optimize my query?