Data Modeling For User Path Analysis

Hello,

I want to use Neo4j for user path analysis. I have events sent by users and want to query for users who have sent certain event paths. For example, I want to get users who have sent Event1 -> Event2 -> Event4 in that order. Intermediate events are not important: a user who has sent Event1 -> Event2 -> Event3 -> Event4 should also be included in the result.

I have tried this model:

(User1)-[:SESSION {sessionId: 1, time: 100}]->(Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})
(User1)-[:SESSION {sessionId: 1, time: 200}]->(Event {name: 'B', deviceBrand: 'BrandA', osVersion: '1.0'})
(User1)-[:SESSION {sessionId: 1, time: 300}]->(Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})
(User1)-[:SESSION {sessionId: 1, time: 400}]->(Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})

(User2)-[:SESSION {sessionId: 2, time: 110}]->(Event {name: 'A', deviceBrand: 'BrandB', osVersion: '2.0'})
(User2)-[:SESSION {sessionId: 2, time: 210}]->(Event {name: 'E', deviceBrand: 'BrandB', osVersion: '2.0'})
(User2)-[:SESSION {sessionId: 2, time: 310}]->(Event {name: 'F', deviceBrand: 'BrandB', osVersion: '2.0'})
(User2)-[:SESSION {sessionId: 2, time: 410}]->(Event {name: 'D', deviceBrand: 'BrandB', osVersion: '2.0'})

(User3)-[:SESSION {sessionId: 3, time: 120}]->(Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})
(User3)-[:SESSION {sessionId: 3, time: 220}]->(Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})
(User3)-[:SESSION {sessionId: 3, time: 320}]->(Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})

And queried it like this:

MATCH (u:User)-[s1:SESSION]->(e1:Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})
MATCH (u)-[s2:SESSION]->(e2:Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})
MATCH (u)-[s3:SESSION]->(e3:Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})
WHERE s1.sessionId = s2.sessionId AND s2.sessionId = s3.sessionId
  AND s1.time < s2.time AND s2.time < s3.time
RETURN DISTINCT u.id

It works fine, but I am not sure it will perform well with millions of nodes. Is there a better model for this purpose, or is there any way to optimize my query?

This looks like a variable-length path to me, which can be tricky, especially with larger data sets. It's good to set some boundary so that it doesn't trace every path pattern of any depth. Is there a limit you want to have where you look for a path that has maximum of (for instance) 5 hops with certain events along that path? Or are you wanting to find any and all paths from a user with any number of hops (difficult to make performant)?

Can you provide more details of what you are looking to achieve, to better help with the data modeling and query?

In the meantime, it looks like you are using sessionId to identify a group of session relationships. I would not recommend that, as I feel it will be hard to ensure the sessionIds are kept consistent. Instead, you may consider having a session node that has a unique id so you can search for specific sessions. The session node could be related to a user through a HAS_USER relationship. The session node could have the time the session was created. Each of the events for the session would be related to the session. Each event would have the time the event occurred.
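
A minimal sketch of that session-node model, in case it helps to see it written out. The Session label, the HAS_EVENT relationship type, the createdAt property, and the constraint name are assumptions made for illustration; only HAS_USER comes from the description above, and the constraint syntax assumes Neo4j 5.

```cypher
// Assumed names: Session, HAS_EVENT, createdAt, session_id_unique.
// One Session node per session id (Neo4j 5 constraint syntax).
CREATE CONSTRAINT session_id_unique IF NOT EXISTS
FOR (s:Session) REQUIRE s.id IS UNIQUE;

// A session for one user, with timestamped events attached to the session.
MERGE (u:User {id: 1})
MERGE (s:Session {id: 1})
  ON CREATE SET s.createdAt = 100
MERGE (s)-[:HAS_USER]->(u)
CREATE (s)-[:HAS_EVENT]->(:Event {name: 'A', time: 100, deviceBrand: 'BrandA', osVersion: '1.0'})
CREATE (s)-[:HAS_EVENT]->(:Event {name: 'C', time: 220, deviceBrand: 'BrandA', osVersion: '1.0'})
CREATE (s)-[:HAS_EVENT]->(:Event {name: 'D', time: 320, deviceBrand: 'BrandA', osVersion: '1.0'});

// Users with A, then C, then D inside a single session.
MATCH (u:User)<-[:HAS_USER]-(s:Session)
MATCH (s)-[:HAS_EVENT]->(e1:Event {name: 'A'}),
      (s)-[:HAS_EVENT]->(e2:Event {name: 'C'}),
      (s)-[:HAS_EVENT]->(e3:Event {name: 'D'})
WHERE e1.time < e2.time AND e2.time < e3.time
RETURN DISTINCT u.id;
```

With this shape the session is matched once through the Session node, so the sessionId equality checks between relationships are no longer needed.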

Another option is to have the events related to each other in a linked list by order of occurrence.
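
As a rough sketch of the linked-list idea: the NEXT relationship type, a sessionId property stored on each Event node, and the $sessionId parameter are all assumptions for illustration, and it presumes one Event node per occurrence rather than shared event nodes.

```cypher
// Assumed names: NEXT, Event.sessionId, $sessionId.
// Chain the events of one session in order of occurrence.
MATCH (e:Event {sessionId: $sessionId})
WITH e ORDER BY e.time
WITH collect(e) AS events
UNWIND range(0, size(events) - 2) AS i
WITH events[i] AS a, events[i + 1] AS b
MERGE (a)-[:NEXT]->(b);

// Sessions where A is followed by C and then D,
// with at most 4 NEXT hops between consecutive matched events.
MATCH (a:Event {name: 'A'})-[:NEXT*1..4]->(c:Event {name: 'C'})-[:NEXT*1..4]->(d:Event {name: 'D'})
RETURN DISTINCT a.sessionId;
```

The hop bound on NEXT is what keeps the traversal from following every chain to its end.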

That being said, I am not sure which, or if either, helps you achieve your goal. My understanding is you want to find all the users who have a specific sequence of events. Unfortunately I think you will need to interrogate every path of events (if events are linked) or every collection of events (if events are not linked) associated with each user to find those that match. I feel this will not scale.

> Is there a limit you want to have where you look for a path that has maximum of (for instance) 5 hops with certain events along that path?

I think it would be okay to limit based on the time elapsed between events. For example, I can set a limit of 24 hours between eventA and eventC, and 2 hours between eventC and eventD. I guess this would improve performance, since it reduces how far a path can extend. But I am not sure how to make Neo4j use a time attribute like a length limit. Would you have any idea about this?
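
For what it's worth, a time gap is a filter on property values rather than a path-length limit, so one way to express it is with WHERE predicates on the time property. Below is a minimal sketch against the relationship-property model from the first post. $maxGapAC and $maxGapCD are hypothetical parameters in the same unit as time, and whether this actually reduces the work done would need to be checked with PROFILE on real data.

```cypher
// $maxGapAC and $maxGapCD are assumed parameters (same unit as the time property).
MATCH (u:User)-[s1:SESSION]->(:Event {name: 'A'})
MATCH (u)-[s2:SESSION]->(:Event {name: 'C'})
WHERE s2.sessionId = s1.sessionId
  AND s2.time > s1.time
  AND s2.time - s1.time <= $maxGapAC
MATCH (u)-[s3:SESSION]->(:Event {name: 'D'})
WHERE s3.sessionId = s1.sessionId
  AND s3.time > s2.time
  AND s3.time - s2.time <= $maxGapCD
RETURN DISTINCT u.id;
```

If the sequence is allowed to span sessions, the two sessionId comparisons can be dropped.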

> Instead, you may consider having a session node that has a unique id so you can search for specific sessions.

I also thought about this and it may be a better approach as you said.

> Another option is to have the events related to each other in a linked list by order of occurrence. That being said, I am not sure which, or if either, helps you achieve your goal. My understanding is you want to find all the users who have a specific sequence of events.

Yes, I want to find all the users who have a specific sequence of events. There can be millions of users returned by this query or a few. I can put a time gap limit between 2 events to reduce the path it should scan.

This linked page from the documentation talks more about this, but I would also add sessionId as a property on your SESSION relationship. Then you could add a relationship index, which would help performance. Your new query might look something like this:

MATCH (u:User)-[:SESSION {sessionId: $sessionId}]->{1,4}
      (e1:Event {name: 'A', deviceBrand: 'BrandA', osVersion: '1.0'})-[:SESSION {sessionId: $sessionId}]->{1,4}
      (e2:Event {name: 'C', deviceBrand: 'BrandA', osVersion: '1.0'})-[:SESSION {sessionId: $sessionId}]->{1,4}
      (e3:Event {name: 'D', deviceBrand: 'BrandA', osVersion: '1.0'})
RETURN DISTINCT u.id

This would let you find a user path that goes Event1 -> Event2 -> Event4 (A -> C -> D in the query above), with each matched event at most 4 hops from the previous one, meaning no more than 3 other events in between any two consecutive matched events.
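
The relationship index mentioned above could be created roughly like this. The index name is made up for illustration, and the statement assumes a Neo4j version that supports relationship property indexes.

```cypher
// session_id_rel_index is an assumed name.
CREATE INDEX session_id_rel_index IF NOT EXISTS
FOR ()-[s:SESSION]-() ON (s.sessionId);
```

Running the path query with PROFILE before and after creating the index is a reasonable way to confirm whether the planner actually uses it.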