We have a service that processes a potential "match" between a user (a Profile) and a Project (a paid opportunity).
In the graph we create a relationship with a score property.
The number of relationships created to a :Project node can be 20k or more.
Some stats about the data:
- 50k projects
- 500k profiles (growing by ~1,000 a day)
Right now we have the following model, which we query like this:
MATCH (p:Project{id:""})<-[r:MATCHES]-(pm:ProjectMatch)
MATCH (profile:Profile)-[:MATCH_PROJECT]->(pm:ProjectMatch)
RETURN profile
order by r.score desc
r contains the score between the project and the profile
ProjectMatch is a node created per month and year for a specific profile, with properties such as:
- year: 2019
- month: 8
- profileId: ""
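Roughly, the write per match looks something like this (a simplified sketch, not our exact code; $profileId, $year, $month, $projectId and $score are placeholder parameters, and Profile is assumed to be keyed by an id property):
// One ProjectMatch node per profile per month/year, linked to the profile
// and to each matched project, with the score stored on the :MATCHES relationship
MERGE (profile:Profile {id: $profileId})
MERGE (pm:ProjectMatch {profileId: $profileId, year: $year, month: $month})
MERGE (profile)-[:MATCH_PROJECT]->(pm)
WITH pm
MATCH (p:Project {id: $projectId})
MERGE (pm)-[r:MATCHES]->(p)
SET r.score = $score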
We've experienced slow queries (e.g. getting all matches for a project ordered by score), which made us rethink the model and consider simplifying it by dropping the intermediate ProjectMatch node and putting the score on a direct relationship:
MATCH (p:Project {id: ""})<-[r:MATCHES]-(profile:Profile)
RETURN profile
ORDER BY r.score DESC
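If we go this route, the write per match would be something like (same placeholder parameters as above):
// Score lives directly on a Profile -> Project relationship
MATCH (profile:Profile {id: $profileId})
MATCH (p:Project {id: $projectId})
MERGE (profile)-[r:MATCHES]->(p)
SET r.score = $score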
I'm seeing the same (or very similar) number of db hits for the two models. Any advice?
Which data model is "better", or supposed to perform better? Is either scalable in the long run?
Queries we run:
- Get all profiles for a project, ordered by score desc
- Get the count of all matches
- Get all profiles ordered by score desc that haven't been sent an email yet (full query sketched below), i.e. filtered with:
WHERE NOT ((profile)-[:HAS_EMAILS]->(:Emails)-[:SENT]->(:Email{projectId: ""}))
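For example, under the simplified model that third query would look something like this ($projectId is a placeholder):
// Profiles matched to a project, best score first, excluding profiles
// that were already sent an email for this project
MATCH (p:Project {id: $projectId})<-[r:MATCHES]-(profile:Profile)
WHERE NOT (profile)-[:HAS_EMAILS]->(:Emails)-[:SENT]->(:Email {projectId: $projectId})
RETURN profile
ORDER BY r.score DESC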