Hi - New to Graph and would like to learn more about modeling and design.
How would you best model an employee to company relationship, where you have a Company entity and a Person entity?
Would it be better to have
MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE r.occupation = 'Janitor' RETURN n, r, m
or
MATCH (n:Person)-[r:JANITOR]->(m:Company) RETURN n, r, m
Is there a threshold at which there are too many relationship types between two nodes? Or is the database better optimized for relationship types than for properties on relationships?
In most cases, having more specific relationship types is preferable to using generic ones. However, it's (in most cases) an antipattern to encode instance identifiers into a relationship type.
The reason for this is performance. In your example you need to iterate over all EMPLOYEE relationships and load the properties of each one, which means two I/O accesses per relationship. If you can be selective on the relationship type instead of a property, you only need one I/O access.
On dense nodes the difference is even larger, since Neo4j maintains separate relationship chains for each relationship type.
The standard store format of Neo4j allows for 65k different relationship types.
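For illustration, here's a minimal sketch of the two modelings (the node names are just made up):

// Option A: generic relationship type, occupation stored as a relationship property
CREATE (:Person {name: 'Alice'})-[:EMPLOYEE {occupation: 'Janitor'}]->(:Company {name: 'Acme'})

// Option B: specific relationship type, no property needed to filter
CREATE (:Person {name: 'Bob'})-[:JANITOR]->(:Company {name: 'Acme'})

// With specific types you can still match several occupations in one query
MATCH (n:Person)-[r:JANITOR|ENGINEER]->(m:Company) RETURN n, type(r) AS occupation, m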
Sounds like a classic situation where I'd give up some readability and design clarity to gain performance. It makes sense from the I/O access perspective. Whether it is better practice to proliferate multiple relationship types versus using one relationship type with multiple properties is still a bit murky, but I'll try out both.
This discussion brought up another idea, though: whether having multiple entity types would be beneficial. To wit,
MATCH (n:Person)-[r:JANITOR]->(m:Company) RETURN n,r,m
or
MATCH (n:Janitor)-[r:EMPLOYEE]->(m:Company) RETURN n,r,m
and I exclude
3) MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE n.occupation = 'Janitor' RETURN n, r, m for similar reasons as above.
How do most people design their graph databases when trading off design against performance? Are the delays negligible initially, so it's really a matter of developer preference? How will each approach fare at scale?
Classic consulting answer: "it depends".
If you consider Janitor to be a subclass of Person, you might assign two labels to that node (p:Person:Janitor).
I assume that in your case a janitor is only a valid concept in the context of a company, so I'd go with alternative 1). But, as said, it depends on the domain and your understanding of it.
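A quick sketch of that two-label approach (names are illustrative only):

// Janitor as an extra label on the Person node
CREATE (p:Person:Janitor {name: 'Alice'})-[:EMPLOYEE]->(:Company {name: 'Acme'})

// Matching on the more specific label only touches Janitor nodes
MATCH (n:Janitor)-[r:EMPLOYEE]->(m:Company) RETURN n, r, m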
Another thing to consider is what I call "Lazy Conversations". Take the email data model that has been used many times as a graph example. We know we shouldn't model it as (user)-[emails]->(user); that's actually a pitfall of lazy speech. We know it's a much more extensible model to do (user)-[sends]->(email)-[to]->(user).
In your example, would occupation actually be another node: (user)-[has]->(occupation)-[employed at]->(company)? I would imagine a person could have more than one occupation/job role at a company, or at multiple companies concurrently. Then it's just a matter of writing Cypher optimized for the traversal that matches the pattern of data you're looking for, and you'll get the performance you expect from a graph database. A rough sketch follows below.
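Here's a rough sketch of that intermediate-node model (the HAS and EMPLOYED_AT names are just placeholders), with one person holding two roles at different companies:

CREATE (alice:Person {name: 'Alice'}),
       (acme:Company {name: 'Acme'}),
       (globex:Company {name: 'Globex'}),
       (alice)-[:HAS]->(:Occupation {type: 'Janitor'})-[:EMPLOYED_AT]->(acme),
       (alice)-[:HAS]->(:Occupation {type: 'Engineer'})-[:EMPLOYED_AT]->(globex)

// Everyone employed as a janitor, and where
MATCH (p:Person)-[:HAS]->(o:Occupation {type: 'Janitor'})-[:EMPLOYED_AT]->(c:Company)
RETURN p, o, c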
It seems there's another possibility: adding a new node.
Is
MATCH (o:Occupation {type: "Janitor"})<-[:IS]-(p:Person)-[:EMPLOYEE_OF]->(m:Company)
any better than
MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE n.occupation = 'Janitor' ?
I do like how this allows for multiple roles/occupations, as mentioned above, and the Cypher query is easier to understand.
I know that this is a late response, but it's odd to me that this answer doesn't refer to this excellent piece of documentation:
Quote:
I ran a query against each database 100 times and then took the 50th, 75th and 99th percentiles (times are in ms):
Using a generic relationship type and then filtering by end node property
50%ile: 6.0 75%ile: 6.0 99%ile: 402.60999999999825
Using a generic relationship type and then filtering by relationship property
50%ile: 21.0 75%ile: 22.0 99%ile: 504.85999999999785
Using a generic relationship type and then filtering by end node label
50%ile: 4.0 75%ile: 4.0 99%ile: 145.65999999999931
Using a specific relationship type
50%ile: 0.0 75%ile: 1.0 99%ile: 25.749999999999872
Worse: end node label (145.6)
99%ile: 145.65999999999931 Total database accesses: 70,001
(:Person)-[:HAS]->(:Attr:Eyes {colour: "blue"})
Pretty Bad: end node property (402.6)
99%ile: 402.60999999999825 Total database accesses: 140,001
(:Person)-[:HAS]->(:Attr {type: "eyes", colour: "blue"})
Very Bad: relationship property (504.8)
99%ile: 504.85999999999785 Total database accesses: 140,001
(:Person)-[:HAS {type: "eyes"}]->(:Attr {colour: "blue"})
NB: I refer to this often as I have been learning; I think it's a terrific summary! It's relatively old, though, and I wonder if these algorithms have been updated. It'd be nice to rerun these some time, ping @mark.needham
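If you want to see how these trade-offs behave on your own data, one rough way is to PROFILE the competing query shapes and compare the db hits reported in each plan, e.g.:

PROFILE MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE r.occupation = 'Janitor' RETURN count(*)

PROFILE MATCH (n:Person)-[:JANITOR]->(m:Company) RETURN count(*)

Run each statement separately and compare the total database accesses.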
It looks like
(:City)-[:TRANSLATION {code: 'lang.code'}]->(:CityTranslation) is the "Very Bad" performer (filtering on a relationship property)...
so
(:City)-[:TRANSLATION]->(:CityTranslation {code: 'lang.code'}) will perform better for me,
is that right?
I was thinking of using the lang.code as a dynamic relationship type, but then there could be too many languages... In our use case the user inputs data with a language code (we provide the list of language codes).
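A small sketch of that second shape, with the language code on the translation node (label and property names are taken from your snippet; the index name is made up and the syntax is for recent Neo4j versions):

CREATE (c:City {name: 'München'})-[:TRANSLATION]->(:CityTranslation {code: 'en', name: 'Munich'})

// An index on code keeps the lookup cheap as the number of languages grows
CREATE INDEX cityTranslationCode IF NOT EXISTS FOR (t:CityTranslation) ON (t.code)

MATCH (c:City {name: 'München'})-[:TRANSLATION]->(t:CityTranslation {code: 'en'})
RETURN t.name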