Is it better to have many different relationship types or one relationship with properties?


(Awu) #1

Hi - New to Graph and would like to learn more about modeling and design.

How would you best model an employee to company relationship, where you have a Company entity and a Person entity?

Would it be better to have

  1. MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE r.occupation = 'Janitor' RETURN n, r, m
  2. MATCH (n:Person)-[r:JANITOR]->(m:Company) RETURN n, r, m

Is there a threshold for which there are too many relationship types between two nodes? Or is the database better optimized for relationships versus properties on relationships?

Thanks in advance for your help.

(Stefan Armbruster) #2

In most cases, having more specific relationship types is preferable to using generic ones. However, it's (in most cases) an antipattern to encode instance identifiers into a relationship type.

The reason for this is performance. In your first query, Neo4j has to iterate over all EMPLOYEE relationships and load the properties of each one, which means two I/O accesses per relationship. If you can be selective on relationship type instead of on a property, you only need one I/O access.
On dense nodes the difference is even larger, since Neo4j maintains separate relationship chains for each relationship type.

The standard store format of Neo4j allows for 65k different relationship types.
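
To see the difference described above on your own data, one option is to prefix each variant with PROFILE and compare the reported db hits. This is only a sketch using the labels and relationship types from the original question; the actual numbers depend on your data and Neo4j version, and each statement is best run on its own.

```cypher
// Variant 1: generic relationship type, filtered on a relationship property.
PROFILE
MATCH (n:Person)-[r:EMPLOYEE]->(m:Company)
WHERE r.occupation = 'Janitor'
RETURN n, r, m;

// Variant 2: specific relationship type, no property lookup required.
PROFILE
MATCH (n:Person)-[r:JANITOR]->(m:Company)
RETURN n, r, m;
```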

(Awu) #3

Thanks @stefan.armbruster for the quick response!

Sounds like a classic situation where I'd give up readability and design to gain some performance improvements. It makes sense from the IO access perspective. Whether it is a better practice to proliferate with multiple relationship types versus one relationship with multiple properties is still a bit murky, but I'll try out both.

This discussion brought up another idea though, whether having multiple Entity types would be beneficial. To wit,

  1. MATCH (n:Person)-[r:JANITOR]->(m:Company) RETURN n,r,m
  2. MATCH (n:Janitor)-[r:EMPLOYEE]->(m:Company) RETURN n,r,m

and I exclude

  3. MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE n.occupation = 'Janitor' RETURN n,r,m

for similar reasons as above.

How do most people design their graph databases when trading off against performance? Are the delays negligible initially so it's really a matter of developer's preference? How will they fare at scale?

Thanks again.

(Stefan Armbruster) #4

The classic consulting answer: "it depends".
If you consider a janitor to be a subclass of person, you might assign two labels to that node: (p:Person:Janitor).
I assume that in your case janitor is only a valid concept in the context of a company, so I'd go with alternative 1). But, as I said, it depends on the domain and your understanding of it.
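
As a rough illustration of the two-label idea, the sketch below creates one node carrying both labels and then matches it through either one. The name properties and sample values are assumptions made up for the example.

```cypher
// Hypothetical sample data; the name properties are assumptions.
CREATE (p:Person:Janitor {name: 'Alex'})-[:EMPLOYEE]->(c:Company {name: 'Acme'});

// Matches only nodes that carry the Janitor label.
MATCH (n:Janitor)-[r:EMPLOYEE]->(m:Company)
RETURN n, r, m;

// Still matches the same node through its Person label.
MATCH (n:Person)-[r:EMPLOYEE]->(m:Company)
RETURN n, r, m;
```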

(Mike R Black) #5

Another thing to also consider is what I call "Lazy Conversations". Take the email data model example that has been used many times as a graph example. We know we don't do: (user)-[emails]->(user) but that's actually a pitfall of lazy speech. We know it's a much more extensible model to do: (user)-[sends]->(email)-[to]->(user).

In your example, would occupation actually be another node: (user)-[has]->(occupation)-[employed at]->(company)? I would imagine a person could have more than one occupation/job role at a company, or at multiple companies concurrently. Then it's just a matter of writing Cypher optimized for the traversal to match the pattern of data you're looking for, and you'll get the performance you expect from a graph db.
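
A minimal sketch of that intermediate-node pattern might look like the following. The Occupation label, the HAS and EMPLOYED_AT relationship types, and the title and name properties are all hypothetical names chosen for the example. It also assumes each Occupation node stands for one person's role at one company; a single shared 'Janitor' node would connect every janitor to every company.

```cypher
// Hypothetical sample data: one person holding one role at one company.
CREATE (p:Person {name: 'Alex'})
CREATE (c:Company {name: 'Acme'})
CREATE (o:Occupation {title: 'Janitor'})
CREATE (p)-[:HAS]->(o)-[:EMPLOYED_AT]->(c);

// Find everyone who works as a janitor, and where.
MATCH (p:Person)-[:HAS]->(o:Occupation {title: 'Janitor'})-[:EMPLOYED_AT]->(c:Company)
RETURN p.name, o.title, c.name;
```

A second job for the same person is then just another Occupation node attached to the same Person, pointing at the same or a different Company.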

(Awu) #6

This is great. Thank you.

It seems there's another possibility: adding a new node. Is

MATCH (o:occupation {type:"Janitor"})<-[:IS]-(p:Person)-[:EMPLOYEE_OF]->(m:Company)

any better than

MATCH (n:Person)-[r:EMPLOYEE]->(m:Company) WHERE n.occupation = 'Janitor'

?

I do like how this allows for multiple roles/occupations, as mentioned, and the Cypher query is easier to understand.