Neo4j Choice Values Design Decision

performance
etl

(Ande8440) #1

TLDR: For single- and multi-select choice fields in a generic ETL tool, should I model the values as nodes connected by relationships, or as labels applied directly to the parent/starting node? I realize the optimal modeling decision depends on the specific schema and the questions being asked, but I am trying to settle on a generic rule that is right in most cases, because my ETL tool won't have the necessary context about the data model it is transforming.

Detail:

I am thinking through an ETL tool/workflow I am building that pulls from an API source to create graphs on the fly. The API generally returns data structured as individual entries that belong to "lists" such as People, Company, City, etc., with individual attributes/properties also present on the entries.

I know that I want to create a node label for each list, but multiple list types have single- and multi-select choice fields on an individual entry. For example, the Company list has a company type field whose options might include institution, bank, school, investor, real estate company, etc., and a company can have one or many types.

Ideally I would want a generic rule that can handle all choice fields across all lists when writing to the Neo4j graph. (Another choice field is "Tier", for example, with Tier 1, Tier 2, Tier 3 and Tier 4 as potential values.)

Would it be best practice to

  1. create a node label for each choice field category (i.e. a node label called "CompanyType", with one node per company type whose "name" property equals the company type), and then link the Company node to its one or many CompanyType nodes via a "company_type" edge

or

  2. create a node label for each choice field option and apply that label to the Company node in question (so a Company node might then have 5-6 labels, such as Company:Tier1:Tier2:Investor:Bank)?
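
To make the two options concrete, here is a minimal sketch of the Cypher an ETL tool might emit for each. Everything named here is an assumption for illustration only: the helper names (`safe_label`, `choice_as_nodes`, `choice_as_labels`), the `id` property used to find the entry, and the `$entry_id` / `$values` parameters. The sketch interpolates labels into the query text, so choice values are normalized into label-safe tokens first.

```python
import re


def safe_label(value: str) -> str:
    """Turn a choice value such as 'real estate company' into a label-safe token."""
    token = re.sub(r"[^0-9A-Za-z]+", " ", value).title().replace(" ", "")
    # Prefix tokens that are empty or start with a digit (hypothetical convention).
    if not token or token[0].isdigit():
        token = "V" + token
    return token


def choice_as_nodes(list_label: str, field_label: str, rel_type: str) -> str:
    """Option 1: one node per choice value, linked to the entry by a relationship."""
    return (
        f"MATCH (e:`{list_label}` {{id: $entry_id}})\n"
        f"UNWIND $values AS value\n"
        f"MERGE (c:`{field_label}` {{name: value}})\n"
        f"MERGE (e)-[:`{rel_type}`]->(c)"
    )


def choice_as_labels(list_label: str, values: list) -> str:
    """Option 2: one extra label on the entry node per selected choice value."""
    if not values:
        raise ValueError("at least one choice value is required")
    labels = ":".join(f"`{safe_label(v)}`" for v in values)
    return f"MATCH (e:`{list_label}` {{id: $entry_id}})\nSET e:{labels}"


if __name__ == "__main__":
    print(choice_as_nodes("Company", "CompanyType", "company_type"))
    print(choice_as_labels("Company", ["Tier 1", "Investor", "Bank"]))
```

Note that option 1 yields the same query text for every entry of a list (only the parameters change), while option 2 yields different query text for each combination of selected values.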

(Bratanic Tomaz) #2

This is just my personal approach to this scenario, so it might not be perfect.

I had a conversation with a Neo4j developer at GraphConnect 2017 about how many labels a node should have. As far as I remember, he told me that the performance of label search drops if you have more than 5 or 6 labels per node. So as a general rule of thumb, I use fewer than 5 labels per node in my graph.

If you use the first option, where you create a category node and a relationship between an entity and the category node, you can keep track of history by saving the date of creation and the date of expiration as properties of the relationship. You can't do this with labels.
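
As a rough sketch of that history tracking, the snippet below plans the Cypher statements needed when an entry's choice values change between two ETL runs. The names are hypothetical and not from any real tool: `plan_updates`, the `COMPANY_TYPE` relationship type, the `created` / `expired` property names, and the `id` property on the entry. Dates are passed as ISO strings to keep the example simple.

```python
# Create a new relationship stamped with its creation date.
ASSIGN = (
    "MATCH (e:Company {id: $entry_id})\n"
    "MERGE (c:CompanyType {name: $value})\n"
    "CREATE (e)-[:COMPANY_TYPE {created: $today}]->(c)"
)

# Close the currently open relationship instead of deleting it.
EXPIRE = (
    "MATCH (:Company {id: $entry_id})-[r:COMPANY_TYPE]->(:CompanyType {name: $value})\n"
    "WHERE r.expired IS NULL\n"
    "SET r.expired = $today"
)


def plan_updates(entry_id, previous, current, today):
    """Return (query, params) pairs that move an entry from `previous` to `current` values."""
    plan = []
    # Values that appeared since the last run get a new, dated relationship.
    for value in sorted(set(current) - set(previous)):
        plan.append((ASSIGN, {"entry_id": entry_id, "value": value, "today": today}))
    # Values that disappeared keep their relationship but receive an expiry date.
    for value in sorted(set(previous) - set(current)):
        plan.append((EXPIRE, {"entry_id": entry_id, "value": value, "today": today}))
    return plan


if __name__ == "__main__":
    for query, params in plan_updates("c1", ["Bank"], ["Bank", "Investor"], "2024-01-01"):
        print(params)
        print(query)
```

Because expired relationships are kept, a query for the current types would need to filter on `r.expired IS NULL`.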

I mostly use multiple node labels when there are specific queries I want to optimize, for example:

(:Tier2)-[:HAS_CATEGORY]->(:Bank)

A second use case is when, for example, a TierX label is calculated daily using some sort of scoring that determines how high a tier the company gets. This in turn optimizes your further queries, because you do not have to calculate the company's score every time you want all Tier2 companies.
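
A minimal sketch of that daily relabeling step, assuming the tier has already been computed elsewhere (the scoring itself is not shown). The function name `relabel_tier_query`, the `id` property and the `$entry_id` parameter are assumptions for illustration.

```python
def relabel_tier_query(new_tier, all_tiers=(1, 2, 3, 4)):
    """Build Cypher that removes the other TierX labels and sets the new one."""
    if new_tier not in all_tiers:
        raise ValueError(f"unknown tier: {new_tier}")
    # Labels for every tier except the one the company now belongs to.
    stale = "".join(f":Tier{t}" for t in all_tiers if t != new_tier)
    return (
        "MATCH (c:Company {id: $entry_id})\n"
        f"REMOVE c{stale}\n"
        f"SET c:Tier{new_tier}"
    )


if __name__ == "__main__":
    print(relabel_tier_query(2))
```

Removing the other tier labels in the same statement keeps each company in exactly one tier, so a lookup on the Tier2 label returns only companies whose latest score placed them there.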