Defining a hierarchy relationship from multiple properties

Community,

Apologies first, I'm new to GraphDB and Cypher. I did research and try several attempts to resolve this issue.

I have a self-referencing hierarchy from a legacy SQL data source which contains levels 4 - 1 with lvl4 being highest and lvl1 being the lowest. They each contain id and name properties. Each record results in a distinct path.
lvl4[A]
lvl3 [A of B]
lvl2 [A of B of C]
lvl1 [A-D]
lvl2 [A of B of C]
lvl2 [A of B of C]
lvl2 [A of B]
lvl2 [A of B]
lvl1 [G]...

Data Example

dim_id lvl1_id lvl1_name lvl2_id lvl2_name lvl3_id lvl3_name lvl4_id lvl4_name
1000 428227 Customer Item Generic 300348 Customer Item 200692 Customer Enterprise 100255 Customer Services
1001 431077 Small Customer Item Latin America 300348 Customer Item 200692 Customer Enterprise 100255 Customer Services
1002 428138 Small Customer Item US 300348 Customer Item 200692 Customer Enterprise 100255 Customer Services
1003 431073 Customer Item Priority 300348 Customer Item 200692 Customer Enterprise 100255 Customer Services
1004 431078 Customer Item Latin America 300348 Customer Item 200692 Customer Enterprise 100255 Customer Services

Can someone help me with the cypher to create a hierarchy relationship for this data?
lvl1 to lvl2 relation called tg
lvl2 to lvl3 relation called pg
lvl3 to lvl4 relation called sg
-if necessary-
lvl4 to root relation called dg

  • Neo4j Desktop 1.0.20
  • No API or Driver necessary

In your example data levl4, levl3, and lvl2 are all the same for all rows. Is that true only for the example, or are those expected to change?

Do you know if the resulting data is meant to be a tree?

If two different rows share the same name and id at a certain level, can we be assured that all levels above are also in common? For example, this example data has all rows with lvl2, lvl3, and lvl4 in common for all rows. Would there ever be a situation where multiple rows have the same lvl2 or lvl3 but the levels above them would be different? I'm asking because I'm wondering if items with the same id are meant to refer to the same node in the graph (and thus id would be unique for nodes at that level) or if we could have duplicate nodes at a given level with the same id (the id property not being unique for nodes at that level) because the nodes above them in the hierarchy are different.

It would help if you could provide the labels (node types) you want to use in the graph as well, and maybe a quick visual of the overall hierarchy.

For example, something like this, but replacing the more generic labels here with something that is a bit more specific to your domain if possible:

(:lvl1)-[:tg]->(:lvl2)-[:pg]->(:lvl3)-[:sg]->(:lvl4)

In order to be brief I only included one roll up of a tree hierarchy. In actuality the dataset contains more than 10k unique paths. So there is a lot of duplication in the higher levels. Lvl4 (div) will continue 14 unique names duplicated across 10k records. With each level containing more unique names than its parent.
Lvl4 = Div
Lvl3 = Seg
Lvl2 = Pdg
Lvl1 = Ctg

Thanks for the help so far. I hope I explained this better this time.

Thanks for the labels, that will help, though the biggest thing that's missing is information on uniqueness of nodes per level (or per label, now).

With graph databases we tend to model in a normalized fashion, so for example at the top level, for :Div nodes, we would only have 14 nodes and not do any duplication, with relationships down to nodes at the next level.

The question then is if the ids here are meant to be unique across nodes at that level (nodes with that label), or if there's going to be duplication based upon differing nodes at the hierarchy above.

As an example, would we ever have a situation where we have two (or more) identical level 3 nodes, but with different nodes at level 4? Or a situation with identical level 2 nodes, but a difference at level 3 or 4?

Or put another way, is this meant to be a tree structure, where every node in the tree only has a single parent and each id is unique to a node at the same level? Or would we have duplicates of the same node at a given level, because they have a different hierarchy above them?

It is a tree structure. If you combined lvls 4+3+2+1 it's unique. Lvl1 is completely unique, each 1 has a single parent 2 who has a single parent 3 etc. Top down 4 is the smallest set of node names but it replicated the most across the dataset.

There is a unique code associated with each row the dim_id property. That is used to identify the "relationship" from other data. I modified my earlier example to bridge multiple segments, products, ps_groups. The tree example below should match the data grid where lvl4 repeats for all 5 rows, lvl3 repeats 2x, lvl2 has only 1 repeat, and lvl1 is all unique.

dim_id lvl1_id lvl1_name lvl2_id lvl2_name lvl3_id lvl3_name lvl4_id lvl4_name
row_id psg_id psg_name ps_id ps_name seg_id seg_name div_id div_name
1000 428227 Customer Item Generic 300348 Customer Item 1 200692 Customer Enterprise 1 100255 Customer Services
1001 431077 Small Customer Item Latin America 300348 Customer Item 1 200692 Customer Enterprise 1 100255 Customer Services
1002 428138 Small Customer Item US 300349 Customer Item 2 200692 Customer Enterprise 1 100255 Customer Services
1003 431073 Customer Item Priority 300349 Customer Item 3 200693 Customer Enterprise 2 100255 Customer Services
1004 431078 Customer Item Latin America 300350 Customer Item 4 200693 Customer Enterprise 2 100255 Customer Services
DIV (Customer Services)
    SEG (Customer Enterprise 1)
        PS (Customer Item 1)
           PSG(Customer Item Generic)
           PSG(Small Customer Item Latin America)
        PS (Customer Item 2)
           PSG(Small Customer Item US)    
    SEG (Customer Enterprise 2)
       PS (Customer Item 3)
           PSG(Customer Item Priority)
       PS (Customer Item 4)
           PSG(Customer Item Latin America)

Great, so now that we know this is a tree structure, and have labels for the nodes, and know that ids are unique per label, we can create the constraints we need, then do the import.

Using the node labels you provided, let's create the following constraints (run each separately)

CREATE CONSTRAINT ON (n:Div) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Seg) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Pdg) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Ctg) ASSERT n.id IS UNIQUE

That will prevent us from accidentally created duplicate nodes, and we get an index with each of those so lookup by any of these nodes by id should be fast.

Now for the actual import. Provided this is a CSV file (in the import directory of your Neo4j home directory) we can do the following:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'myImportFile.csv' as line
MERGE (c:Ctg {id: toInteger(line.lvl1_id)})
SET c.name = line.lvl1_name
MERGE (p:Pdg {id: toInteger(line.lvl2_id)})
ON CREATE SET p.name = line.lvl2_name
MERGE (s:Seg {id: toInteger(line.lvl3_id)})
ON CREATE SET s.name = line.lvl3_name
MERGE (d:Div {id: toInteger(line.lvl4_id)})
ON CREATE SET d.name = line.lvl4_name

MERGE (c)-[:tg]->(p)
MERGE (p)-[:pg]->(s)
MERGE (s)-[:sg]->(d)

You'll want to double-check that the labels and relationship types (and directions) are correct, but the output should be your full hierarchy tree.

Thank you for the help. The constraints and merge worked perfectly for what I needed. Sorry it's been a while but other work items took priority. I'm back in the GraphDB game again and this really helps with my understanding.