Missing point in loading relations of relations which produces a huge amount of data

sim-jac · December 21, 2021, 11:53am

We have a starting node (called VariantConfiguration) which has a list of variants. Each Variant has a list of PartListItems and have a list of AttributeOptions, which are defined by an AttributeDefinition and an Attribute. All we want is to load a filled up VariantConfiguration.

In our example there exists only one VariantConfiguration with 3 Variants. Every Variant have 1 PartListItem (overall: 3) and 3 AttributeOptions (overall: 9). So overall there are 3 different AttributeDefinitions and 6 different Attributes. If we load it by the following query there are 9 resulting rows:

MATCH 
(config:VariantConfiguration {id : "60d3ccb0-58ff-4d0e-a25c-917fed59ff0a"})-[r_vc_v:FOUND]->(variant:Variant)

MATCH 
(variant)<-[r_v_pli:PART_LIST_ITEM_REF]-(part_list_item:PartListItem) 

MATCH 
(variant)<-[r_v_a:ATTRIBUTE_OPTIONS]-(attribute_option:AttributeOption) 
, (attribute_option)-[r_ao_a:HAS_ATTRIBUTE]->(attribute:Attribute) 
, (attribute_option)-[r_ao_d:HAS_ATTRIBUTE_DEFINITION]->(attribute_definition:AttributeDefinition)
 
RETURN config, variant, part_list_item, attribute_option, attribute_definition

Seems ok with a small amount of data but if we have a VariantConfiguration with 148 variants and each have 17 AttributeOptions we will have a result set of 10171 records. The problem seems that a carthesian product has been created above PartListItem, Attribute und AttributeDefinition, so we understand that collect and WITH need to be added to the query (Which is surprisingly unreadable with a much more complex query):

MATCH 
(config:VariantConfiguration {id : "60d3ccb0-58ff-4d0e-a25c-917fed59ff0a"})-[r_vc_v:FOUND]->(variant:Variant)

MATCH 
(variant)<-[r_v_pli:PART_LIST_ITEM_REF]-(a_part_list_item:PartListItem)  
WITH 
collect(a_part_list_item) as part_list_item,
config, variant 

MATCH 
(variant)<-[r_v_a:ATTRIBUTE_OPTIONS]-(a_attribute_option:AttributeOption) 
, (a_attribute_option)-[r_ao_a:HAS_ATTRIBUTE]->(a_attribute:Attribute) 
, (a_attribute_option)-[r_ao_d:HAS_ATTRIBUTE_DEFINITION]->(a_attribute_definition:AttributeDefinition)  
WITH 
collect(a_attribute_definition) as attribute_definition,
collect(a_attribute_option) as attribute_option,
collect(a_attribute) as attribute,
part_list_item, config, variant
 
RETURN config, variant, part_list_item, attribute_option, attribute_definition

This query return 3 records for the VariantConfiguration with 3 variants and 148 records for the VariantConfiguration with 148 records. Fine so far! But OGM does not build our models and relations if we not return the relation variables. But if we do so the carthesian product will be created again, so also all realtions need to be wrapped with collection

So there might be one missing information or one point we do no unterstand. From our point of view the loading of some node with relation nodes should not be so much complicated and not so implicit. The real application will load some more nodes (overall 16) and relations, so we have to write a query which is at least almost 200 lines long just to load some nodes from the database - and almos unreadable. And another implicit fact is that ogm will generate queries, which does not include the collection notation, so already a call with a depth of two will result in a carthesan product with a huge amount of unnecessary data and in most cases with a broken application.

So. What point do we miss?

Thanks and best regards

SamChalvet · December 21, 2021, 3:15pm

Salut Simon,

I deal with hierarchical data a lot and there is no real silver bullet solution.
There are however some best practices that can be used when designing your graph schema that help when it is time to query it. Here is a link to a post I wrote on the subject.

In your case the best solution is to write your queries in the most legible way. Yes they may be a bit long, with several COLLECTS along the way, but it will be easier to follow and read.

First question: How are you ingesting the query results? I found it easiest to return them as a nested JSON payload.

Regardless here is how I would write the query:

I start with matching all the data I want at the top.
Then in several steps, due to the hierarchical nature of the data, I use WITH clauses to roll up the data.
Finally as I said, I nest all the data in a JSON object before I return it. It makes it easier to deal with on the code side.
AND if I wrote my query correctly I always end up with just one result. If there are more than one then I review the query.

Also I would not alias the relationships (or Nodes) unless you plan to return them.

In my experience this query makes the hierarchy a bit easier to follow in debug.

MATCH (config:VariantConfiguration {id : "60d3ccb0-58ff-4d0e-a25c-917fed59ff0a"})
OPTIONAL MATCH 
    (config)-[:FOUND]->(variant:Variant),
    (variant)<-[:PART_LIST_ITEM_REF]-(a_part_list_item:PartListItem),
    (variant)<-[:ATTRIBUTE_OPTIONS]-(a_attribute_option:AttributeOption),
    (a_attribute_option)-[:HAS_ATTRIBUTE]->(a_attribute:Attribute),
    (a_attribute_option)-[:HAS_ATTRIBUTE_DEFINITION]->(a_attribute_definition:AttributeDefinition) 

// First rollup
WITH 
    config, variant, a_part_list_item, a_attribute_option,
    collect(distinct properties(a_attribute)) as HAS_ATTRIBUTE,
    collect(distinct properties(a_attribute_definition)) as HAS_ATTRIBUTE_DEFINITION 

// Second rollup
WITH 
    config, variant,
    collect(distinct
        apoc.map.setLists(properties(a_attribute_option), 
                        ['HAS_ATTRIBUTE', 'HAS_ATTRIBUTE_DEFINITION'], 
                        [HAS_ATTRIBUTE, HAS_ATTRIBUTE_DEFINITION]
        )
    ) as ATTRIBUTE_OPTIONS,
    collect(distinct properties(a_part_list_item)) as PART_LIST_ITEM_REF

// Third rollup
WITH 
    config,
    collect(distinct
        apoc.map.setLists(properties(variant), 
                        ['ATTRIBUTE_OPTIONS', 'PART_LIST_ITEM_REF'], 
                        [ATTRIBUTE_OPTIONS, PART_LIST_ITEM_REF]
        )
    ) as FOUND

// final object
WITH
    apoc.map.setLists(properties(config), ['FOUND'], [FOUND]) as variant_configuration
 
RETURN apoc.convert.toJson(variant_configuration) as variant_configuration

I hope this helps.

sim-jac · January 11, 2022, 11:56am

Hy Sam,

happy new year and sry for my late reply!

We simply use OGM as GraphMapper, so it is automatically mapped. That is why we have discovered the large amounts of data only by chance.

Thanks for your example and your blog. Sounds really helpful! I will also share it internally.

But I am still confused why it is so complicated to "simply" load nodes with relations. From my point of view this is one of the core features of neo4j-ogm and neo4j itself (But I also thought cascades should be an core feature), but it seems that it is how it is and we have do deal with :). But, to be honest, I'm not sure if we will use a graph database in the future - if there is no missing point in our understanding of how to use neo4j/ogm and cypher.

Thanks and best regards
Simon

Topic		Replies	Views
Loading in millions of nodes Import / Export performance , cypher , import	0	333	February 18, 2022
Neo4j nested relations loading using spring data repository Spring Data Neo4j & Neo4j-OGM	1	843	January 28, 2020
Cartesian product while referencing the same relation with different constraints Cypher performance , cypher	2	400	March 30, 2020
SDN findAll performance and populating entities at depth > 1 Spring Data Neo4j & Neo4j-OGM sdn	12	2280	May 16, 2022
Batch loading CSV with creating special relationship type Cypher apoc , cypher , import , neo4j-desktop	3	636	July 6, 2023

Get Certified in June!

Missing point in loading relations of relations which produces a huge amount of data

Related topics