Optimize Number of Possible Paths Based on Value Derived from Reduce()

tie219 · September 26, 2022, 4:21pm

Hi all,

I have a graph which traces material relationships in a manufacturing process from raw material to finished good with semi-finished good steps in between. On each edge of the graph, there are unit factors with values that range from .0001 to 1000's. I am calculating an initial unit factor (Raw Material: Finished Good) by using the reduce function to multiply the values across the edges.

A raw material can have a relationship to thousands of finished goods. This calculation is painfully slow as you can imagine. I already took some steps to 'optimize' the query by removing edges with less than a .000001 value.

My thought is, I can further speed up the query but ignoring paths where the sum of the reduce function becomes less than .000001. You can imagine if a path has 3 edges in succession with .01, then you would arrive at an initial unit factor of .000001 which I wouldn't want in my end result. If I can flag these prior to Neo4j traversing the paths, then I 'think' I can save significant amounts of solve time.

I am a neo4j user of about 6 months and have a lot of to learn. Your help and expertise would be very appreciated,

sfg represents a list of materials I am tracing to an end finished good. Here is my code as it stands:

CALL {
WITH sfg
MATCH p = (input_sku:SKU{ID:sfg})-[r:GOES_INTO_B_UF|TRANSFERS_TO_B_UF*1..10]->(output_sku:SKU)
MATCH m = (mat:Material)-[:BASE_MATERIAL_OF]->(output_sku)
MATCH (output_sku)-[:MADE_AT]->(pl:Plant)
WHERE none(rel in relationships(p) where startNode(rel) = endNode(rel)) AND mat.MATERIAL_TYPE = 'FERT' AND output_sku.PLANT_ID = pl.PLANT_ID
RETURN input_sku.ID as COMPONENT_NAME,input_sku.UOM as COMPONENT_UOM,reduce(res = 1.00, r IN relationships(p) | res*tofloat(r.UF)) as SUM_UF,output_sku.ID AS OUTPUT_NAME,output_sku.UOM as OUTPUT_UOM
}
WITH COMPONENT_NAME,COMPONENT_UOM,SUM(SUM_UF) as UNIT_FC,OUTPUT_NAME,OUTPUT_UOM
WHERE UNIT_FC > .0000001
RETURN DISTINCT  COMPONENT_NAME,COMPONENT_UOM,UNIT_FC,OUTPUT_NAME,OUTPUT_UOM

glilienfield · September 26, 2022, 7:33pm

The first observation I have is with the use of the three sequential MATCH clauses. The result of this is the Cartesian product of the results from each match statement. This is not relevant if each of the match results in one row. Not knowing your data, I would look at implementing it differently. As I see it, the second and third MATCH statements are being used as a constraint, since you don't reference any of their variable in the later part of the query. If my interpretation is true, you can convert those to a predicate using the EXISTS clause. This eliminates the Cartesian product issue and would improve performance if the those two matches result in multiple rows. Also, the Cartesian product can create unintentional duplicate values in your result that requires using DISTINCT to eliminate them. The following query is a refactor of yours with this change.

CALL {
WITH sfg
MATCH p = (input_sku:SKU{ID:sfg})-[r:GOES_INTO_B_UF|TRANSFERS_TO_B_UF*1..10]->(output_sku:SKU)
WHERE none(rel in relationships(p) where startNode(rel) = endNode(rel)) 
AND EXISTS {
    MATCH (mat:Material)-[:BASE_MATERIAL_OF]->(output_sku)
    WHERE mat.MATERIAL_TYPE = 'FERT' 
}
AND EXISTS {
    MATCH (output_sku)-[:MADE_AT]->(pl:Plant)
    WHERE output_sku.PLANT_ID = pl.PLANT_ID
}
RETURN input_sku.ID as COMPONENT_NAME, input_sku.UOM as COMPONENT_UOM, reduce(res = 1.00, r IN relationships(p) | res*tofloat(r.UF)) as SUM_UF, output_sku.ID AS OUTPUT_NAME, output_sku.UOM as OUTPUT_UOM
}
WITH COMPONENT_NAME, COMPONENT_UOM, SUM(SUM_UF) as UNIT_FC,O UTPUT_NAME, OUTPUT_UOM
WHERE UNIT_FC > .0000001
RETURN DISTINCT COMPONENT_NAME, COMPONENT_UOM,UNIT_FC, OUTPUT_NAME, OUTPUT_UOM

My second observation is about the 'none' predicate. This will always be true unless you have self-referencing relationships, i.e. a relationship that relates a node to itself. Do you have any of these? If not, you should be able to safely remove the predicate in line 4.

Next, the CALL subquery is probably unnecessary in this cause. You should be able to remove it and change the RETURN clause in line 13 to a WITH and get the same results. Even so, it does not harm.

Next, the predicate 'output_sku.PLANT_ID = pl.PLANT_ID' looks like an inner join condition used in relational databases. I would figure that the existence of the :MADE_AT relationship is satisfactory.

Finally, you may be able to get rid of the DISTINCT after the refactoring out the sequential MATCH statements. That may been the cause for you to use it.

Topic		Replies	Views
Return Start and End Nodes in a path while using reduce function Neo4j Graph Platform migrated	1	166	June 16, 2022
Retrieve all the distinct nodes and relationships from list of paths Cypher performance , cypher , paths	2	1808	July 10, 2023
The only path Cypher	1	156	July 2, 2021
Match query with relationship is taking too long to retrieve results does it mean we need to upgrade Neo4j Graph Platform migrated	11	281	January 5, 2023
Filtering out specific relationships in a path / Aggregation of relationship properties Cypher cypher	5	971	June 22, 2023

Optimize Number of Possible Paths Based on Value Derived from Reduce()

Related topics