Optimizing a query issue

Hello,

I am trying to optimize a cypher query. The purpose of this query is given a node "source" we want to get the number of weakly connected component connected to another node "AL". For instance, if I have the following graph

I want to create relationships from the node source to target nodes connected to the same AL node at time T and I want to add a property on this relationship which correspond to the number of connected component attached at AL node. In this example, weights from source node to target nodes will be of 2 because there are 2 wcc (the source node in itself and the target nodes).

The difficulty is that I want to do this also for another type of node, that we can call "AF", and it is time consuming...

Here is my code: (for information, I use Neo4j v.4.2.1)


CALL gds.graph.create("similarite_proj",
	 {
        D : {properties: "ORDERDATE_TIME_INT"} 
    },
	["RELATIONSHIP"]);

CALL gds.beta.graph.create.subgraph(
        "similarite_4mois",
        "similarite_proj",
        "*",
        "*"
    );
	
	
	 
CALL apoc.periodic.iterate(
'MATCH (s:D)
WHERE NOT EXISTS (s.WCC) 
RETURN s',
'
SET s.WCC = 1 // allow us to know on which node we have been through
WITH s
CALL {
WITH s
WITH s, s.ORDERDATE_TIME_INT AS sDate, toInteger(apoc.date.parse(toString(s.ORDERDATE_TIME - duration({months:4})), "s", "yyyy-MM-dd\'T\'HH:mm:ss")) AS oldDate
WITH s, sDate + " >= n.ORDERDATE_TIME_INT AND n.ORDERDATE_TIME_INT >= " + oldDate as nodeFilter // Filter for a subgraph projection

// First delete in-memory graph
// Note that this will break if no graph with that name is in memory yet
CALL gds.graph.drop("similarite_4mois")
YIELD graphName

// Projection of the subgraph with the filter 
CALL gds.beta.graph.create.subgraph(
  "similarite_4mois",
  "similarite_proj",
  nodeFilter,
  "*"
)
YIELD nodeCount

// Create the wcc property on the in memory graph(projected graph)
CALL gds.wcc.mutate("similarite_4mois", {mutateProperty: "wcc"})
YIELD componentCount

// Get the wcc number of the s node
CALL gds.graph.streamNodeProperty("similarite_4mois", "wcc")
YIELD nodeId, propertyValue
WHERE nodeId = id(s)
WITH s, propertyValue AS component_id_s
RETURN  component_id_s
}

WITH s, component_id_s

// Compute # of component id connected to the same AL
CALL {
	  
  WITH s, component_id_s

  MATCH (s)-[:A_L]->(al:AdresseLivraison)<-[:A_L]-(targets_A_L:D)
  WHERE s.ORDERDATE_TIME>=targets_A_L.ORDERDATE_TIME>=s.ORDERDATE_TIME - duration({months:4}) AND NOT EXISTS((s)-[:RELATIONSHIP]->(targets_A_L)) AND id(s) <> id(targets_A_L)

  // Allow us to filter on the component id of the targets 
  CALL gds.graph.streamNodeProperty("similarite_4mois", "wcc")
  YIELD nodeId, propertyValue AS component_id_targets_A_L
  WHERE nodeId = id(targets_A_L) AND component_id_targets_A_L <> component_id_s


  WITH collect(DISTINCT targets_A_L) as targets_A_L_list, collect(DISTINCT component_id_targets_A_L) as component_id_targets_A_L
  RETURN targets_A_L_list, size(component_id_targets_A_L)+1 as nb_wcc_A_L//+1 because we take into account the s component
}
WITH s, component_id_s, targets_A_L_list, nb_wcc_A_L
CALL{
	WITH s, targets_A_L_list, nb_wcc_A_L
    WITH s, targets_A_L_list
    UNWIND targets_A_L_list as targets_A_L
    MERGE (s)-[r:SOFT_SIMILARITE_A_L]->(targets_A_L)
    SET r.NUM_WCC_AL = nb_wcc_A_L
}

// Compute # of component id connected to the same AF
WITH s, component_id_s
CALL {
	  
  WITH s, component_id_s

  MATCH (s)-[:A_F]->(af:AdresseFacturationClean)<-[:A_F]-(targets_af:D)
  WHERE s.ORDERDATE_TIME>=targets_af.ORDERDATE_TIME>=s.ORDERDATE_TIME - duration({months:4}) AND NOT EXISTS((s)-[:RELATIONSHIP]->(targets_af)) AND id(s) <> id(targets_af)

  // A_LLow us to filter on the component id of the targets 
  CALL gds.graph.streamNodeProperty("similarite_4mois", "wcc")
  YIELD nodeId, propertyVA_Lue AS component_id_targets_af
  WHERE nodeId = id(targets_af) AND component_id_targets_af <> component_id_s


  WITH collect(DISTINCT targets_af) as targets_af_list, collect(DISTINCT component_id_targets_af) as component_id_targets_af
  RETURN targets_af_list, size(component_id_targets_af)+1 as nb_wcc_af//+1 because we take into account the s component
}
WITH s, component_id_s, targets_af_list, nb_wcc_af
CALL{
	WITH s, targets_af_list, nb_wcc_af
        UNWIND targets_af_list as targets_af
	MERGE (s)-[r:SOFT_SIMILARITE_A_F]->(targets_af)
    	SET r.NOMBRE_WCC_AF = nb_wcc_af
}


RETURN COUNT(DISTINCT s.NUM_DOSSIER)
',
{batchSize:1, parallel:false})

Thanks for your help !

Speaking only to one very specific part of your query...

The way you are using apoc.periodic.iterate() right now is problematic. By returning node or relationship variables directly like this (you return node variable s) from the driving query to the batching query, you're effectively disabling the batching, forcing the changes to apply within a single large transaction.

Here's best practice to avoid this:

Thanks for your response @andrew_bowman.

The complexity in my code is mainly due to the CALL gds.graph.streamNodeProperty("similarite_4mois", "wcc") which is called two times and the fact that I have multiple CALL{}.

So the main issue in the code is this part:

//First CALL
// Compute # of component id connected to the same AL
CALL {
	  
  WITH s, component_id_s

  MATCH (s)-[:A_L]->(al:AdresseLivraison)<-[:A_L]-(targets_A_L:D)
  WHERE s.ORDERDATE_TIME>=targets_A_L.ORDERDATE_TIME>=s.ORDERDATE_TIME - duration({months:4}) AND NOT EXISTS((s)-[:RELATIONSHIP]->(targets_A_L)) AND id(s) <> id(targets_A_L)

  // Allow us to filter on the component id of the targets -> first stream
  CALL gds.graph.streamNodeProperty("similarite_4mois", "wcc")
  YIELD nodeId, propertyValue AS component_id_targets_A_L
  WHERE nodeId = id(targets_A_L) AND component_id_targets_A_L <> component_id_s


  WITH collect(DISTINCT targets_A_L) as targets_A_L_list, collect(DISTINCT component_id_targets_A_L) as component_id_targets_A_L
  RETURN targets_A_L_list, size(component_id_targets_A_L)+1 as nb_wcc_A_L//+1 because we take into account the s component
}
WITH s, component_id_s, targets_A_L_list, nb_wcc_A_L
CALL{
	WITH s, targets_A_L_list, nb_wcc_A_L
    WITH s, targets_A_L_list
    UNWIND targets_A_L_list as targets_A_L
    MERGE (s)-[r:SOFT_SIMILARITE_A_L]->(targets_A_L)
    SET r.NUM_WCC_AL = nb_wcc_A_L
}

// Compute # of component id connected to the same AF
WITH s, component_id_s
CALL {
	  
  WITH s, component_id_s

  MATCH (s)-[:A_F]->(af:AdresseFacturationClean)<-[:A_F]-(targets_af:D)
  WHERE s.ORDERDATE_TIME>=targets_af.ORDERDATE_TIME>=s.ORDERDATE_TIME - duration({months:4}) AND NOT EXISTS((s)-[:RELATIONSHIP]->(targets_af)) AND id(s) <> id(targets_af)

  // A_LLow us to filter on the component id of the targets  --> 2nd stream
  CALL gds.graph.streamNodeProperty("similarite_4mois", "wcc")
  YIELD nodeId, propertyVA_Lue AS component_id_targets_af
  WHERE nodeId = id(targets_af) AND component_id_targets_af <> component_id_s


  WITH collect(DISTINCT targets_af) as targets_af_list, collect(DISTINCT component_id_targets_af) as component_id_targets_af
  RETURN targets_af_list, size(component_id_targets_af)+1 as nb_wcc_af//+1 because we take into account the s component
}
WITH s, component_id_s, targets_af_list, nb_wcc_af
CALL{
	WITH s, targets_af_list, nb_wcc_af
        UNWIND targets_af_list as targets_af
	MERGE (s)-[r:SOFT_SIMILARITE_A_F]->(targets_af)
    	SET r.NOMBRE_WCC_AF = nb_wcc_af
}

I think, if I can optimize this part, the time complexity should drop drastically.
I'll try with your suggested solution and return the id(source) to see if it changes the running time.

Again, thank you !

I did not find a way to optimize this query...
If anyone have an idea on how to optimize it, I am all ears.

Thanks for the hand neo4j community !

I would suggest you clean up the query and remove anything that is not relevant to what you are asking ...

It is unclear if the nodes are directed or not, or of these targets are connected to multiple AL nodes.

So far you seem to want a depth/distance - you are returning "2", even if your question is asking for "number of connected target nodes" which is "1" - only one target is connected, and not "3" as the total number of target nodes?

So I assume your starting point is something like:

MATCH (root:AL {id:'your_id'})
WITH root
CALL {
    WITH root
    MATCH path = (root)-[*1..100]->(leaf:Target)
    WHERE NOT (leaf)-->()
    WITH path
    RETURN MAX(length(path)) AS localMaxDepth
}
RETURN localMaxDepth AS maxDepth

TBF - this looks like a "custom" thing that would be faster to fetch the graph (even if partials) and traverse programmatically to calculate (specially if you want to avoid cycles or count paths once.

Hello @joshcornejo,

First of all, thanks for the help.

Indeed, the edges between the nodes are directed because there is a notion of temporality between them. For instance, an edge from source to target (soure--> target) means that the target node has been inserted into the database before the source node.

Each node (:D) has a unique AL node, but an AL node can have multiple (:D) nodes.

What I want is : from a source node (s:D), count the number of weakly connected components connected to the same (:AL) node.

The code may look like :

MATCH (source:D)-[]->(:AL)<-[]-(targets:D)
WHERE source.DATE > targets.DATE
RETURN source.UNIQUE_ID, count(distinct targets)

The issue with this is that it count the number of node (:D) connected to the same (:AL) and this is not exactly what I want.
What I want is the number of weakly connected component connected to the same (:AL) node.

if you don't put an id ... that's what you'll get?

if you want a list organised by each source, something like this?

MATCH (source:D)-[]->(:AL)<-[]-(targets:D)
WHERE source.DATE > targets.DATE

WITH COLLECT (source) AS sources
  UNWIND sources AS startingPoint  
    MATCH (startingPoint)-[]->(:AL)<-[]-(targets:D)
      WHERE startingPoint.DATE > targets.DATE
    RETURN startingPoint.UNIQUE_ID, count(distinct targets)

Thanks again for the help @joshcornejo.

The code you gave me count the number of targets which have the same AL as the source node. It is pretty similar to my previous code.

It is not exactly what I want.

For instance, if among the target nodes there are one or more connected target nodes, thoses target nodes belong to the same connected component.
Hence I want those target nodes to count as a unique node.
In this example :

If we use all the relationships between target nodes (black and blue), the code will return 2 which correspond to the nodes connected to the same (:AL) node than the source.
Now, if we consider only the black relationships between target nodes, the code will also return 2.
However, what I want is :

  • if we use all the relationships between target nodes (black and blue), the code should return 1 because all the target nodes are in the same connected components.
  • f we consider only the black relationships between target nodes, the code should now return 2.

I hope it's clearer for you ...

It made it more confusing :smiley:

I got lost - if black and blue are considered = 1 ... but if only black = 2 ? Shouldn't it be the other way around?

Sorry to make it more confusing...

No it shouldn't be the other way around.

This is what I want:

  • if we use all the relationships between target nodes (black and blue), the code should return 1 because all the target nodes are in the same connected components.
  • if we consider only the black relationships between target nodes, the code should now return 2.

But if I use the code you gave me, in any case it returns 2.

I don't want to count the number of targets connected to the same (:AL) node than the source node.
I want to count the number of weakly connected components connected to the same (:AL) node than the source node.

If we remove the blue relationship, there are 2 connected components on the AL node (not counting the connected component of source node).
But if we consider the blue relationship, there is only one connected component.

I hope it's clearer now... It's not easy to explain, sorry ...

I've labeled the nodes ... can you tell me which ones you expect to return in your query?

I expect T1 and T2 to be considered as the same node when considering blue and black relationships.

I exêct T1 and T2 to be considered as different nodes chen considering blue and black relationships.

The node T3 is not really usefull in the example.

why would T1 & T2 be 'the same node' (when they are different?) in blue/black consideration.

... and I don't understand the 2nd statement - it is the same.

... if T3 is not useful, why is it in the diagram?

T1 and T2 will be considered as "the same node" because they are in the same connected component.

It was just as an illustrative purpose that T3 was here.

OK ... unfortunately I still don't get it.

What I interpret here is that for each D node represented by “s”, you are filtering your graph based on its order date to create a new subgraph, which you are running the wcc algorithm on. This I believe is the cause of the bottleneck, as the wcc algorithm is running over more than just the nodes connected to “s”, since other D nodes outside those connected to “s” will meet your Order date filter, AND you are running the wcc algorithm for each “s” node. Is it necessary to create the subgraphs based on the filtering criteria, or can you use the wcc results for the whole graph for every “s” node? If no, is it because you want to consider only those nodes throughout the entire graph whose orders occurred in the time window of the “s” node? It would be more efficient if you could identify all the nodes connected to the “s” node’s AL node and use the collection of their ids to incorporate into your filter, so you aren’t performing wcc computation for totally unrelated nodes to “s”. Since you stated each target node is associated with just one AL node, are the target nodes attached the the “s node through the AL node, all clustered together on a subgraph that doesn’t extend out of those group? If yes maybe a cypher variable length pattern can find the collection of target nodes.