Slow Cypher Query for combination GDS calls

Hello

I'm working with a graph database that represents public transport and I'm trying to calculate certain shortest paths through it. The following query takes 14 seconds to complete and I'd love any help on speeding it up:

MATCH (schulort:Haltestelle { BPUIC: 8502202 })
CALL gds.alpha.shortestPath.deltaStepping.stream({
  nodeProjection: 'Haltestelle',
  relationshipProjection: {
    FAHRT_NACH: {
    type: 'FAHRT_NACH',
    properties: 'Reisedauer'
    }
  },
  startNode: schulort,
  relationshipWeightProperty: 'Reisedauer',
  delta: 3.0
})
YIELD nodeId AS zielId, distance AS Reisedauer
WHERE gds.util.isFinite(Reisedauer)
WITH schulort, Reisedauer, gds.util.asNode(zielId) AS ziel
CALL gds.alpha.shortestPath.write({
  nodeProjection: 'Haltestelle',
  relationshipProjection: {
    FAHRT_NACH: {
      type: 'FAHRT_NACH',
      properties: 'Ausfälle'
    }
  },
  startNode: schulort,
  endNode: ziel,
  relationshipWeightProperty: 'Ausfälle'
})
YIELD nodeCount AS tempCount, totalCost AS Ausfälle
CALL gds.alpha.shortestPath.write({
  nodeProjection: 'Haltestelle',
  relationshipProjection: {
    FAHRT_NACH: {
      type: 'FAHRT_NACH',
      properties: 'Verspätung'
    }
  },
  startNode: schulort,
  endNode: ziel,
  relationshipWeightProperty: 'Verspätung'
})
YIELD nodeCount AS tempCount2, totalCost AS Verspätung
MATCH (ziel:Haltestelle)-[:BEFINDET_SICH_IN]->(o:Ortschaft)-[:GEHOERT_ZU]->(k:Kanton)
RETURN
  k.Kürzel AS Kanton,
  o.Name AS Ort,
  k.Mietpreis AS Mietpreis,
  k.Steuerfuss AS Steuerfuss,
  k.Einbrüche AS Einbrüche,
  min(Reisedauer) AS Reisedauer,
  min(Ausfälle) AS Ausfälle, 
  min(Verspätung) AS Verspätung

The execution plan is the following:

One issue I see is that the anchoring query doesn't use an index. But I'm unsure why the rest takes so long. Some help would be appreciated.

Hey,

The first thing that I'd try is to create a named graph, rather than using an anonymous one in each query. If you use an anonymous graph it means that GDS is loading the same data from Neo4j into memory 3 times (one time per procedure).

CALL gds.graph.create(
    'my-graph',
    'Haltestelle',
    {
    FAHRT_NACH: {
      type: 'FAHRT_NACH',
      properties: ['Verspätung', 'Ausfälle', 'Reisedauer']
    }
  }
)

And then you can use that graph in the query:

MATCH (schulort:Haltestelle { BPUIC: 8502202 })
CALL gds.alpha.shortestPath.deltaStepping.stream("my-graph", {
  startNode: schulort,
  relationshipWeightProperty: 'Reisedauer',
  delta: 3.0
})
YIELD nodeId AS zielId, distance AS Reisedauer
WHERE gds.util.isFinite(Reisedauer)
WITH schulort, Reisedauer, gds.util.asNode(zielId) AS ziel
CALL gds.alpha.shortestPath.write("my-graph", {
  startNode: schulort,
  endNode: ziel,
  relationshipWeightProperty: 'Ausfälle'
})
YIELD nodeCount AS tempCount, totalCost AS Ausfälle
CALL gds.alpha.shortestPath.write("my-graph", {
  startNode: schulort,
  endNode: ziel,
  relationshipWeightProperty: 'Verspätung'
})
YIELD nodeCount AS tempCount2, totalCost AS Verspätung
MATCH (ziel:Haltestelle)-[:BEFINDET_SICH_IN]->(o:Ortschaft)-[:GEHOERT_ZU]->(k:Kanton)
RETURN
  k.Kürzel AS Kanton,
  o.Name AS Ort,
  k.Mietpreis AS Mietpreis,
  k.Steuerfuss AS Steuerfuss,
  k.Einbrüche AS Einbrüche,
  min(Reisedauer) AS Reisedauer,
  min(Ausfälle) AS Ausfälle, 
  min(Verspätung) AS Verspätung

Hopefully, that will reduce the query time, but let me know if not.

Thank you. The first time the fixed query is run it still takes about 12 seconds. The second time it only takes about 4 seconds. The execution plan still looks the same though with the same amount of db hits. Probably because the actual work is hidden inside those gds library calls? Is there a way of quantifying the costs of those java library calls?

I have since written another query for retrieving the same data but written with UNIONs. This query runs in 300ms but instead of 14'000 db hits it has 35'600 db hits. I find the first query to be simpler to grasp though. A few questions popped up in my mind:

  • Why is the second query that much faster? Is it because it can be run in parallel and only uses one GDS library call per graph instead of one call per row?
  • The second query uses the PIPELINED runtime, the first one SLOTTED. Is there a list of operators that are not supported with the PIPELINED runtime somewhere? I couldn't find any online

Here's the second query:

CALL {
    MATCH (schulort:Haltestelle { BPUIC: 8502202 }) // Hole Rotkreuz als Haltestelle.
    CALL gds.alpha.shortestPath.deltaStepping.stream({ // Berechne kürzeste Dauer von Rotkreuz zu allen Haltestellen der Schweiz.
    nodeProjection: 'Haltestelle',
    relationshipProjection: {
        FAHRT_NACH: {
        type: 'FAHRT_NACH',
        properties: 'Reisedauer'
        }
    },
    startNode: schulort,
    relationshipWeightProperty: 'Reisedauer',
    delta: 3.0
    })
    YIELD nodeId AS zielId, distance AS Value
    WHERE gds.util.isFinite(Value)
    RETURN zielId, Value, 'Reisedauer' AS Typ
    UNION
    MATCH (schulort:Haltestelle { BPUIC: 8502202 }) // Hole Rotkreuz als Haltestelle.
    CALL gds.alpha.shortestPath.deltaStepping.stream({ // Berechne kürzeste Dauer von Rotkreuz zu allen Haltestellen der Schweiz.
    nodeProjection: 'Haltestelle',
    relationshipProjection: {
        FAHRT_NACH: {
        type: 'FAHRT_NACH',
        properties: 'Verspätung'
        }
    },
    startNode: schulort,
    relationshipWeightProperty: 'Verspätung',
    delta: 3.0
    })
    YIELD nodeId AS zielId, distance AS Value
    WHERE gds.util.isFinite(Value)
    RETURN zielId, Value, 'Verspätung' AS Typ
    UNION
    MATCH (schulort:Haltestelle { BPUIC: 8502202 }) // Hole Rotkreuz als Haltestelle.
    CALL gds.alpha.shortestPath.deltaStepping.stream({ // Berechne kürzeste Dauer von Rotkreuz zu allen Haltestellen der Schweiz.
    nodeProjection: 'Haltestelle',
    relationshipProjection: {
        FAHRT_NACH: {
        type: 'FAHRT_NACH',
        properties: 'Ausfälle'
        }
    },
    startNode: schulort,
    relationshipWeightProperty: 'Ausfälle',
    delta: 3.0
    })
    YIELD nodeId AS zielId, distance AS Value
    WHERE gds.util.isFinite(Value)
    RETURN zielId, Value, 'Ausfälle' AS Typ
}
WITH gds.util.asNode(zielId) AS ziel, Value, Typ
ORDER BY Value
MATCH (ziel:Haltestelle)-[:BEFINDET_SICH_IN]->(o:Ortschaft)-[:GEHOERT_ZU]->(k:Kanton)
WITH o, k, collect({Value: Value, Typ: Typ}) AS values
RETURN
	k.Name,
    o.Name,
    k.Mietpreis,
    k.Steuerfuss,
    k.Einbrüche,
    head([n IN values WHERE n.Typ = 'Reisedauer' | n.Value]) AS Reisedauer,
    head([n IN values WHERE n.Typ = 'Ausfälle' | n.Value]) AS Ausfälle,
    head([n IN values WHERE n.Typ = 'Verspätung' | n.Value]) AS Verspätung

And here's the corresponding execution plan: