Optimising property access

Hello there

I was reading the optimising property access chapter in the Query Tuning course, and I wanted to double-check something with you:

If the first elapsed time is the time for the query to execute, and the second is the time for all of the query results to be streamed to the end client (through a network or not), does that mean:

If there is an eager aggregation or operation in the query, the gap between the first and second elapsed times will be shorter, because the query will have had to process almost every single row (depending on the query) before being able to stream the first result?

If there is no eager operation or aggregation at all (and no implicit infinite-loop protection), the query results can be produced for a single row immediately, which would explain why some queries show a 1 ms query time but usually a much longer time to stream the results.
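To make sure I'm understanding correctly, here's a toy Python sketch (plain Python, not Neo4j internals) of what I mean: an eager aggregation has to consume every input row before its first output row exists, while a lazy pipeline can yield its first row right away.

```python
def rows():
    # Stands in for a stream of rows produced by earlier operators.
    for i in range(1_000_000):
        yield i

def lazy_pipeline():
    # Each output row depends only on one input row,
    # so results can start streaming immediately.
    for r in rows():
        yield r * 2

def eager_pipeline():
    # collect()-style aggregation: every input row must be
    # consumed before the single output row can be emitted.
    total = sum(rows())
    yield total

# First row from the lazy pipeline arrives after touching one input row;
# first row from the eager pipeline arrives only after reading all rows.
first_lazy = next(lazy_pipeline())
first_eager = next(eager_pipeline())
```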

Thanks

Hi Gabriel,

That's my general understanding.

As you said, it does depend upon the query. Now that we have subqueries (introduced in Neo4j 4.1), we have some additional flexibility. Subqueries can be used to scope aggregations (they are called per row), so aggregations within a subquery do not require processing all input rows outside of the subquery.

As a quick example:

MATCH (m:Movie)
CALL {
 WITH m
 MATCH (m)<-[:ACTED_IN]-(actor:Person)
 RETURN collect(actor) AS actors
}
RETURN m, actors

Because the subquery executes per row (in this case, per m), the first execution of a collect() happens for the expansion on the first movie node, and doesn't require any processing on any other movie node in order to complete.

The data being aggregated is also much less, and will complete quicker. In this case it will be the actors for an individual movie, not all actors for all movies. We're trading off a single large aggregation across the entire input set, for many aggregations that each execute on much smaller input sets.
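For comparison, the same aggregation without a subquery might look like this (a sketch against the standard movies example dataset):

```cypher
// Without a subquery, collect() is a single eager aggregation over
// every (movie, actor) row: no movie can be returned until all
// movies have been expanded and grouped.
MATCH (m:Movie)<-[:ACTED_IN]-(actor:Person)
RETURN m, collect(actor) AS actors
```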

I do want to quickly say that with Cypher alone you can't get caught in an infinite loop, as the relationship isomorphism used by Cypher prevents a relationship from being used more than once per path.
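For example, even a variable-length pattern over a cyclic graph terminates, because each relationship can appear at most once per matched path (a sketch; the label and relationship type are just illustrative):

```cypher
// Even if the KNOWS relationships form a cycle, this cannot loop
// forever: relationship isomorphism stops a path from reusing the
// same relationship, so every path is finite.
MATCH p = (a:Person)-[:KNOWS*]->(b:Person)
RETURN p
```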

Thank you @andrew.bowman

Great and precise answer; your subquery example should be added to the Reducing Cardinality chapter. I understood something important about subqueries just from this example.

As I understood it, the subquery version will allow streaming to start sooner than the regular version, which will wait for all the actors of all the movies to be aggregated before streaming a single row to the client.

But I can't tell whether that's actually the case when I look at the query plan.

Is there any way to monitor the behaviour you just explained?

I'm not aware of a way to monitor it.

For plan analysis, the key is recognizing the Apply operator, which indicates that the right-hand steps in the plan are executed per incoming row from the left-hand side.

From this, we can tell that the EagerAggregation is on the right-hand side of the Apply, which verifies the scoping of the aggregation. As soon as the first row from the label scan finishes its run through the Apply, the results from it are ready (and then it's up to the network and driver code to decide when/how to stream the results).

As there are no further aggregations happening outside the Apply, we know that we don't have to wait for ALL actors involved in the query to aggregate, so we end up streaming the data sooner.
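You can see this for yourself by prefixing the query with PROFILE and inspecting the resulting plan (a sketch, reusing the earlier example):

```cypher
// The profiled plan should show an Apply operator with the
// EagerAggregation on its right-hand side: one small aggregation
// per movie, rather than one big aggregation at the end.
PROFILE
MATCH (m:Movie)
CALL {
  WITH m
  MATCH (m)<-[:ACTED_IN]-(actor:Person)
  RETURN collect(actor) AS actors
}
RETURN m, actors
```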


I just finished the Query Tuning course. I guess if I want to become a real query-tuning master, my best option is to read more about each operator in the manual.

Thank you again, have a nice day!