Counting rows performance

I'm trying to count the amount of rows that Neo4j will return but the count (or the query) is very slow.

Version 1 (70 sec):

MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WHERE order.timestamp >= 1632434400 AND size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) <= 20
WITH order
MATCH (order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
RETURN COUNT(product);

Version 2 (68 sec.):

MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WITH size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) AS amount
WHERE order.timestamp >= 1632434400 AND amount <= 20
RETURN SUM(amount)

Using Neo4j 4.4 community with about 800000 orders and about 17000000 order lines.

Is there a more efficient way to count the rows?

These are the indexes:

CREATE INDEX idx_order_torder_id FOR (n:Order) ON (n.order_id);
CREATE INDEX idx_order_timestamp FOR (n:Order) ON (n.timestamp);
CREATE INDEX idx_person_person_id FOR (n:Person) ON (n.person_id);
CREATE INDEX idx_product_product_id FOR (n:Product) ON (n.product_id);

The amount of rows are equal to 4269011.

The EXPLAIN plan:

Maybe this can increase speed:

  • Is it possible to remove redundant data from the query (i.e. to remove Person in first match et.c.)?
  • Maybe avoiding using AND, but instead use several WITH, will increase speed?
  • Maybe using size() function on only the product node (or on a property of the product node) will increase speed?
  • Maybe MATCH-ing on timestamp in the beginning only, will increase speed, since maybe there are not so many nodes matching this criteria (i.e. that have a higher timestamp). You already have an index on timestamp.
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order
MATCH (order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
WITH size(product) AS amount
WHERE amount <= 20
RETURN SUM(amount)

I guess it depends on your data. Maybe you can also refactor your data to increase speed, like this:

MATCH (order)-[:HAS_PRODUCT]->(product:Product)

It seems like your data is imported from a relational database with a many-to-many joining table, that maybe is not necessary in neo4j.

Thank you for your answer martin3. I tried your query but got an error because you can't use size on a node (product variable in this case). Instead of using size, I used count. And because every order line has one product, i can skip the counting of the relation order lines to products:

MATCH (order:Order) 
WHERE order.timestamp >= 1632434400 
WITH order 
MATCH (order)<-[:HAS_ORDER]-(orderLine:OrderLine) 
WITH COUNT(orderLine) as productCount 
WHERE productCount <= 20 
RETURN SUM(productCount);

This query took 0m17.342s

But i managed to snoop some seconds with the following query:

MATCH (order:Order) 
WHERE order.timestamp >= 1632434400
WITH order, size((order)<-[:HAS_ORDER]-(:OrderLine)) AS amount 
WHERE amount <= 20 
RETURN SUM(amount);

This query took 0m15.675s