I'm trying to count the amount of rows that Neo4j will return but the count (or the query) is very slow.
Version 1 (70 sec):
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WHERE order.timestamp >= 1632434400 AND size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) <= 20
WITH order
MATCH (order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
RETURN COUNT(product);
Version 2 (68 sec.):
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WITH size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) AS amount
WHERE order.timestamp >= 1632434400 AND amount <= 20
RETURN SUM(amount)
Using Neo4j 4.4 community with about 800000 orders and about 17000000 order lines.
Is there a more efficient way to count the rows?
These are the indexes:
CREATE INDEX idx_order_torder_id FOR (n:Order) ON (n.order_id);
CREATE INDEX idx_order_timestamp FOR (n:Order) ON (n.timestamp);
CREATE INDEX idx_person_person_id FOR (n:Person) ON (n.person_id);
CREATE INDEX idx_product_product_id FOR (n:Product) ON (n.product_id);
Is it possible to remove redundant data from the query (i.e. to remove Person in first match et.c.)?
Maybe avoiding using AND, but instead use several WITH, will increase speed?
Maybe using size() function on only the product node (or on a property of the product node) will increase speed?
Maybe MATCH-ing on timestamp in the beginning only, will increase speed, since maybe there are not so many nodes matching this criteria (i.e. that have a higher timestamp). You already have an index on timestamp.
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order
MATCH (order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
WITH size(product) AS amount
WHERE amount <= 20
RETURN SUM(amount)
I guess it depends on your data. Maybe you can also refactor your data to increase speed, like this:
MATCH (order)-[:HAS_PRODUCT]->(product:Product)
It seems like your data is imported from a relational database with a many-to-many joining table, that maybe is not necessary in neo4j.
Thank you for your answer martin3. I tried your query but got an error because you can't use size on a node (product variable in this case). Instead of using size, I used count. And because every order line has one product, i can skip the counting of the relation order lines to products:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order
MATCH (order)<-[:HAS_ORDER]-(orderLine:OrderLine)
WITH COUNT(orderLine) as productCount
WHERE productCount <= 20
RETURN SUM(productCount);
This query took 0m17.342s
But i managed to snoop some seconds with the following query:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order, size((order)<-[:HAS_ORDER]-(:OrderLine)) AS amount
WHERE amount <= 20
RETURN SUM(amount);