neo4j version: Community 4.2.4
dbms.memory.heap.initial_size=24G
dbms.memory.heap.max_size=24G
dbms.memory.pagecache.size=8G
desktop version: 1.4.3
Hello,
What advice or suggestions would you have for improving my Cypher query?
Before posting this question, I read several sources on Cypher tuning. From these readings, I learned about several areas where I could change my query to improve it.
Here's an example of the offending query taking 11 hours to finish:
CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MATCH (p:Person {person_id: row[0]}), (b:Business:NewLaw {business_id: row[1]})
MERGE (p)-[r:FIRST_LAW]->(b)
ON MATCH
SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los_Angeles'} )
ON CREATE
SET r.neo4jCreated = datetime( {timezone: 'America/Los_Angeles'} )
RETURN count(*)",
{batchSize: 100000, parallel: false})
I also have uniqueness constraints on the Person and Business nodes as follows:
CREATE CONSTRAINT PersonUnique ON (p:Person) ASSERT p.person_id IS UNIQUE;
CREATE CONSTRAINT BusinessUnique ON (b:Business) ASSERT b.business_id IS UNIQUE;
This query is run by a Python script that is scheduled for 21:00 every night to update my Neo4j database. Last night, the query took 11 hours to finish.
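For context, the nightly job essentially builds the date-stamped file name and assembles the apoc.periodic.iterate call before sending it to Neo4j. A simplified sketch of that assembly step (the helper names here are illustrative, not my exact script, and the real script sends the resulting string through the Neo4j Python driver):

```python
from datetime import date

def csv_name_for(d: date) -> str:
    """Daily files follow the newdata_MM_DD_YY.csv pattern."""
    return "newdata_" + d.strftime("%m_%d_%y") + ".csv"

def build_update_query(csv_name: str, batch_size: int = 100000, parallel: bool = False) -> str:
    """Assemble the apoc.periodic.iterate call for one daily CSV file."""
    # Outer statement: stream rows from the headerless CSV.
    load = (
        "CALL apoc.load.csv('file:///" + csv_name + "', {header:false}) "
        "YIELD list AS row RETURN row"
    )
    # Inner statement: match both nodes and MERGE the relationship.
    update = (
        "MATCH (p:Person {person_id: row[0]}), "
        "(b:Business:NewLaw {business_id: row[1]}) "
        "MERGE (p)-[r:FIRST_LAW]->(b) "
        "ON MATCH SET r.neo4jLastUpdate = datetime({timezone: 'America/Los_Angeles'}) "
        "ON CREATE SET r.neo4jCreated = datetime({timezone: 'America/Los_Angeles'}) "
        "RETURN count(*)"
    )
    return (
        'CALL apoc.periodic.iterate("' + load + '", "' + update + '", '
        "{batchSize: " + str(batch_size) + ", parallel: " + str(parallel).lower() + "})"
    )

query = build_update_query(csv_name_for(date(2021, 4, 8)))
print(query)
```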
The newdata_04_08_21.csv file is about 2 GB and has no header. It is provided to me daily with millions of rows, and its contents look similar to this:
joe-1234,electronics88,FIRST_LAW
jane-1234,retail145,FIRST_LAW
sam-5788,education179,FIRST_LAW
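To make the indexing in the Cypher concrete: each row parses into three fields, and the query references them by position, with row[0] as the person id and row[1] as the business id. A quick stdlib check on the sample lines above:

```python
import csv
import io

# A few representative lines from the headerless daily file.
sample = (
    "joe-1234,electronics88,FIRST_LAW\n"
    "jane-1234,retail145,FIRST_LAW\n"
    "sam-5788,education179,FIRST_LAW\n"
)

rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    # row[0] -> person_id, row[1] -> business_id, row[2] -> relationship type
    print(row[0], row[1], row[2])
```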
Based on what I learned from the Cypher tuning guides, it is best to avoid a cartesian product and to aggregate early. A better approach for my query would be this:
CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MATCH (p:Person)
WHERE p.person_id = row[0]
WITH p
MATCH (b:Business:NewLaw)
WHERE b.business_id = row[1]
WITH p, b
MERGE (p)-[r:FIRST_LAW]->(b)
ON MATCH
SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los_Angeles'} )
ON CREATE
SET r.neo4jCreated = datetime( {timezone: 'America/Los_Angeles'} )
RETURN count(*)",
{batchSize: 100000, parallel: false})
With this revised query, I avoid the cartesian product because I now have two separate MATCH clauses, each with a WHERE clause that looks up the unique node from the CSV row. I also collect the matched Person and Business nodes first with the WITH clauses, and then pass the two of them on to the MERGE clause.
Would this be an appropriate way to improve my Cypher query?
I tried to generate an EXPLAIN execution plan, but it is not very descriptive for APOC procedure calls.
I also experimented with switching the parallel parameter of apoc.periodic.iterate from false to true, but since this function updates relationships in my graph every day, I left it at false to avoid conflicting transactions.
Any advice here would be much appreciated.
Thanks