Creating masses of new nodes takes a very long time
Hi There,
I am a little desperate with the following problem:
I have written a Python routine that executes 4.8 million MERGE statements to create nodes in a DB. The script uses the neo4j Python module to execute the MERGE clauses.
The statements look roughly like this:
```cypher
MERGE (f:flights { date: date({year: 2015, month: 1, day: 1}), day_of_week: "Thursday", airline: "EV", flight_number: "5103", tail_number: "846AS", origin: "ABE", destination: "DTW", scheduled_departure: time({hour: 6, minute: 0}), real_departure: time({hour: 5, minute: 52}), taxi_out: toInteger(12), air_duration: toInteger(84), taxi_in: toInteger(5), scheduled_arrival: time({hour: 7, minute: 53})}) RETURN f
```
I did the following:
When I started this Python script I realized that, with an execution time of 100 ms per statement, it would take five and a half days to complete the node creation.
Then I tried exporting the MERGE statements to a text file and executing it from cypher-shell. No performance improvement.
Then I found the apoc.periodic.commit procedure, but could not get it to work (I could not find any use cases other than with LOAD CSV). I tried something like this:
```cypher
call apoc.periodic.commit("
CREATE (:flights { date: date({year: 2015, month: 1, day: 1}), day_of_week: 'Thursday', airline: 'EV', time({hour: 7, minute: 53})})
CREATE (:flights { date: date({year: 2015, month: 4, day: 14}), day_of_week: 'Tuesday', airline: 'WN', time({hour: 19, minute: 28})})
", limit:1000)
```
which produced the following error:
```
Invalid input '1000': expected "%", "(", "YIELD" or an identifier (line 4, column 10 (offset: 845))
"", limit:1000) YIELD *"
```
That is as far as I got. Does anybody have an idea how to improve the performance (or how to fix the error, if the commit procedure would be a solution)?
The syntax for apoc.periodic.commit is incorrect, as the configuration parameters are passed as a map. Replace limit:1000 with {limit:1000}.
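For example, the corrected call shape would look something like this (a sketch only; properties abbreviated, connection details are placeholders):

```python
# Sketch: the same call with the apoc.periodic.commit config passed as a map.
# Properties are abbreviated and the credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

corrected = """
CALL apoc.periodic.commit("
  CREATE (:flights {date: date({year: 2015, month: 1, day: 1}), day_of_week: 'Thursday', airline: 'EV'})
", {limit: 1000})
"""

# NOTE: this fixes only the parse error; as discussed further down the thread,
# apoc.periodic.commit also requires the statement itself to contain a LIMIT.
with driver.session() as session:
    session.run(corrected).consume()
```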
When you merge a node, it must first perform a match to determine if the node exists. In your case, you are matching on all of your node attributes, since they are all part of the implied where clause, i.e., in the node's attribute map. I suspect you are not using any indexes, so every node with the same label needs to be compared for each newly merged node. This process will take longer and longer with every node added.

What is typically done is to have a unique identifier for the node that is used for the match; the remaining attributes are then set on the created or matched node. This approach allows an index to be used for the unique identifier.

Do you have an attribute, or a combination of attributes, you can consider as a unique identifier? If so, you can create an index on the unique identifier, or a composite index on the combination of identifiers that is unique. You would then refactor your code to include only the unique identifier(s) in the node's attribute map when merging, and move the remaining attributes to a SET clause, or to a combination of ON MATCH and ON CREATE clauses if the setting differs when creating vs. matching.
One comment: you don't need to store the day of the week, as it can be derived from the date, and you can combine the date and time into a single timestamp. You do have a flight number, and I think the combination of flight number and date is unique.
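For example, something like this (a sketch, untested; the values come from the sample row above):

```python
# Sketch: derive the weekday from the date and combine date + time into one
# datetime property, instead of storing day_of_week and the time separately.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

query = """
WITH date({year: 2015, month: 1, day: 1}) AS d, time({hour: 6, minute: 0}) AS t
RETURN d.dayOfWeek AS day_of_week,            // 4, i.e. Thursday (ISO: Monday = 1)
       datetime({date: d, time: t}) AS scheduled_departure
"""

with driver.session() as session:
    print(session.run(query).single())
```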
Please consider that all the properties used in a MERGE statement will be used to determine the pre-existence of the node, and therefore should be covered by an index. Consider breaking your merge statement into:

```cypher
MERGE (f:flights {prop1: val1, prop2: val2})
ON CREATE SET f += {prop3: val3, ..., propn: valn}
```

while having an index on:

```cypher
CREATE INDEX myIndexName IF NOT EXISTS
FOR (n:flights)
ON (n.prop1, n.prop2)
```
Thanks so much for the quick and comprehensive answers. Since this is an initial load of the DB, I have swapped the MERGE clause for a CREATE one, which significantly improves the import (it should be done now in 7 hours; I should have tried this before). Still, I will start using indexes for my further work, as you both suggested, to enhance performance.
One note to Gary: I tried writing the limit argument in curly brackets, but it gave me another error:

```
Failed to invoke procedure apoc.periodic.commit: Caused by: java.lang.IllegalArgumentException: the statement sent to apoc.periodic.commit must contain a limit
```
As I said, I could not find any docs describing the periodic-commit functionality other than in use with LOAD CSV. I wonder whether it can be used the way I did, by putting two or more statements inside the parentheses and then stating the limit.
Thanks
Alex
My bad... your query is only creating two nodes, so apoc.periodic.commit is not necessary. It is used when you have a query that results in a lot of matches or merges and you want the operations on those results batched into smaller transactions, so you don't need as much memory.
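For reference, a valid use looks something like this (a sketch; the obsolete flag and connection details are made up for illustration). The inner statement must consume $limit and return a count; the procedure repeats it until that count is 0:

```python
# Sketch: a valid apoc.periodic.commit pattern. The inner statement must use
# $limit and RETURN a count; the procedure re-runs it until the count is 0,
# committing after each batch. The 'obsolete' property is hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

batch_delete = """
CALL apoc.periodic.commit("
  MATCH (f:flights) WHERE f.obsolete = true
  WITH f LIMIT $limit
  DETACH DELETE f
  RETURN count(*)
", {limit: 1000})
"""

with driver.session() as session:
    session.run(batch_delete).consume()
```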
If you are loading from a csv file, apoc.periodic.iterate is appropriate. Here is an example where one row is read from the file and used to create one flight. The first statement generates the rows of data to batch; the second statement is executed for each row of data in each batch.
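Something like this (a sketch; the file name and column names are placeholders, so adjust them to your data):

```python
# Sketch: apoc.periodic.iterate batching node creation from a CSV file.
# The first statement streams rows; the second creates one node per row.
# 'flights.csv' and the column names are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

load_flights = """
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///flights.csv' AS row RETURN row",
  "CREATE (:flights {
     date: date(row.date),
     airline: row.airline,
     flight_number: row.flight_number,
     origin: row.origin,
     destination: row.destination})",
  {batchSize: 1000, parallel: false})
"""

with driver.session() as session:
    session.run(load_flights).consume()
```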
Besides adding the index, you need to start using parameters instead of constants in your cypher. When you use constants, every cypher is unique, and the optimizer has to optimize and compile each one. With parameters, the cypher is optimized and compiled once, then cached so it can be reused by subsequent calls to the same cypher with different parameters. This should cut your execution time considerably, from 7 hours to probably an hour or two, depending on your hardware.
Hi Myron,
thanks a lot for that additional hint. I have now loaded the data; in the end it took around 12 hours.
However, I have not understood what you mean by "parameters". In my CREATE clause I have to describe the label and property values explicitly with string values (what you probably refer to as "constants"). How can I insert these with parameters instead?
Any comment is much appreciated.
Best Regards,
Alex
I am not a python programmer, but within the Neo4j driver there should be a way to pass parameters with a cypher. Your cypher would then reference the parameters using dollar ($) variables. You will have to do some research on how to do this in python, but in nodejs it is something like tx.run(query, params). Here is a short example of the idea:
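(A sketch in Python; the connection details and property names are placeholders:)

```python
# Sketch: parameterized Cypher with the Python driver. The query text stays
# constant, so the server compiles it once and caches the plan.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

query = """
CREATE (f:flights {
  date: date($date),
  airline: $airline,
  flight_number: $flight_number,
  origin: $origin,
  destination: $destination})
"""

params = {
    "date": "2015-01-01",
    "airline": "EV",
    "flight_number": "5103",
    "origin": "ABE",
    "destination": "DTW",
}

with driver.session() as session:
    session.run(query, params)  # same query text, different parameters each call
```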
Once you get parameters working, I'd recommend passing an array of records as a parameter and using UNWIND on that array in the cypher. You won't be able to put all 4.8 million records in one array, but you could pass 1000 at a time, so the cypher runs about 4,800 times instead. This will cut your time down even more.
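Roughly like this (a sketch; names and batch size are illustrative):

```python
# Sketch: batch records client-side and UNWIND each batch server-side, so one
# query call creates up to 1000 nodes instead of one. Names are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

query = """
UNWIND $rows AS row
CREATE (:flights {
  date: date(row.date),
  airline: row.airline,
  flight_number: row.flight_number})
"""

rows = [
    {"date": "2015-01-01", "airline": "EV", "flight_number": "5103"},
    # ... the rest of the records ...
]

with driver.session() as session:
    for i in range(0, len(rows), 1000):
        session.run(query, {"rows": rows[i:i + 1000]})
```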
I was not aware of this function in Neo4j. So I can create a cypher script with an UNWIND clause over a list to loop a CREATE clause, and then utilize the LOAD facility to bulk-execute it. That is indeed helpful.
The "tx.run" API exists in Python too, but when I loop over it, I create sequential CREATE statements and execute them sequentially. No win there, but I will try the hack you sketched above.