I am loading a CSV file to create nodes in Neo4j. With a CSV of around 1,000 rows it took 1 second for the nodes to be created, but when I increased my dataset to 3,000 rows it takes 15 seconds.
Can someone please suggest how to reduce this time, and why there is such a difference at 3,000 rows?
What is the best way to create a graph from a CSV with a large dataset?
Below is the query that I use to create my graph nodes and set the properties:
LOAD CSV WITH HEADERS FROM 'file:///storage.csv' AS line
MERGE (a:Storage {name: line.code + " " + date(line.Date).month + "-" + date(line.Date).year + " " + line.Product})
ON CREATE SET a.Incoming_Stock = toFloat(line.incoming_stock),
              a.Opening_Inv_Physical = toFloat(line.opening_inventory_physical),
              a.Target_Closing_Inv = toFloat(line.target_closing_inventory),
              a.Outflow_Requirement = toFloat(line.outflow_requirement),
              a.date = date(line.Date),
              a.Product = line.Product,
              a.Node = line.code;
No, I haven't created a UNIQUE CONSTRAINT, but the name property I build for my nodes will always be unique, since that's how I generate my input CSV. So I know that duplicate nodes will not be created.
Just now I tried creating a unique constraint on my Storage nodes, after importing the CSV and creating the nodes. But how will this reduce the time taken to load the CSV?
It's taking 15 seconds just to create 3,000 nodes in my graph.
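For reference, a uniqueness constraint on that property looks something like this (the exact syntax depends on your Neo4j version; older versions use `CREATE CONSTRAINT ON (s:Storage) ASSERT s.name IS UNIQUE`):

```cypher
// Neo4j 4.4+ syntax; run this once, before LOAD CSV
CREATE CONSTRAINT storage_name IF NOT EXISTS
FOR (s:Storage)
REQUIRE s.name IS UNIQUE;
```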
Hi,
My nodes are now created within milliseconds after I changed my Cypher query from MERGE to CREATE, but creating the relationships takes around 60 seconds. Any suggestions on how to speed up relationship creation from the CSV?
Below is the code that I have used:
LOAD CSV WITH HEADERS FROM 'file:///transport_laporte_db10.csv' AS line
MATCH (sender:Storage {name: line.sender_node + " " + date(line.sender_date).month + "-" + date(line.sender_date).year + " " + line.Product})
MATCH (receiver:Storage {name: line.receiver_node + " " + date(line.receiver_date).month + "-" + date(line.receiver_date).year + " " + line.Product})
MERGE (sender)-[rel:transport {mode: line.mode, lead_time: toInteger(line.lead_time), quota: toInteger(line.quota)}]->(receiver);
When there is no index present, then, per row, Cypher will do a label scan, touching every single :Storage node and checking its properties to see whether a matching node exists.
So if you have 10,000 :Storage nodes in the database and 3,000 rows in the CSV, it will perform 3,000 label scans, meaning that it will ultimately be doing 3,000 * 10,000 = 30,000,000 node comparisons. The loading time therefore grows linearly with (the number of nodes with the given label) * (the number of rows in your CSV), and that's only considering a single MERGE. If there are multiple MERGEs (or MATCHes) on nodes that aren't index-backed, the problem compounds.
By contrast, when an index is in place, only one index lookup is performed per row, so 3,000 index lookups, which are very fast.
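You can check which plan you're actually getting by prefixing the query with PROFILE: without an index you'll see a NodeByLabelScan operator in the plan, and with the uniqueness constraint in place you'll see a NodeIndexSeek instead. (The name value below is just an illustrative example; substitute one of your own.)

```cypher
// Inspect the execution plan for a single lookup.
// NodeByLabelScan = full scan per row; NodeIndexSeek = index-backed lookup.
PROFILE
MATCH (s:Storage {name: "A 5-2021 WidgetX"})
RETURN s;
```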