I am working on a large dataset using Neo4j 4.1 Community Edition. Every hour, more than 2 million relationships and 10K nodes need to be updated or created. It already takes me around one hour to update around 600K relationships in my graph, even though I have carefully prepared the node CSV file and the relationship CSV file separately, without any duplicate rows in either file.
The process I do to update the graph hourly is:
- MERGE the nodes first by importing the node CSV file. (This process is fast.)
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///node.csv' as line
WITH line
MERGE (status:Event {event_name: line.node, type: line.node_attribute});
- MATCH the two nodes and MERGE the relationship. (This process is very slow.)
:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///relationship.csv' as line
WITH line
MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
MERGE (prev_status)-[t:TO {attribute_0: coalesce(line.edge_attribute_0, 'None'), attribute_1: coalesce(line.edge_attribute_1, 'None'), dt:date('2020-10-26'), weight: toFloat(line.edge_weight)}]->(status);
In the graph, there will never be more than one node with the same type and properties. I am wondering whether I can speed up the relationship update by stopping the search as soon as the first matching node is found and working only on that node. As far as I can tell, my current Cypher keeps looking for other nodes even though I know no other node can satisfy the given condition. My guess is that avoiding this extra search would make the update faster, but I don't know whether it actually matters, or whether something else is the bottleneck that makes this process so slow.
If not, are there any suggestions to make this update faster? I did a lot of research on it, but I haven't found a solution yet.
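One thing I have not verified yet is whether the two MATCH lookups are backed by an index. Below is a minimal sketch of what I could check, assuming a composite index on the two properties used in the MATCH clauses. The index name `event_name_type` and the literal values in the EXPLAIN query are placeholders of my own, and I have not confirmed that a missing index is the actual cause of the slowness.

```cypher
// List the existing indexes (procedure available in Neo4j 4.1).
CALL db.indexes();

// Hypothetical composite index on the properties used to look up :Event nodes.
// The name event_name_type is a placeholder, not something already in my database.
CREATE INDEX event_name_type FOR (e:Event) ON (e.event_name, e.type);

// Inspect the plan for a single lookup without running it.
// 'some_event' and 'some_type' are placeholder values.
EXPLAIN
MATCH (prev_status:Event {event_name: 'some_event', type: 'some_type'})
RETURN prev_status;
```

If the plan shows a NodeIndexSeek rather than a NodeByLabelScan, the lookup is using the index; with a label scan, every row of the relationship CSV would scan all Event nodes twice.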