How to match one node and stop the search, then work only on that node to update relationships, while importing large data?

wedoso
Node

I am working with a large dataset on Neo4j 4.1 Community Edition. Every hour, more than 2 million relationships and 10K nodes need to be updated or created. It already takes me around one hour to update about 600K relationships in my graph, even though I have carefully prepared separate node and relationship CSV files with no duplicate rows.

The process I do to update the graph hourly is:

  1. MERGE the nodes first by importing the node CSV file. (This process is fast.)
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///node.csv' as line
WITH line
MERGE (status:Event {event_name: line.node, type: line.node_attribute});
  2. MATCH the two endpoint nodes and MERGE the relationship. (This process is very slow.)
:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///relationship.csv' as line
WITH line
MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
MERGE (prev_status)-[t:TO {attribute_0: coalesce(line.edge_attribute_0, 'None'), attribute_1: coalesce(line.edge_attribute_1, 'None'), dt:date('2020-10-26'), weight: toFloat(line.edge_weight)}]->(status);

In the graph, there will never be more than one node with the same type and properties. I am wondering if I can speed up the relationship update process by stopping the search after the first matching node is found and working only on that node. My current Cypher keeps scanning for other nodes even though I know there won't be another node satisfying the given condition. I suspect that if I can avoid that, the update process would be faster, but I don't know whether it actually matters, or whether something else is making this process so slow.
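One way to check whether each MATCH is examining every :Event node rather than seeking a single one is to profile an isolated lookup. This is only a diagnostic sketch; the property values below are hypothetical placeholders:

```
// Without an index, the plan typically shows NodeByLabelScan + Filter
// (every :Event node examined); with an index it shows NodeIndexSeek.
PROFILE
MATCH (status:Event {event_name: 'some_event', type: 'some_attribute'})
RETURN status;
```

Comparing the `db hits` and operator names in the two plans shows whether the lookups are the bottleneck.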

If not, is there any suggestion to make this update faster? I did a lot of research on it, but I haven't found a solution yet.

1 REPLY

wedoso
Node

I have figured out a way to improve the performance by adding indexes on the node properties, which makes the relationship update process much faster than before (~10 minutes to update 2 million relationships):

CREATE INDEX FOR (e:Event)
ON (e.event_name);
CREATE INDEX FOR (e:Event)
ON (e.type);

I don't know if I am using indexes correctly here, but they do improve efficiency. In the Neo4j docs I also saw composite indexes, but I don't know whether a composite index would fit my use case. Hopefully the docs can include more details and examples.
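For reference, since every lookup filters on both `event_name` and `type` together, a composite index over both properties might look like the sketch below (Neo4j 4.x syntax; the index name `event_name_type` is my own choice, and whether it beats the two single-property indexes would need to be measured):

```
// One index covering both lookup properties, so a MATCH that
// specifies event_name AND type can be answered by a single seek.
CREATE INDEX event_name_type
FOR (e:Event)
ON (e.event_name, e.type);
```

Note that a composite index is only used when the query supplies all of its properties, which is the case in the relationship-loading query above.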

However, another issue came up: when I LOAD CSV a file with around 2 million rows (relationships), the query reaches completed status and returns after finishing only about 520K relationship updates. I am wondering whether there is any limit on the LOAD CSV operation, even though I am using PERIODIC COMMIT. My current workaround is to run the exact same query again to update the rest of the relationships in the CSV file.
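If the APOC plugin is installed, one alternative worth trying (a sketch, not something I have verified against this exact dataset) is to drive the batching with `apoc.periodic.iterate` instead of PERIODIC COMMIT; it reports failed batches and error messages, which can help diagnose a silent early stop:

```
// First statement streams the rows; second is applied per row in
// committed batches of 10000. YIELDed counters reveal any failures.
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///relationship.csv' AS line RETURN line",
  "MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
   MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
   MERGE (prev_status)-[t:TO {attribute_0: coalesce(line.edge_attribute_0, 'None'),
                              attribute_1: coalesce(line.edge_attribute_1, 'None'),
                              dt: date('2020-10-26'),
                              weight: toFloat(line.edge_weight)}]->(status)",
  {batchSize: 10000, parallel: false})
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages;
```

A nonzero `failedBatches` or non-empty `errorMessages` would indicate rows are being dropped rather than the file being truncated.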
