How to match one node and stop to work on it to update relationship while importing large data?

wedoso · November 17, 2020, 6:35pm

I am working on a large dataset using Neo4j 4.1 Community Edition. Every hour, more than 2 million relationships and 10K nodes need to be created or updated. It already takes around one hour to update roughly 600K relationships in my graph, even though I have carefully prepared separate node and relationship CSV files with no duplicate rows.

The process I do to update the graph hourly is:

MERGE the node first by importing the node CSV file. (This process is fast.)

:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///node.csv' as line
WITH line
MERGE (status:Event {event_name: line.node, type: line.node_attribute});

MATCH two nodes and MERGE the relationship. (This process is very slow.)

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///relationship.csv' as line
WITH line
MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
MERGE (prev_status)-[t:TO {
  attribute_0: coalesce(line.edge_attribute_0, 'None'),
  attribute_1: coalesce(line.edge_attribute_1, 'None'),
  dt: date('2020-10-26'),
  weight: toFloat(line.edge_weight)
}]->(status);

In the graph, there is never more than one node with the same type and properties. I am wondering if I can speed up the relationship update by stopping the search once the first matching node is found and then working only on that node. My current Cypher keeps looking for other nodes even though I know no other node satisfies the given condition. I guess that if I can find a way to do that, the update would be faster, but I don't know whether it actually matters, or whether something else is the bottleneck that makes this process so slow.

If not, are there any suggestions for making this update faster? I have done a lot of research on it, but I haven't found a solution yet.

wedoso · November 18, 2020, 8:16pm

I have figured out a way to improve performance by adding indexes on the node properties, which makes the relationship update much faster than before (~10 minutes to update 2 million relationships):

CREATE INDEX FOR (e:Event)
ON (e.event_name);
CREATE INDEX FOR (e:Event)
ON (e.type);

I don't know if I am using indexes correctly here, but it does improve efficiency. In the Neo4j docs I also see composite indexes, but I don't know whether a composite index fits my use case. Hopefully the docs can include more details and examples.
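
For reference, a composite index covering both properties used in the MATCH clauses might look like the sketch below. The index name `event_name_type` and the literal values in the EXPLAIN query are hypothetical, and whether the planner actually prefers this index over the two single-property indexes is an assumption worth checking against the query plan.

```cypher
// Composite index on both lookup properties (the index name is illustrative).
CREATE INDEX event_name_type FOR (e:Event)
ON (e.event_name, e.type);

// Inspect the plan for a lookup shaped like the one in the import query.
// The property values here are placeholders, not real data.
EXPLAIN
MATCH (e:Event {event_name: 'some_event', type: 'some_type'})
RETURN e;
```

If the plan shows an index seek on `:Event(event_name, type)`, the composite index is being used for the equality lookup on both properties.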

However, another issue has come up: when I LOAD CSV a file of around 2 million rows (relationships), the query reports completion and returns after updating only about 520K relationships. I am wondering if there is a limit on the LOAD CSV operation, even though I am using PERIODIC COMMIT. My current workaround is to run the exact same query again to update the rest of the relationships in the CSV file.
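
One way to narrow this down is to count how many rows LOAD CSV actually parses from the file, without writing anything. This is only a diagnostic sketch, and `rows_seen` is an illustrative alias. If the count is well below the expected ~2 million, the file contents (for example, unbalanced quote characters merging several lines into one field) may be the cause rather than a LOAD CSV limit; that explanation is an assumption and is not confirmed anywhere in this thread.

```cypher
// Read-only check: how many rows does LOAD CSV see in the file?
LOAD CSV WITH HEADERS FROM 'file:///relationship.csv' AS line
RETURN count(line) AS rows_seen;
```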
