Performance issue when importing CSV relationships

eladpr · January 28, 2019, 2:36pm

Hi,
I'm doing a POC which raised the following problem (couldn't find an answer in the forums):
I'm trying to import a CSV containing 10M relationships into a DB pre-populated with about 1.5M nodes and with appropriate indexes (or so I think).
The import rate starts off fine (~1K relationships per second) but quickly deteriorates. I estimate it will take weeks to import the dataset. I have read that people had no problems importing such datasets in less than an hour.

My setup:
Neo4j 3.5.1
CentOS 6 64-bit running in a VM with 8GB RAM and 4 CPUs.

I see that Neo4j is consuming 100% CPU during the import (and 2.4G of RAM).

Here is an example (that just counts) for just 150000 lines:

PROFILE
LOAD CSV WITH HEADERS FROM "file:/home/Elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})-[]-(device:Device {deviceID: toInteger(row.DeviceID)})
RETURN count(*);

If I run it on 100000 lines (about 66% of the above), it runs 4 times faster. I don't understand why this happens. I thought it should scale linearly.

I have unique indices on Subnet(subnetID) and Device(deviceID)
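
For reference, unique constraints like these are usually declared as below in Neo4j 3.5 syntax. This is only a sketch based on the labels and properties named above, not the exact statements run in this setup; the `CALL` lines list what the database actually has.

```cypher
// Assumed declarations (Neo4j 3.5 syntax); a unique constraint also creates a backing index.
CREATE CONSTRAINT ON (s:Subnet) ASSERT s.subnetID IS UNIQUE;
CREATE CONSTRAINT ON (d:Device) ASSERT d.deviceID IS UNIQUE;

// List existing indexes and constraints to confirm they are present.
CALL db.indexes();
CALL db.constraints();
```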

I'm attaching the output of the above profiled import

Any input will be appreciated

Edit 1:
Here is the actual import:

LOAD CSV WITH HEADERS FROM "file:/home/elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})
MATCH (device:Device {deviceID: toInteger(row.DeviceID)})
MERGE (device)-[:CONNECTS_TO{low:toInteger(row.Low), high:toInteger(row.High), size:toInteger(row.Size)}]->(subnet);

I think the source of the problem is related to the first query (the one that just counts), as it clearly runs a lot slower when given somewhat more lines. I assume the CSV is processed line by line, so I expected the run time to grow linearly with the number of lines, but clearly it does not.
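
A possible way to check the line-by-line assumption is to ask for the plan without running the query. The sketch below simply prefixes the import above with `EXPLAIN`; the operators named in the comments are suggestions for what to look for, not findings from this thread.

```cypher
// EXPLAIN returns the execution plan without executing the query, so no data is changed.
// Index-backed lookups show up as NodeUniqueIndexSeek or NodeIndexSeek;
// a NodeByLabelScan would suggest a lookup is not using the index.
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:/home/elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})
MATCH (device:Device {deviceID: toInteger(row.DeviceID)})
MERGE (device)-[:CONNECTS_TO{low:toInteger(row.Low), high:toInteger(row.High), size:toInteger(row.Size)}]->(subnet);
```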

david_allen · January 28, 2019, 2:45pm

What is the Cypher that creates the relationships and has the performance problem? Please include that code sample too.

Do you need to match the relationship between the two items you're already matching by ID? Could it ever occur that nodes with those IDs exist but are not already linked? Since you're matching an undirected relationship with no type and not binding it to a variable, that relationship match seems unnecessary.

One small hint you can give to Cypher: if you know there are fewer subnets than devices (I'm guessing), you can do something like this:

MATCH (subnet:Subnet { subnetID: toInteger(row.SubnetID) })
WITH row, subnet
MATCH (subnet)-[]-(device:Device {deviceID: toInteger(row.DeviceID)})
(...)

This may change the plan by forcing the more selective match first.
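
Put together with the counting query from the original post, the hint might look like the sketch below. It assumes the same file path and property names as the question, and it carries `row` through the `WITH` clause so that `row.DeviceID` is still in scope for the second `MATCH`.

```cypher
PROFILE
LOAD CSV WITH HEADERS FROM "file:/home/Elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
// Look up the subnet first (assumed here to be the more selective side).
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})
// Keep row in scope; a bare WITH subnet would drop it.
WITH row, subnet
MATCH (subnet)-[]-(device:Device {deviceID: toInteger(row.DeviceID)})
RETURN count(*);
```

Comparing the PROFILE output of this version with the original shows whether the planner actually changed the order of the lookups.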

eladpr · January 28, 2019, 3:11pm

Thanks David,
I've added the actual import to my post.
