Performance issue when importing CSV relationships

performance
import
csv
index

(Eladpr) #1

Hi,
I'm doing a POC which raised the following problem (couldn't find an answer in the forums):
I'm trying to import a CSV containing 10M relationships into a DB pre-populated with about 1.5M nodes and the appropriate indexes (or so I think).
The import rate starts off fine (~1K relationships per second) but quickly deteriorates. I estimate it will take weeks to import the dataset. I have read that people had no problems importing such datasets in less than an hour.

My setup:
Neo4j 3.5.1
CentOS 6 (64-bit) running in a VM with 8 GB RAM and 4 CPUs.

I see that Neo4j consumes 100% CPU during the import (and about 2.4 GB RAM).

Here is an example (that just counts) for just 150000 lines:

PROFILE
LOAD CSV WITH HEADERS FROM "file:/home/Elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})-[]-(device:Device {deviceID: toInteger(row.DeviceID)})
RETURN count(*);

If I run it on 100000 lines (about 66% of the rows), it runs 4 times faster. I don't understand why this happens; I expected the time to scale linearly.

I have unique indices on Subnet(subnetID) and Device(deviceID)
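
For reference, uniqueness constraints of this kind are usually declared in Neo4j 3.5 roughly as sketched below. The exact statements used for this database are not shown in the post, so these are an assumption; the labels and property names are taken from the queries here.

```cypher
// Assumed declarations (Neo4j 3.5 syntax).
// A uniqueness constraint also creates a backing index on the property.
CREATE CONSTRAINT ON (s:Subnet) ASSERT s.subnetID IS UNIQUE;
CREATE CONSTRAINT ON (d:Device) ASSERT d.deviceID IS UNIQUE;

// List the indexes and their state to confirm both are ONLINE.
CALL db.indexes();
```

If either index is missing or not online, the lookups by ID would fall back to label scans, which could explain a slowdown like this.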


I'm attaching the output of the above profiled import

Any input will be appreciated

Edit 1:
Here is the actual import:

LOAD CSV WITH HEADERS FROM "file:/home/elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})
MATCH (device:Device {deviceID: toInteger(row.DeviceID)})
MERGE (device)-[:CONNECTS_TO{low:toInteger(row.Low), high:toInteger(row.High), size:toInteger(row.Size)}]->(subnet);

I think the source of the problem is related to the first query (the one that just counts), as it clearly runs a lot slower when given slightly more lines. I assume the CSV is processed line by line, so I expected the time to grow linearly with the number of lines, but clearly it does not.


(M. David Allen) #2

What is the relationship-creating Cypher that has the performance problem? Please include that code sample too.

Do you need to match the relationship between the two items you're already matching by ID? Could it ever occur that the nodes with those IDs exist but are not already linked? Since you're matching a relationship with no type and no direction, and not binding it to a variable, that relationship match seems unnecessary.
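
A minimal sketch of the counting query with the relationship pattern dropped, assuming the two ID lookups are all that is needed. The file path, labels, and properties are copied from the original post. Note that this counts rows for which both nodes exist, not rows for which they are already connected, so the result can differ from the original query.

```cypher
PROFILE
LOAD CSV WITH HEADERS FROM "file:/home/Elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
// Two independent lookups by ID; no relationship pattern to expand.
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})
MATCH (device:Device {deviceID: toInteger(row.DeviceID)})
RETURN count(*);
```

Comparing the PROFILE output of this version with the original should show whether the relationship expansion is where the time goes.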

One small hint you can give to Cypher: if you know there are fewer subnets than devices (I'm guessing), you can do something like this:

MATCH (subnet:Subnet { subnetID: toInteger(row.SubnetID) })
WITH subnet, row
MATCH (subnet)-[]-(device:Device {deviceID: toInteger(row.DeviceID)})
(...)

This may change the plan by forcing the more selective match first.


(Eladpr) #3

Thanks David,
I added to my post the actual import used.