Neo4j-admin import uses only one cpu core after a while


(Wallerprogramm) #1

I'm currently importing nodes and relationships from several CSV files with a total size of ~45 GB using neo4j-admin import. In the beginning all 4 CPU cores were used, but since (at least) the 55% mark of the node import (stage 1/4) only one core has been used. It has now been running for over 16 hours and is still at 60%. You can see that in the following console output:


I use Neo4j version: 3.5.2 (Neo4j Desktop Version 1.1.15).

I started the import with the following command:

./bin/neo4j-admin import \
	--mode=csv \
	--database=btctest.db \
	--nodes $HEADERS/addresses-header.csv,$DATA/addresses.csv \
	--nodes $HEADERS/blocks-header.csv,$DATA/blocks.csv \
	--nodes $HEADERS/transactions-header.csv,$DATA/transactions.csv \
	--relationships $HEADERS/before_rel-header.csv,$DATA/before_rel.csv \
	--relationships $HEADERS/belongs_to_rel-header.csv,$DATA/belongs_to_rel.csv \
	--relationships $HEADERS/receives_rel-header.csv,$DATA/receives_rel.csv \
	--relationships $HEADERS/sends_rel-header.csv,$DATA/sends_rel.csv \
	--ignore-missing-nodes=true \
	--ignore-duplicate-nodes=true \
	--multiline-fields=true \
	--high-io=true

The headers look as follows (one node header and one relationship header as examples):
node (transactions-header.csv):

txid:ID,:LABEL 

relationship (sends_rel-header.csv):

:START_ID,value,:END_ID,:TYPE

Is it normal that Neo4j uses only one CPU core after a while? And is it normal that an import with the import tool takes this long? Do you have any recommendations on how to make it faster? By the way, I use an SSD.


(Michael Hunger) #2

Hmm, usually it is quite efficient.
Do you by chance have a lot of duplicate nodes in your data?

Btw. you can pass the main label/rel-type directly on the command line.
You could try to configure less heap (e.g. 2G) and leave the rest for off-heap memory.

Can you get the "c" and "i" outputs in that stage?
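
For reference, a sketch of what those two suggestions could look like, using the `--nodes:Label` and `--relationships:TYPE` forms of the 3.x import tool. The names `Transaction` and `SENDS` are assumptions (the thread never states the actual label or type names), and `HEAP_SIZE` is the environment variable neo4j-admin reads for its JVM heap, set here to the 2G figure mentioned above. Only one node group and one relationship group are shown; the others would follow the same pattern. The command is wrapped in a function so nothing runs until it is called.

```shell
# Sketch only: Transaction and SENDS are placeholder names, not taken from the thread.
# $HEADERS and $DATA are the same directory variables as in the original command.
run_import() {
  # HEAP_SIZE caps the JVM heap for this neo4j-admin run; the importer keeps
  # most of its working data off-heap, so a small heap leaves more RAM for that.
  HEAP_SIZE=2G ./bin/neo4j-admin import \
    --mode=csv \
    --database=btctest.db \
    --nodes:Transaction "$HEADERS/transactions-header.csv,$DATA/transactions.csv" \
    --relationships:SENDS "$HEADERS/sends_rel-header.csv,$DATA/sends_rel.csv"
}
```

With the label and type supplied this way, the `:LABEL` and `:TYPE` columns should no longer be needed in those files, though that is worth checking against the import tool documentation for the version in use.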


(Wallerprogramm) #3

Only addresses.csv can contain duplicates. I'll try to remove the duplicates before importing the data.
Yes, I realised that just after I had started generating the CSV files.
"c" and "i" work in that stage. This is the output ("i"):


It seems that the index creation is the problem. Might this be because I use quite large IDs (64 characters)? If that is the case, I should probably rethink whether I really need an index on these IDs, and instead introduce smaller IDs as the actual IDs and add the 64-character IDs as unindexed properties.
Is it true that the index preparation cannot be done in parallel? This would explain why only one core is used in that stage of the import.


(Michael Hunger) #4

It does not create an index at this stage. It's probably just the de-duplication.

I'll ask the devs about it. This is really not what it should look like; your whole import should be done in a few minutes.

Also, if you look at your memory information, it seems there is not much available.


(Michael Hunger) #5

Answer from the team.

There are some special cases where the sorting in there isn't particularly optimal and only one thread gets the majority of the work.
So the solution for him would be to de-duplicate upfront with Unix tools for the time being.

Sorry for that, it's something we're going to address going forward.
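
A minimal sketch of such an upfront de-duplication, assuming duplicates are byte-identical lines, every record sits on a single line (no multi-line quoted fields), and the data file has no header row because the header lives in a separate file. The function name and the file names are hypothetical, and the small sample file only stands in for the real addresses.csv.

```shell
# Sketch only: assumes exact duplicate lines and one record per line.
dedupe_lines() {
  # LC_ALL=C makes sort compare raw bytes, which is faster and locale-independent.
  # -u keeps one copy of each distinct line; the output order is sorted, not original.
  LC_ALL=C sort -u "$1" > "$2"
}

# Tiny demonstration on a throwaway file.
tmpdir=$(mktemp -d)
printf 'addr1\naddr2\naddr1\naddr3\naddr2\n' > "$tmpdir/addresses.csv"
dedupe_lines "$tmpdir/addresses.csv" "$tmpdir/addresses-dedup.csv"
wc -l < "$tmpdir/addresses-dedup.csv"
```

If rows can differ in other columns while sharing an ID, something like `awk -F',' '!seen[$1]++'` keyed on the ID column would be closer, again assuming the ID is the first column and contains no commas; it holds all distinct IDs in memory, whereas `sort` spills to disk for large inputs.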


(Wallerprogramm) #6

Removing the duplicates in advance did it. The import now took less than 30 minutes. Thank you for your help.