Import file that contains many duplicates never finishes


(Oleksii Novikov) #1

Hi there! I have a problem importing data. I have managed to import really huge datasets without any trouble using the neo4j-admin import tool, but I've run into an issue while importing one particular dataset.
The dataset contains only two types of values: an id and a language code.
Here is a sample of that file: http://joxi.ru/J2byBXpcXLyqbm

Here is the header file content: :IGNORE languageCode:ID(LanguageCode-ID), so the first field is ignored and the second field is treated as the ID.
Here is PaperLanguages.txt

847234    en
283432    fr
344533    en

Here is import.conf

--nodes:LanguageCode "PaperLanguages-header.txt,./src/PaperLanguages.txt"
--delimiter \9
--database test
--ignore-extra-columns
--quote \0
--high-io true
--id-type STRING
--ignore-missing-nodes true
--ignore-duplicate-nodes true

The import starts very fast but then stalls and never finishes.
It hangs at this stage (the number of done batches does not change for hours):

        Prepare node index
        [*SORT----------------------------------------------------------------------------------------] 109M
        Memory usage: 2.86 GB
        Duration: 46m 29s 423ms
        Done batches: 10911

.......... .......... .......... .......... ..........   5% ∆36m 14s 870ms
.......... .......... .......... .......... ..........  10% ∆2ms
.......... .......... .......... .......... ..........  15% ∆0ms
.......... .......... .......... .......... ..........  20% ∆0ms
.......... .......... .......... .......... ..........  25% ∆0ms
.......... .......... .......... .......... ..........  30% ∆1ms
.......... .......... .......... .......... ..........  35% ∆0ms
.......... .......... .......... .......... ..........  40% ∆0ms
.......... .......... .......... .......... .

I'm using Neo4j v3.4.8.

Does anybody have any idea what should be done to import this?


(Stefan Armbruster) #2

I've observed that --ignore-duplicate-nodes true can cause performance issues with the importer. My strategy is to use external tooling (Unix text tools or something fancier) to make sure the input has no duplicate nodes before it reaches the importer.
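
To check whether a file is affected before importing it, something like the following might help. It is only a sketch: count_duplicates is a made-up helper name, and it assumes tab-separated input with the ID in column 2, as in the sample above. It prints the total row count followed by the number of distinct IDs, so a large gap between the two numbers means many duplicates.

```shell
# Hypothetical helper (not part of neo4j-admin): reads tab-separated rows
# from stdin and prints "<total rows> <distinct values>" for one column.
# The column defaults to 2, where the thread's file keeps the language code.
count_duplicates() {
  col="${1:-2}"
  awk -F '\t' -v col="$col" '
    # Count every row, and count a value only the first time it appears.
    { total++; if (!($col in seen)) { seen[$col] = 1; distinct++ } }
    # "+ 0" forces numeric output even when the input is empty.
    END { print total + 0, distinct + 0 }
  '
}

# Usage (file name taken from the thread):
# count_duplicates 2 < PaperLanguages.txt
```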


(Oleksii Novikov) #3

Thanks! It looks like you're right. I prepared the data using

sort -u -k2,2  PaperLanguages.txt > PaperLanguages-normalized.txt

After that, there were only 80 unique rows, so the import finished in less than one second.
Without this preparation, the import had been running for more than 2 days without success.
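
For anyone who wants the deduplicated file to keep the original row order (sort -u reorders the rows), an awk one-liner is a possible alternative. This is a sketch under the same assumptions as the data above: tab-separated input with the ID in column 2. The function name dedupe_by_column is made up for illustration.

```shell
# Hypothetical helper: reads tab-separated rows from stdin and keeps only
# the first row seen for each distinct value in the chosen column
# (column 2 by default). Rows come out in their original order.
dedupe_by_column() {
  col="${1:-2}"
  # seen[$col]++ is 0 (false) the first time a value appears, so the
  # negation prints that row and skips every later repeat.
  awk -F '\t' -v col="$col" '!seen[$col]++'
}

# Usage (file names taken from the thread):
# dedupe_by_column 2 < PaperLanguages.txt > PaperLanguages-normalized.txt
```

Like sort -u, this keeps one arbitrary id per language code, which is fine here because the id column is ignored by the header anyway.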