Import file that contains many duplicates never finishes

oleksii-novikov · October 5, 2018, 11:56am

Hi there! I have a problem importing data. I have imported really huge datasets without any problems using the neo4j-admin import tool, but I've run into an issue while importing one particular dataset.
The dataset contains only two columns: an id and a language code.
Here is a sample of that file: http://joxi.ru/J2byBXpcXLyqbm

Here is the header file content: :IGNORE languageCode:ID(LanguageCode-ID), so the first field is ignored and the second field is used as the ID.
Here is PaperLanguages.txt:

847234    en
283432    fr
344533    en

Here is import.conf

--nodes:LanguageCode "PaperLanguages-header.txt,./src/PaperLanguages.txt"
--delimiter \9
--database test
--ignore-extra-columns
--quote \0
--high-io true
--id-type STRING
--ignore-missing-nodes true
--ignore-duplicate-nodes true

The import goes very fast at first, but then it stops and never finishes.
It hangs at this stage (the number of done batches has not changed for hours):

        Prepare node index
        [*SORT----------------------------------------------------------------------------------------] 109M
        Memory usage: 2.86 GB
        Duration: 46m 29s 423ms
        Done batches: 10911

.......... .......... .......... .......... ..........   5% ∆36m 14s 870ms
.......... .......... .......... .......... ..........  10% ∆2ms
.......... .......... .......... .......... ..........  15% ∆0ms
.......... .......... .......... .......... ..........  20% ∆0ms
.......... .......... .......... .......... ..........  25% ∆0ms
.......... .......... .......... .......... ..........  30% ∆1ms
.......... .......... .......... .......... ..........  35% ∆0ms
.......... .......... .......... .......... ..........  40% ∆0ms
.......... .......... .......... .......... .

I'm using Neo4j v3.4.8.

Does anybody have any idea what should be done to import this?

stefan.armbruster · October 5, 2018, 8:18pm

I've observed that --ignore-duplicate-nodes true can cause performance issues with the importer. My strategy is to use external tooling (Unix text tools or something fancier) to make sure the input has no duplicate nodes before importing.
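
To get a feel for how many duplicates a file contains before importing it, something along these lines could work. It is only a minimal sketch: it assumes the file is tab-separated with the ID in the second column, as in the question, and the temp directory and three sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample standing in for PaperLanguages.txt
# (assumed layout: tab-separated, id in column 1, language code in column 2).
workdir=$(mktemp -d)
input="$workdir/PaperLanguages.txt"
printf '847234\ten\n283432\tfr\n344533\ten\n' > "$input"

# Count how often each value of column 2 (the node ID) occurs, most frequent first.
cut -f2 "$input" | sort | uniq -c | sort -rn

# Compare the total row count with the number of distinct IDs.
total_rows=$(wc -l < "$input" | tr -d ' ')
unique_ids=$(cut -f2 "$input" | sort -u | wc -l | tr -d ' ')
echo "rows: $total_rows, unique IDs: $unique_ids"
```

If the number of unique IDs is far below the number of rows, deduplicating the file before the import is probably worth it.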

oleksii-novikov · October 8, 2018, 4:14pm

Thanks! It looks like you're right. I've prepared data using

sort -u -k2,2  PaperLanguages.txt > PaperLanguages-normalized.txt

After that, only 80 unique rows remained, so the import finished in less than one second.
Without this preparation, the import ran for more than 2 days without success.
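
For anyone who needs to keep the original row order, an awk filter may be an alternative to sort -u. This is just a sketch under the same assumptions (tab-separated file, ID in the second column); the sample rows and temp directory are invented for the example. It keeps the first row seen for each ID and drops later repeats.

```shell
#!/bin/sh
# Hypothetical sample standing in for PaperLanguages.txt (tab-separated).
workdir=$(mktemp -d)
input="$workdir/PaperLanguages.txt"
output="$workdir/PaperLanguages-normalized.txt"
printf '847234\ten\n283432\tfr\n344533\ten\n' > "$input"

# seen[$2]++ is 0 (false) the first time a value of column 2 appears,
# so !seen[$2]++ is true only for the first row with that ID.
awk -F '\t' '!seen[$2]++' "$input" > "$output"

cat "$output"
```

Unlike sort, this holds every distinct ID in memory, which is fine for a handful of language codes but is something to keep in mind for files with many distinct IDs.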
