I want to understand the logic for neo4j-etl cli export

takuya · February 7, 2020, 12:42pm

Hi there.
I tried neo4j-etl cli tool V1.3.1.

This is very fast for importing only nodes, but it seems to be very slowly when relations import.

Is it check the connection target table in the process of generating the CSV?

Details are as follows,

Environment

Neo4j 3.5.11
Neo4j ETL Tool V1.3.1
PostgreSQL
Server Memory 8g

Situation

RDB(PostgreSQL) has one table for node and relation table like under ER.
There are 3 sets it same structural tables and actually more columns has it.

I edited the mapping.json like this.

[ {
  "name" : "NODE1_db1",
  "schema" : "db1",
  "graph-object-type" : "Node",
  "sql" : "SELECT id FROM db1.company AS company",
  "mappings" : [ {
    "column" : {
      "type" : "CompositeColumn",
      "table" : "company",
      "schema" : "db1",
      "role" : "PrimaryKey",
      "columns" : [ {
        "type" : "SimpleColumn",
        "role" : "Data",
        "table" : "company",
        "schema" : "db1",
        "name" : "id",
        "alias" : "id",
        "sql-data-type" : "INTEGER",
        "column-value-selection-strategy" : "SelectColumnValue"
      } ]
    },
    "field" : {
      "type" : "Id",
      "name" : "",
      "id-space" : "db1.company"
    }
  }]}, {
  "name" : "Relation_db1",
  "schema" : "db1",
  "graph-object-type" : "Relation",
  "sql" : "SELECT start_id, end_id, relation_type FROM (SELECT start_id, end_id, relation_type FROM db1.company AS start_company JOIN db1.relation AS relation ON start_company.id = relation.start_id JOIN db1.company AS end_company ON end_company.id = relation.end_id) AS company_relation",
  "mappings" : [ {
    "column" : {
      "type" : "SimpleColumn",
      "role" : "Data",
      "table" : "company",
      "schema" : "db1",
      "name" : "start_id",
      "alias" : "start_id",
      "sql-data-type" : "INTEGER",
      "column-value-selection-strategy" : "SelectColumnValue"
    },
    "field" : {
      "type" : "startId",
      "name" : "start_id",
      "id-space" : "db1.company_relation"
    }},{
    "column" : {
        "type" : "SimpleColumn",
        "role" : "Data",
        "table" : "company",
        "schema" : "db1",
        "name" : "end_id",
        "alias" : "end_id",
        "sql-data-type" : "INTEGER",
        "column-value-selection-strategy" : "SelectColumnValue"
    },
    "field" : {
        "type" : "endId",
        "name" : "end_id",
        "id-space" : "db1.company_relation"
    }}, {
      "column" : {
        "type" : "SimpleColumn",
        "role" : "Data",
        "table" : "company",
        "schema" : "db1",
        "name" : "relation_type",
        "alias" : "relation_type",
        "sql-data-type" : "INTEGER",
        "column-value-selection-strategy" : "SelectColumnValue"
      },
      "field" : {
        "type" : "RELATION_TYPE",
        "name" : "relation_type",
        "id-space" : "db1.company_relation"
      }
  }]
  }
]

The number of result records is 700,000 nodes and 2,500,000 relations.

Result

10300s, about 3 hours..

When I deleted the description of Relation and tried to execute it only on the node, it ended in about 20 seconds. Searching the RDB with SQL alone takes only a few seconds, so I think that there is something in the Neo4j ETL logic.

please let me know if you have any informations
Thanks.

Thomas_Silkjaer · February 7, 2020, 12:52pm

Importing relationships is very very slow. When creating a relationship between two nodes, the database locks both nodes. So it is not even possible to speed it up with parallel processing unless it can be guaranteed that no two processes run at the same time, is making a relationship with an overlapping node.

I don't have any experience with the ETL tool, but if you are injecting data in a new database, I would definitely look into using the neo4j-admin import tool and make exports from the PostgreSQL database to CSV, preprocess it with a scripting language of your choice (make sure there are no duplicates, escape quotes, write csv files as documented …), and then run the import.

takuya · February 7, 2020, 1:09pm

Lock!!!! That's the point!

As you say, admin-import is very fast it alone, so I'm going to go that way CSV export from PostgreSQL.

Your explanation was really helpful!
Thanks a lot

Topic		Replies	Views
ETL for loading csv ETL-Tool	1	416	July 11, 2020
Import records from multiple db's at once ETL-Tool etl , export , import , neo4j-import	3	465	November 22, 2021
Analysis of big data and use of ETL ETL-Tool	11	1187	August 26, 2019
Import problems in neo4j Neo4j Graph Platform migrated	14	174	October 2, 2022
Help me merge 170M relationships with LOAD CSV Cypher load-csv	10	3639	October 23, 2019

July Summer Fun!

I want to understand the logic for neo4j-etl cli export

Environment

Situation

Result

Related topics