cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! Site migration is underway. Expect disruption to service on Thursday, Feb. 9!

Apoc.load.jdbc data import connection time out

mengjiakang2017
Node Clone

Hello all,

I'm loading data from Azure SQL server to neo4j hosted on docker in a Linux server. one of the table that includes ~7 million rows cannot be loaded in one run and kept causing connection timeout issue.
can someone help on configuring the timeout during the JDBC connection?

below is my connection string and import query -

:param azure_url => 'jdbc:sqlserver://SERVER;databaseName=DATABASE;user=User;password=PW';

CALL apoc.periodic.iterate('call apoc.load.jdbc($azure_url, "select distinct m.person_id, m.measurement_id, m.measurement_concept_id,m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id > 200000000")

yield row

','

with row

where row.unit_concept_id is not null

and row.value_as_number is not null

match (p:Person {person_id:row.person_id})

match (m:Measurement {measurement_concept_id: row.measurement_concept_id})

merge (p)-[r:HAS_MEASUREMENT {measurement_id:row.measurement_id}]->(m)

on create

set

r.measurement_date=row.measurement_date,

r.unit_concept_id=row.unit_concept_id,

r.unit_concept_name=row.unit_concept_name,

r.value_as_number=row.value_as_number

',{batchSize:100000, iterateList:True, parallel:False, params:{azure_url:$azure_url}, retries:5

}

);
1 ACCEPTED SOLUTION

Thank you all for the nice suggestions!

just wanted to update that this was solved by testing using neo4j-admin import - Import - Operations Manual

loaded super fast compared to apoc.load.jdbc or apoc.load csv as below

IMPORT DONE in 29s 560ms. 
Imported:
  1944 nodes
  12987721 relationships
  45588552 properties
Peak memory usage: 76.06MiB

View solution in original post

7 REPLIES 7

anthapu
Graph Fellow

apoc.load.jdbc is bit difficult to control and continue in case of failure.

There are 2 options.

  1. Use apache hop. It can stream the data from your database and write to neo4j without timeouts..
  2. If you comfortable with Python, you can export the csv file to local file system and use this Python ingest utility pyingest to write the data to neo4j.

Advantage of this approach is that you can start from the place of failure so that you don't need to restart the whole process. You can fix the cypher and continue.

Thanks a lot @anthapu for the suggestions.

I haven't tried apache hop yet but will look into that. as for csv, I don't think it can store more than ~1million rows so might need to use txt (not sure if neo4j supports) or json.

You could load json into neo4j.
The Python ingest utility GitHub - neo4j-field/pyingest can ingest JSON also into neo4j

I think one can export csv in sql server in manageable volumes and write a cutom ETL pipeline to ingest data into Neo4j rather than relying on ready made open source libraries until you are sure of its design goals.

Thank you all for the nice suggestions!

just wanted to update that this was solved by testing using neo4j-admin import - Import - Operations Manual

loaded super fast compared to apoc.load.jdbc or apoc.load csv as below

IMPORT DONE in 29s 560ms. 
Imported:
  1944 nodes
  12987721 relationships
  45588552 properties
Peak memory usage: 76.06MiB

@mengjiakang2017
neo4j-admin import will be faster than apoc.load.jdbc but neo4j-admin import can only operate on a new/empty database. You can not run neo4j-admin import against an existing/populated database. It is also run against a offline database. Whereas apoc.load.jdbc behaves in the exact opposite manner

Yes. I agree that apoc is great for database updates while neo4j-admin works well for database building from the ground. I haven't checked apache hop but it might be a good candidate for large data import based on Anthapu's suggestion.