I'm loading data from Azure SQL Server into Neo4j, which is hosted in Docker on a Linux server. One of the tables, which contains ~7 million rows, cannot be loaded in one run and keeps causing connection timeout errors.
Can someone help with configuring the timeout for the JDBC connection?
Below are my connection string and import query:
:param azure_url => 'jdbc:sqlserver://SERVER;databaseName=DATABASE;user=User;password=PW';
CALL apoc.periodic.iterate('
  CALL apoc.load.jdbc($azure_url, "select distinct m.person_id, m.measurement_id, m.measurement_concept_id, m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id > 200000000")
  YIELD row
','
  WITH row
  WHERE row.unit_concept_id IS NOT NULL
    AND row.value_as_number IS NOT NULL
  MATCH (p:Person {person_id: row.person_id})
  MATCH (m:Measurement {measurement_concept_id: row.measurement_concept_id})
  MERGE (p)-[r:HAS_MEASUREMENT {measurement_id: row.measurement_id}]->(m)
  ON CREATE SET
    r.measurement_date = row.measurement_date,
    r.unit_concept_id = row.unit_concept_id,
    r.unit_concept_name = row.unit_concept_name,
    r.value_as_number = row.value_as_number
', {batchSize: 100000, iterateList: true, parallel: false, params: {azure_url: $azure_url}, retries: 5});
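For reference on the timeout question itself: Microsoft's JDBC driver for SQL Server accepts timeout settings as extra properties on the connection URL. As a sketch with placeholder values to adjust for your setup (loginTimeout is in seconds, socketTimeout in milliseconds; check the property list for your driver version):

:param azure_url => 'jdbc:sqlserver://SERVER;databaseName=DATABASE;user=User;password=PW;loginTimeout=60;socketTimeout=7200000';

socketTimeout defaults to 0 (wait forever), so a finite value makes a dead connection fail fast instead of hanging; shrinking batchSize so each batch finishes sooner is the other usual lever.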
apoc.load.jdbc is a bit difficult to control, and hard to resume when a run fails.
There are two options:
1. Use Apache Hop. It can stream the data from your database and write to Neo4j without timeouts.
2. If you are comfortable with Python, you can export the data to a CSV file on the local file system and use the Python ingest utility pyingest to write it to Neo4j.
The advantage of the second approach is that you can restart from the point of failure instead of rerunning the whole process: you can fix the Cypher and continue. A minimal sketch of that idea follows below.
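This is not pyingest itself (pyingest is driven by a YAML config), just a minimal sketch in the same spirit, assuming the official neo4j Python driver 5.x, a Bolt endpoint at localhost:7687, and a CSV whose headers match the columns of the SQL query above; the checkpoint file is what lets a rerun resume instead of restart:

import csv
from itertools import islice

from neo4j import GraphDatabase  # pip install neo4j (driver 5.x assumed)

URI = "bolt://localhost:7687"          # assumption: Bolt port published by the Docker container
AUTH = ("neo4j", "password")           # assumption: replace with real credentials
CSV_FILE = "measurement.csv"           # assumption: export with the same headers as the SQL query
CHECKPOINT = "measurement.checkpoint"  # remembers how many rows are already committed
BATCH_SIZE = 10_000

CYPHER = """
UNWIND $rows AS row
MATCH (p:Person {person_id: toInteger(row.person_id)})
MATCH (m:Measurement {measurement_concept_id: toInteger(row.measurement_concept_id)})
MERGE (p)-[r:HAS_MEASUREMENT {measurement_id: toInteger(row.measurement_id)}]->(m)
ON CREATE SET r.measurement_date = row.measurement_date,
              r.unit_concept_id = toInteger(row.unit_concept_id),
              r.unit_concept_name = row.unit_concept_name,
              r.value_as_number = toFloat(row.value_as_number)
"""

def load_checkpoint():
    try:
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or "0")
    except FileNotFoundError:
        return 0

def save_checkpoint(rows_done):
    with open(CHECKPOINT, "w") as f:
        f.write(str(rows_done))

def main():
    done = load_checkpoint()
    with GraphDatabase.driver(URI, auth=AUTH) as driver, open(CSV_FILE, newline="") as f:
        reader = csv.DictReader(f)
        for _ in islice(reader, done):  # skip rows committed by a previous run
            pass
        with driver.session() as session:
            while True:
                batch = list(islice(reader, BATCH_SIZE))
                if not batch:
                    break
                session.execute_write(lambda tx: tx.run(CYPHER, rows=batch).consume())
                done += len(batch)
                save_checkpoint(done)   # advance only after the batch is committed
                print(f"{done} rows committed")

if __name__ == "__main__":
    main()

Because the checkpoint only advances after a batch commits, and the write is a MERGE, a rerun after a crash replays at most one batch and stays idempotent.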
I haven't tried Apache Hop yet but will look into it. As for CSV, I don't think it can store more than ~1 million rows, so I might need to use TXT (not sure whether Neo4j supports it) or JSON.
I think one can export CSVs from SQL Server in manageable volumes and write a custom ETL pipeline to ingest the data into Neo4j, rather than relying on ready-made open-source libraries until you are sure of their design goals. A rough sketch of such an export is below.
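To make that concrete, here is a hedged sketch of the export side, assuming pyodbc with an installed "ODBC Driver 18 for SQL Server" and the same omop.measurement table as above. It pages on measurement_id (keyset pagination), so each chunk query stays cheap even deep into the 7 million rows, every CSV file stays a manageable size, and a failed export can restart from the last id printed:

import csv

import pyodbc  # pip install pyodbc; needs an ODBC driver for SQL Server installed

CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"  # assumption: adjust to your installed driver
    "SERVER=SERVER;DATABASE=DATABASE;UID=User;PWD=PW"
)
CHUNK_ROWS = 500_000  # rows per CSV file; tune to taste

COLUMNS = ["person_id", "measurement_id", "measurement_concept_id",
           "measurement_date", "unit_concept_id", "unit_concept_name",
           "value_as_number"]
ID_IDX = COLUMNS.index("measurement_id")

# Keyset pagination: each query picks up where the previous chunk ended.
QUERY = f"""
SELECT TOP ({CHUNK_ROWS}) {', '.join(COLUMNS)}
FROM omop.measurement
WHERE measurement_id > ?
ORDER BY measurement_id
"""

def export_chunks(start_id=200_000_000):
    last_id, part = start_id, 0
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        while True:
            rows = cursor.execute(QUERY, last_id).fetchall()
            if not rows:
                break
            with open(f"measurement_{part:04d}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(COLUMNS)
                writer.writerows(rows)
            last_id = rows[-1][ID_IDX]  # resume point if the export dies here
            part += 1
            print(f"measurement_{part - 1:04d}.csv written, up to measurement_id {last_id}")

if __name__ == "__main__":
    export_chunks()

Paging on the indexed key rather than OFFSET keeps each chunk query cheap regardless of how far into the table the export has progressed.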
@mengjiakang2017 neo4j-admin import will be faster than apoc.load.jdbc, but it can only operate on a new/empty database: you cannot run neo4j-admin import against an existing, populated database, and it must be run while the database is offline. apoc.load.jdbc behaves in exactly the opposite manner.
Yes, I agree that APOC is great for updating a database while neo4j-admin import works well for building one from the ground up. I haven't checked Apache Hop yet, but based on Anthapu's suggestion it might be a good candidate for large imports.