I want to discuss an ETL question: I am moving data from PostgreSQL to Neo4j using a Databricks Spark DataFrame, but it is taking a lot of time. Which approach would be more suitable?


Is this (A) a one-time ETL job, or rather (B) a continuous synchronisation?

For both: do as much of the transformation and de-duplication as possible in Databricks, to minimize redundant work on the database side. You are still writing to a database with transaction overhead, so find the right batch size and don't expect any magic numbers. If you create 50-100k nodes per second, that is reasonable; if you are way below that, you are probably missing a node key constraint/index or have some other issue. A rough sketch follows below.
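A minimal PySpark sketch of that pattern, assuming the Neo4j Connector for Apache Spark is installed on the cluster. Hostnames, credentials, the `customers` table, the `Customer` label and the `customer_id` key are placeholders; check the option names against your connector version.

```python
# Read the source table from PostgreSQL via JDBC (placeholders throughout).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<pg-host>:5432/<db>")
    .option("dbtable", "public.customers")
    .option("user", "<pg-user>")
    .option("password", "<pg-password>")
    .load()
)

# Do the heavy lifting (de-duplication, cleanup) on the Spark side,
# so Neo4j only has to apply clean rows.
clean = df.dropDuplicates(["customer_id"]).select("customer_id", "name", "email")

# Before writing, create a uniqueness/node key constraint in Neo4j, e.g.:
#   CREATE CONSTRAINT customer_id IF NOT EXISTS
#   FOR (c:Customer) REQUIRE c.customer_id IS UNIQUE;

(
    clean.write.format("org.neo4j.spark")
    .mode("Overwrite")                                # upsert by node.keys
    .option("url", "neo4j://<neo4j-host>:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "<password>")
    .option("labels", ":Customer")
    .option("node.keys", "customer_id")
    .option("batch.size", 10000)                      # tune; no magic number
    .save()
)
```

With the constraint in place, each MERGE is an index lookup instead of a label scan, which is usually the difference between hundreds and tens of thousands of nodes per second.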

For (B), depending on the other workload on the database, you may want to optimize for stability instead of speed/throughput, i.e. reduce the batch size and reduce concurrency, as in the sketch below.
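An illustrative variant for the continuous-sync case: fewer concurrent writers and smaller batches keep lock contention and transaction pressure low while other workload runs against the database. `incremental_df` and all values are placeholders, not recommendations.

```python
(
    incremental_df.coalesce(2)                        # fewer concurrent transactions
    .write.format("org.neo4j.spark")
    .mode("Overwrite")
    .option("url", "neo4j://<neo4j-host>:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "<password>")
    .option("labels", ":Customer")
    .option("node.keys", "customer_id")
    .option("batch.size", 2000)                       # smaller batches for stability
    .save()
)
```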