ETL Tool vs APOC vs some other tools?

(I am cross-posting this question in the neo4j-users Slack #neo4j-etl channel.)

Hello everyone! I'm wondering whether some of you might be able to help me decide whether it makes more sense to use neo4j-ETL or APOC (or maybe some other option I haven't heard of?) to keep the MySQL and Neo4j databases on my website synchronized.

My site is currently in alpha, but the landing page is available: OpenBazaar.OnTheBlockchain.com and one of the pages where I use neo4j graph visualization is also active: OpenBazaar.OnTheBlockchain.com/index.php/settings/grapevine

The site is hosted remotely. My MySQL database schema is relatively complex, with multiple tables: two tables for two different types of users, one table for listings (it is an ecommerce site and much of my data is scraped from a peer-to-peer network of vendors, their listings, and buyers), several tables for different types of ratings of users, a table for ratings of listings, and a table for posts.

My hope is that whenever the SQL db is updated -- new users signed up on my site, new users or listings scraped from the p2p network (currently over one per hour and increasing), new ratings placed on my site -- the graph db will also be updated in real time. I may also want to run some cronjobs or scripts to check periodically (daily, perhaps) for any discrepancies and make repairs.

Currently there are about 60,000 users and several hundred thousand listings. The actual sales and transactions are handled on the p2p network, not on my site, which will serve as a search engine and a place to record reviews of vendors and products.

So far I have run the ETL Tool successfully in Neo4j Desktop on my laptop and have connected to my database (actually a much smaller version of it) remotely, but I am far from an ETL expert and still not sure of everything it can do. I haven't yet looked into what APOC tools are available, although I have read that APOC is another option. The most recent discussion of these tools that I can find is a Stack Overflow post by William Lyon from 2017 [1], and I am sure things have changed since then.

Any advice on where my next deep dive should be -- ETL, APOC or somewhere else? Thank you to everyone in advance!

David

[1] https://stackoverflow.com/questions/45902827/migrate-sql-server-data-to-neo4j-and-keep-data-in-sync-for-both-the-databases

I think it makes more sense to use a script with APOC, or a client-side script, if you need fine-grained control and a custom import.
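For the APOC route, `apoc.load.jdbc` can pull rows straight from MySQL into Cypher and upsert them with `MERGE`. A minimal sketch -- the JDBC URL, credentials, table, column names, and the `updated_at` timestamp column are all assumptions about David's schema, not anything from his actual site:

```cypher
// Assumes the MySQL JDBC driver jar is in Neo4j's plugins directory.
// Hypothetical connection string and schema -- adjust to your setup.
// $lastSync is a parameter holding the timestamp of the previous sync run.
CALL apoc.load.jdbc(
  'jdbc:mysql://localhost:3306/mydb?user=sync_user&password=secret',
  'SELECT id, name, updated_at FROM users WHERE updated_at > ?',
  [$lastSync]
) YIELD row
MERGE (u:User {id: row.id})
  SET u.name = row.name,
      u.updatedAt = row.updated_at;
```

A cron job or application script could run a statement like this per table, passing the last successful sync time as the parameter.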

The ETL tool is better suited for an initial import, although you can also use the command-line version repeatedly with a mapping file that was generated once.
And if you edit the mapping file, you can even change the generated SQL statements to check an update timestamp or a to-sync flag.

The command-line ETL tool also has a mode that uses cypher-shell to execute a script that loads the data; you could pick up that script, which is saved into the import directory of your database.
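For the periodic re-sync David mentioned, a crontab entry could feed that generated script to cypher-shell. A sketch, with hypothetical script path and credentials:

```
# Hypothetical path and credentials -- re-run the generated import script daily at 03:00
0 3 * * * cat /var/lib/neo4j/import/import_tool.cypher | cypher-shell -u neo4j -p secret
```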


Thanks Michael. After playing with ETL, I was thinking it's not fine-grained enough for my needs, so APOC is probably what I will investigate next.

I just ran across Neo4j Streams in the user guide. I had never heard of Kafka before, so I'm reading about it and about data streaming vs. batch-oriented data processing. It looks like I could use Neo4j Streams to stream data from MySQL to my Neo4j database, so Kafka/Streams might be an alternative to APOC (or my own custom scripts). I'm guessing I might pick Kafka/Streams over APOC if I anticipate high data throughput and/or need updates to be pretty much instantaneous; I'm also guessing Neo4j Streams and Kafka would be a bigger learning curve and overall pain in the ass to implement than APOC batch processing. Am I understanding things correctly? Any other tradeoffs I may be missing?

David

That's also a really good idea. You can send the data from MySQL (and other sources) via Debezium, and then decide on the Neo4j side how you want to integrate it into your graph.

Kafka setup can be done via Docker Compose or Confluent Cloud.
Then you can run the Neo4j Kafka plugin in the Kafka Connect middleware.
It can map data into the graph either directly, via Cypher statements, or via configuration, so that part should be pretty straightforward.
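The Cypher-statement mapping mentioned above is configured per topic on the sink connector. A sketch of a Kafka Connect sink configuration -- the property names follow the Neo4j Streams Kafka Connect plugin, but the topic name, URI, credentials, and Cypher template are hypothetical, and you should check them against the current plugin documentation:

```json
{
  "name": "Neo4jSinkConnector",
  "config": {
    "connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
    "topics": "mysql.mydb.users",
    "neo4j.server.uri": "bolt://neo4j:7687",
    "neo4j.authentication.basic.username": "neo4j",
    "neo4j.authentication.basic.password": "secret",
    "neo4j.topic.cypher.mysql.mydb.users": "MERGE (u:User {id: event.id}) SET u.name = event.name"
  }
}
```

Here each Debezium change event arriving on the topic is bound to `event` and applied to the graph by the Cypher template.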


Thanks -- I'll probably look into that (Debezium, Kafka, etc.).

You can write your own custom Java program to connect to the MySQL DB engine and import the data into Neo4j. You can plug your jar into the APOC library as a custom plugin. Then you can trigger calls to it (manually or automatically) from cron jobs or scripts of your choice.
