How do you schedule and manage data integration with Neo4j?

I'd love to hear how others are doing this. I've got several Cypher scripts that I use regularly, and I've built some of my own tools to handle things like injecting credentials/tokens to keep them secure.

I'm largely just wrapping the scripts in PowerShell, scheduling them with Windows Task Scheduler, and logging results using a node label: (:Cypherlogentry). (I know! It's certainly not very sophisticated.)
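In effect, each scheduled run just ends by writing a log node. Here's a minimal sketch of the shape of that write, shown with the Python driver (the real setup wraps cypher-shell in PowerShell, and the property names are just illustrative):

# Sketch of the run-logging write, shown with the Python driver.
# The real setup wraps cypher-shell in PowerShell; property names are illustrative.
from datetime import datetime, timezone
from neo4j import GraphDatabase

def log_run(driver, script_name, status, rows_affected):
    # One log node per scheduled run, queryable later for "what broke".
    with driver.session() as session:
        session.run(
            "CREATE (:Cypherlogentry {script: $script, status: $status, "
            "rows: $rows, ranAt: datetime($ranAt)})",
            script=script_name,
            status=status,
            rows=rows_affected,
            ranAt=datetime.now(timezone.utc).isoformat(),
        )

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
log_run(driver, "nightly_import.cql", "SUCCESS", 42)
driver.close()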

As the data I'm working with grows, it's getting challenging to manage all my schedules (what's working, what got broken, etc.).

How are others managing imports/exports and their Neo4j/Cypher processes?

I've explored some of the third-party ETL/data-management tools, but sometimes I think they add complexity rather than helping me manage the ETL processes I've already designed.

Do I just need to buckle down and learn a tool like Pentaho Kettle?

You might be at that point you mentioned of needing to learn an ETL tool. Don't worry, they're not that scary, and they actually make life a lot easier because they're designed for process flow, step-completion dependencies, etc. Pentaho is fairly easy to pick up and run. Apache Airflow is a newcomer to the game. There are dozens out there to choose from. At home I run Pentaho because it's free and integrates easily with the Neo4j JDBC driver. I have a couple of articles on my personal blog if you need any help getting set up.


Hi Paul,

I mostly use Apache Airflow to ingest data or build new Neo4j databases from scratch. If you need help, feel free to ask.

Cheers Kris


Thanks for the starter info!

I've got Pentaho running, with my Advantage Database and Neo4j connections set up.
The query from Advantage is working, and I'm feeding its parameters into the "Neo4j Cypher" step.
I'm just using the output to create a simple node & relationship:

MATCH (r:Recordstatus{state:$ACCOUNTSTATUS})
MERGE (a:Company {acctrecid:$RECID})
MERGE (a)-[:ACCOUNT_STATE]->(r)

When I preview, the query works fine, but the Cypher step fails without much helpful information on what I did wrong.

"Neo4j Cypher.0 - ERROR (version 8.3.0.0-371, build 8.3.0.0-371 from 2019-06-11 11.09.08 by buildguy) : Unexpected error"

How do you figure out what it doesn't like about the way you've written your transform? It is aggressively vague! ;)

I'd prefer to write Cypher (it's more familiar to me). I had an old example working using "Graph Output", but the additional complexity of the models and mappings makes that method more challenging for me to use.

Thanks!

Made some progress, and now I'm trying to do some more advanced Cypher within Kettle.

I am trying to use the Neo4j Cypher Kettle plugin (@matt.casters) so I can turn the parameters into a map that I can UNWIND (and avoid the challenge of indexed parameters used one at a time in exact order).

I've selected "Collect parameter values as map"
Name of values map list: companies.

If I simply do a test like this:

UNWIND $companies as c
RETURN c.RECID,c.COMPANY
;

I get 683 BLANK rows returned... The source is a simple Table Input (JDBC db query), and the preview there shows all the data as I expect...

If I use a MERGE statement, it errors out, complaining about null values...

Any ideas?

I am super happy with NiFi and Neo4j. You should take a look.

Hi Kris,

I'd love to get some details on how you use Airflow to ingest data into Neo4j. There seem to be a few attempts at hooks, but the details are sketchy. Please share if you can.

thanks
mark

@maquinsland We basically use a combination of the KubernetesPodOperator and the PythonOperator. We created a Python package that acts as a wrapper around Neo4j. It can convert pandas DataFrames into Cypher statements and uses the Bolt protocol to ingest the data into Neo4j.
Then we use our KubernetesPythonOperator (which is the KubernetesPodOperator with a Python callable) to spin up a Neo4j Docker image pointing to a persistent volume for the data. The Python function loads and inserts the data. The next Airflow task can do the same with a different file/dataset.
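At its core, the wrapper just turns each DataFrame into a parameter list and UNWINDs it server-side. A minimal sketch of that idea (not our actual package; the MERGE, column names, and batch size are illustrative):

# Minimal sketch of the DataFrame-to-Cypher idea, not our actual package.
# The MERGE statement, column names, and batch size are illustrative.
import pandas as pd
from neo4j import GraphDatabase

def ingest_dataframe(driver, df, batch_size=1000):
    # Ship rows as one parameter list per batch and UNWIND them server-side.
    cypher = (
        "UNWIND $rows AS row "
        "MERGE (c:Company {acctrecid: row.RECID}) "
        "SET c.name = row.COMPANY"
    )
    rows = df.to_dict("records")  # list of {column: value} dicts
    with driver.session() as session:
        for start in range(0, len(rows), batch_size):
            session.run(cypher, rows=rows[start:start + batch_size])

driver = GraphDatabase.driver("bolt://neo4j-host:7687", auth=("neo4j", "secret"))
ingest_dataframe(driver, pd.read_csv("companies.csv"))
driver.close()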


Hi Kris,

I am new to Neo4j and Airflow, and I am working on the same kind of task: ingesting data into Neo4j.
It would be very helpful if you could share some code.

Thanks in advance.
Karthik

Hi All,

This is a great conversation. I'm also at a point where I need to begin scaling my graph database in order to efficiently update millions of nodes and relationships. I wrote a couple of Python scripts that pull data from two different sources, clean it up, and transform it into a neo4j-admin import compliant format. The management team at my company is now coming up with different ideas to add more data to our graph, so the number of nodes and relationships will continue to grow.
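For context, my transform step essentially boils down to emitting CSVs that follow the import tool's header conventions. Roughly like this (column names and file names are illustrative, not my actual scripts):

# Rough sketch of shaping cleaned data into neo4j-admin import CSVs.
# Column names and file names are illustrative, not my actual scripts.
import pandas as pd

df = pd.read_csv("cleaned_source.csv")  # data already pulled and cleaned

# Nodes file: ':ID' marks the unique identifier, ':LABEL' the node label.
nodes = pd.DataFrame({
    "companyId:ID": df["company_id"],
    "name": df["company_name"],
    ":LABEL": "Company",
})
nodes.to_csv("companies.csv", index=False)

# Relationships file: start/end values must match ':ID' values above.
rels = pd.DataFrame({
    ":START_ID": df["company_id"],
    ":END_ID": df["parent_id"],
    ":TYPE": "SUBSIDIARY_OF",
})
rels.to_csv("subsidiary_of.csv", index=False)

# Then, with the database offline:
#   neo4j-admin import --nodes=companies.csv --relationships=subsidiary_of.csv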

Kettle, Apache Airflow, and NiFi are some of the tools suggested in this thread. If I were to spend this weekend drilling down on any one of these tools, which one would you all recommend?

I'm leaning towards Apache Airflow based on what I hear from the industry and how popular it has become. I hadn't heard of Kettle or NiFi until this month.

What do you all think is the best approach?

Thanks,
Tony

Tony, what did you end up going with?

Since I first wrote this post, I've spent quite a bit of time with Kettle (Pentaho). It's worked quite well so far, and I find it makes transformations/jobs easier to share and collaborate on, as it forces you to build workflows that are easier to follow logically.

I was also able to build some fairly complex WebAPI integrations/ingestions into the graph database.

I really need to build some demos and post back here more often; there don't seem to be a lot of in-depth Neo4j/Pentaho examples available, and I could probably contribute to that now.

@krisgeus Super interested in learning about using Airflow DAGs for Neo4j ingestion. Can you point me to any blogs/docs?

Basically, my data has lots of different dependencies where the order of ingestion matters, and I'd like to schedule these sync jobs intelligently.
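Roughly what I'm picturing is a DAG that enforces that order through task dependencies, something like this (task names and the callable are hypothetical, not a real pipeline):

# Hypothetical Airflow DAG: enforce ingestion order with task dependencies.
# Task names and the callable are illustrative, not a real pipeline.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(dataset):
    print(f"ingesting {dataset} into Neo4j")  # real ingest work goes here

with DAG(
    dag_id="neo4j_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_companies = PythonOperator(
        task_id="load_companies", python_callable=ingest,
        op_kwargs={"dataset": "companies"})
    load_accounts = PythonOperator(
        task_id="load_accounts", python_callable=ingest,
        op_kwargs={"dataset": "accounts"})
    load_relationships = PythonOperator(
        task_id="load_relationships", python_callable=ingest,
        op_kwargs={"dataset": "relationships"})

    # Relationships load only after both node datasets have landed.
    [load_companies, load_accounts] >> load_relationships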