I'd love to hear how others are doing this. I've got several cql scripts that I use regularly, and I've built some of my own tools to handle things like injecting credentials/tokens to keep them secure.
I'm largely just wrapping the scripts in PowerShell, scheduling them with Windows Task Scheduler, and logging results to a node label: (:Cypherlogentry). (I know! It's certainly not very sophisticated.)
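For anyone curious, the logging itself is just one CREATE per scheduled run. My wrapper is PowerShell, but the idea translates; here's a minimal sketch of the same pattern with the official Neo4j Python driver (connection details and property names are placeholders, not my actual setup):

# Minimal sketch of the run-logging pattern with the official neo4j Python driver.
# URI, credentials, and property names are placeholders for illustration.
from datetime import datetime, timezone
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def log_run(script_name, status, message=""):
    # One (:Cypherlogentry) node per scheduled run, so failures stay queryable later.
    with driver.session() as session:
        session.run(
            "CREATE (e:Cypherlogentry {script: $script, status: $status, "
            "message: $message, ranAt: datetime($ranAt)})",
            script=script_name,
            status=status,
            message=message,
            ranAt=datetime.now(timezone.utc).isoformat(),
        )

log_run("daily_company_import.cql", "SUCCESS")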
As the data I'm working with grows, it's getting challenging to manage all my schedules (what's working, what broke, etc.).
How are others managing imports/exports and their Neo4j/Cypher processes?
I've explored some of the third-party (ETL) data management tools, but sometimes I think they add complexity rather than helping me manage the ETL processes I've already designed.
Do I just need to buckle down and learn a tool like Pentaho Kettle?
You might be at that point you mentioned of needing to learn an ETL tool. Don't worry, they're not that scary, and they actually make life a lot easier because they're designed for process flow, step-completion dependencies, etc. Pentaho is fairly easy to pick up and run. Apache Airflow is a newcomer to the game. There are dozens out there to choose from. At home I run Pentaho because it's free and integrates easily with the Neo4j JDBC driver. I have a couple of articles on my personal blog if you need any help getting set up.
I've got Pentaho running and have my Advantage Database and Neo4j connections set up.
The query from Advantage is working, and I'm populating the parameters into the "Neo4j Cypher" step.
I'm just using the output to create a simple node & relationship:
MATCH (r:Recordstatus {state: $ACCOUNTSTATUS})
MERGE (a:Company {acctrecid: $RECID})
MERGE (a)-[:ACCOUNT_STATE]->(r)
The query works fine in preview, but the Cypher step fails, without much helpful information on what I did wrong.
"Neo4j Cypher.0 - ERROR (version 8.3.0.0-371, build 8.3.0.0-371 from 2019-06-11 11.09.08 by buildguy) : Unexpected error"
How do you figure out what it doesn't like about the way you've written your transform? It is aggressively vague! ;)
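One thing that does work as a sanity check is running the identical statement outside Kettle, where the server comes back with a real error message (syntax problem, missing parameter, type mismatch, and so on). A quick sketch with the Python driver; connection details and sample parameter values are placeholders:

# Run the same Cypher outside Kettle to surface the real server-side error.
# Connection details and sample parameter values are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

cypher = """
MATCH (r:Recordstatus {state: $ACCOUNTSTATUS})
MERGE (a:Company {acctrecid: $RECID})
MERGE (a)-[:ACCOUNT_STATE]->(r)
"""

with driver.session() as session:
    try:
        session.run(cypher, ACCOUNTSTATUS="ACTIVE", RECID="12345").consume()
    except Exception as exc:
        # This message is far more specific than Kettle's "Unexpected error".
        print(type(exc).__name__, exc)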
I'd prefer to write Cypher (it's more familiar to me). I had an old example working using "Graph Output", but the additional complexity of the models and mappings makes that method more challenging for me to use.
Made some progress, and now I'm trying to do some more advanced Cypher within Kettle.
I am trying to use the Neo4j Cypher Kettle plugin (@matt.casters) so I can turn the parameters into a map that I can UNWIND (and avoid the challenge of indexed parameters that have to be used one at a time, in exact order).
I've selected "Collect parameter values as map"
Name of values map list: companies.
If I simply do a test like this:
UNWIND $companies AS c
RETURN c.RECID, c.COMPANY
;
I get 683 BLANK rows returned... The source is a simple Table Input (JDBC db query), and the preview there shows all the data as I expect...
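For reference, blank columns are exactly what you'd see if $companies isn't shaped the way the UNWIND expects: a list of maps with exactly matching keys. In Cypher, reading a key that isn't in the map quietly returns null rather than erroring, and map keys are case-sensitive. A standalone check with the Python driver (placeholder connection details) shows the shape that does work:

# Demonstrates the parameter shape UNWIND needs: a LIST of MAPS.
# Placeholder connection details and sample values.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

companies = [
    {"RECID": "12345", "COMPANY": "Acme Ltd"},
    {"RECID": "67890", "COMPANY": "Globex"},
]

with driver.session() as session:
    result = session.run(
        "UNWIND $companies AS c RETURN c.RECID AS recid, c.COMPANY AS company",
        companies=companies,
    )
    for record in result:
        print(record["recid"], record["company"])

# If the maps were keyed {"recid": ...} instead, c.RECID would come back null
# for every row -- i.e. all blank columns -- because map keys are case-sensitive.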
I'd love to get some details on how you use Airflow to ingest data into Neo. There seem to be a few attempts at hooks, but the details are sketchy. Please share if you can.
@maquinsland We basically use a combination of the KubernetesPodOperator and the PythonOperator. We created a Python package that acts as a wrapper around Neo4j: it can convert pandas DataFrames into Cypher statements and uses the Bolt protocol to ingest the data into Neo4j.
Then we use our KubernetesPythonOperator (which is the KubernetesPodOperator with a Python callable) to spin up a Neo4j Docker image pointing to a persistent volume for the data. The Python function loads and inserts the data. The next Airflow task can do the same with a different file/dataset.
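We can't share the internal package, but stripped down to a plain PythonOperator the core of it looks roughly like this; the DAG name, file path, field names, and connection details are illustrative, not our actual code:

# Rough shape of the setup: an Airflow task whose callable pushes a pandas
# DataFrame into Neo4j as one batched UNWIND over the Bolt protocol.
# All names/paths/credentials here are illustrative.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from neo4j import GraphDatabase

def load_companies():
    df = pd.read_csv("/data/companies.csv")  # placeholder path
    rows = df.to_dict("records")  # list of {column: value} maps
    driver = GraphDatabase.driver("bolt://neo4j:7687", auth=("neo4j", "secret"))
    with driver.session() as session:
        session.run(
            "UNWIND $rows AS row "
            "MERGE (c:Company {acctrecid: row.RECID}) "
            "SET c.name = row.COMPANY",
            rows=rows,
        )
    driver.close()

with DAG("neo4j_ingest", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    ingest = PythonOperator(task_id="load_companies", python_callable=load_companies)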
This is a great conversation. I'm also at a point where I need to begin scaling my graph database in order to efficiently update millions of nodes and relationships. I wrote a couple of Python scripts that pull data from two different sources, clean it up, and transform it into a neo4j-admin import compliant format. The management team at my company is now coming up with different ideas to add more data to our graph, so the number of nodes and relationships will continue to grow.
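In case it's useful to anyone, the transform step in my scripts boils down to emitting CSVs with neo4j-admin's typed headers (:ID, :LABEL, :START_ID, :END_ID, :TYPE). Roughly like this, with illustrative column names rather than my real schema:

# Emit node and relationship CSVs in the shape neo4j-admin import expects.
# Column names and sample data are illustrative.
import pandas as pd

companies = pd.DataFrame(
    {"RECID": ["12345", "67890"], "COMPANY": ["Acme Ltd", "Globex"]}
)

# Node file: one :ID column, a :LABEL column, plus plain property columns.
nodes = companies.rename(columns={"RECID": "acctrecid:ID", "COMPANY": "name"})
nodes[":LABEL"] = "Company"
nodes.to_csv("companies_nodes.csv", index=False)

# Relationship file: :START_ID / :END_ID referencing node IDs, plus :TYPE.
rels = pd.DataFrame(
    {":START_ID": ["12345"], ":END_ID": ["67890"], ":TYPE": ["PARTNERS_WITH"]}
)
rels.to_csv("company_rels.csv", index=False)

# Then, offline against an empty database:
#   neo4j-admin import --nodes=companies_nodes.csv --relationships=company_rels.csv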
Kettle, Apache Airflow, and NiFi are some of the tools suggested in this thread. If I were to spend this weekend drilling down on one of these tools, which one would you all recommend?
I'm leaning towards Apache Airflow based on what I hear from the industry and how popular it has become. I hadn't heard of Kettle or NiFi until this month.
Since I first wrote this post, I've spent quite a bit of time with Kettle (Pentaho). It's worked quite well so far, and I find it makes transformations/jobs easier to share or collaborate on, as they force you to build workflows that are easier to follow logically.
I was also able to build some fairly complex web API integrations/ingestions into the graph database.
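For anyone following along, outside of Kettle the same web API pattern is essentially fetch, flatten, and one batched UNWIND. A rough sketch; the URL, auth, and field names are made up for illustration:

# Web-API-to-graph ingestion: fetch JSON, flatten to a list of maps,
# push with a single batched UNWIND. URL and field names are made up.
import requests
from neo4j import GraphDatabase

resp = requests.get("https://api.example.com/v1/companies", timeout=30)
resp.raise_for_status()
records = resp.json()  # assumed shape: [{"id": ..., "name": ...}, ...]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(
        "UNWIND $records AS rec "
        "MERGE (c:Company {acctrecid: rec.id}) "
        "SET c.name = rec.name",
        records=records,
    )
driver.close()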
I really need to build some demos and post back here more often; there don't seem to be many in-depth Neo4j/Pentaho examples available, and I could probably contribute to that now.