AWS Glue + Neo4j : Tutorials?

mike_r_black · October 28, 2019, 1:37am

I finally got around to spending some more time on this project today and found some success that I wanted to share for anyone else who may come across this post in their education.

Neo4j Spark Documentation - helpful starting point
Download the latest release of the connector from GitHub and upload it to an S3 bucket.
I also downloaded the GraphFrames jar and uploaded it to the S3 bucket

AWS Glue Job
I made a Scala job because that's what the examples are written in (To Do: figure out the python equivalent)
Dependent Jars
include the two jars comma separated
Parameters
This was the tricky part, AWS only lets you specify the a key once. They also don't encourage you to pass in --conf settings but that's how Neo4j wants the connection parameters. Specify a --conf key and the value I just kept on specifying more confgs like this: spark.neo4j.bolt.url=bolt://mydomain.com:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=password'. The Neo4j documentation says you can combine the user & password all part of the URL parameter but I never could get this work.

That's what you need as far as specifying the Glue Job. Then for the actual code of my job I did just a very basic select and print results.

import com.amazonaws.services.glue.ChoiceOption
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.ResolveSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

import org.neo4j.spark._
import org.graphframes._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    
    val neo = Neo4j(spark)
    
    val graphFrame = neo.pattern(("Person","id"),("KNOWS",null), ("Person","id")).partitions(3).rows(1000).loadGraphFrame
    
    graphFrame.vertices.show()
  }
}

Most of this is just the boilerplate that AWS provides when making a new scala job. Not knowing scala and being new to spark in general, it took me some trial and error to get all this figured out. But it runs successfully and in the CloudWatch logs I can see values from my database printed!

Things still left to do

Figure out how to do this python
How to write and general manipulations with graphframes
Manage connection information to be passed in as parameters
What if you had to have two database connections how would you manage that

Topic		Replies	Views
Aws glue integration with neo4j Import / Export	1	927	September 30, 2020
AWS Glue - Loading data into neo4j Import / Export	0	405	September 21, 2020
Toolbelt Trifecta: Connecting to Neo4j with Java and AWS Lambda Neo4j Developer Blog Archive	0	344	February 8, 2022
Neo4j and Snowflake Cloud Database ETL-Tool	3	2167	February 10, 2021
Setup Neo4j with pyspark on Azure databricks Installation	1	825	July 8, 2019

August Summer Fun!

AWS Glue + Neo4j : Tutorials?

Related topics