I finally got around to spending some more time on this project today and had some success that I want to share for anyone else who comes across this post while learning this stuff.
Neo4j Spark Documentation - helpful starting point
Download the latest release of the connector from GitHub and upload it to an S3 bucket.
I also downloaded the GraphFrames jar and uploaded it to the same S3 bucket.
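For reference, the uploads can be done with the AWS CLI; the bucket name and jar file names below are just placeholders for whatever versions you downloaded:
aws s3 cp neo4j-spark-connector-full-2.4.5-M1.jar s3://my-glue-jars/neo4j-spark-connector-full-2.4.5-M1.jar
aws s3 cp graphframes-0.7.0-spark2.4-s_2.11.jar s3://my-glue-jars/graphframes-0.7.0-spark2.4-s_2.11.jar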
AWS Glue Job
I made a Scala job because that's what the examples are written in (To Do: figure out the Python equivalent)
Dependent Jars
Include the S3 paths to the two jars, comma-separated.
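For example, the value is just the two S3 paths joined by a comma (hypothetical bucket and file names):
s3://my-glue-jars/neo4j-spark-connector-full-2.4.5-M1.jar,s3://my-glue-jars/graphframes-0.7.0-spark2.4-s_2.11.jar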
Parameters
This was the tricky part: AWS only lets you specify a key once, and they don't encourage you to pass in --conf settings, but that's how Neo4j wants its connection parameters. So I specified a single --conf key, and in its value I just kept chaining on more configs, like this: spark.neo4j.bolt.url=bolt://mydomain.com:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=password. The Neo4j documentation says you can include the user and password as part of the URL parameter, but I could never get that to work.
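So in the job parameters section it ends up as a single key/value pair along these lines (host and credentials are placeholders):
Key: --conf
Value: spark.neo4j.bolt.url=bolt://mydomain.com:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=password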
That's everything you need to specify for the Glue job. Then for the actual code of my job, I just did a very basic select and printed the results.
import com.amazonaws.services.glue.ChoiceOption
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.ResolveSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import org.neo4j.spark._
import org.graphframes._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Neo4j picks up its connection settings (spark.neo4j.bolt.*) from the Spark config,
    // which is where the --conf job parameters above end up
    val neo = Neo4j(spark)

    // Load the pattern (:Person)-[:KNOWS]->(:Person) as a GraphFrame,
    // using 3 partitions and batches of 1000 rows
    val graphFrame = neo.pattern(("Person", "id"), ("KNOWS", null), ("Person", "id")).partitions(3).rows(1000).loadGraphFrame

    // Print the Person vertices (shows up in the CloudWatch logs)
    graphFrame.vertices.show()
  }
}
Most of this is just the boilerplate that AWS provides when making a new Scala job. Not knowing Scala and being new to Spark in general, it took me some trial and error to get all of this figured out. But it runs successfully, and in the CloudWatch logs I can see values from my database printed!
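For what it's worth, once the GraphFrame is loaded, the simple manipulations appear to just be ordinary GraphFrames/DataFrame calls. A small sketch along those lines that I haven't actually tried in Glue yet:
    graphFrame.edges.show()              // the KNOWS relationships as a DataFrame
    graphFrame.degrees.show()            // per-vertex degree computed by GraphFrames
    println(graphFrame.vertices.count()) // how many Person nodes were loaded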
Things still left to do
- Figure out how to do this in Python
- How to do writes and general manipulations with GraphFrames
- Manage connection information to be passed in as parameters
- What if you had to have two database connections? How would you manage that?