What is the recommended method to update the results from Spark back to Neo4j?

spark

(Srj295) #1

I wanted to get some information on the recommended method for updating Neo4j graph node values after doing some processing in Apache Spark.
For my use case, I have a graph in Neo4j, and I want to run Spark GraphX's Pregel algorithm on that graph data.
I am planning to use Cypher for Apache Spark (CAPS) to read the graph from Neo4j and run GraphX's Pregel on it.
One possible way to do this is shown in the example Neo4jMergeExample.scala (https://github.com/opencypher/cypher-for-apache-spark/blob/master/spark-cypher-examples/src/main/scala/org/opencypher/spark/examples/Neo4jMergeExample.scala).
This requires us to create a copy of the original Neo4j graph in the Spark session, apply the updates on top of this copy, and then use Neo4jGraphMerge.merge() to write the graph back to Neo4j.
Is there a way to do this without creating a copy of the original Neo4j graph in the CAPS session?

What is the recommended method to update the results from Spark back to Neo4j?


(Michael Hunger) #2

Do you run custom algorithms in Spark or just built-in ones? I would love to hear more details.
For the built-in ones you can also check out the graph algorithms library in Neo4j, which should be faster than the Spark algorithms.

For writing back, if you have some means of identifying the nodes again, you can also just instantiate the Neo4j Bolt driver in your Spark jobs and write the results back in batches, see:

That's what I also do in the neo4j-spark-connector.
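A minimal sketch of that batched write-back pattern, assuming the official Neo4j Java driver (package `org.neo4j.driver.v1` in the 1.x series) called from Scala; the URI, credentials, batch size, and the `score` property are placeholders, not details from this thread:

```scala
// Hedged sketch: write (id, score) results back to Neo4j over Bolt in batches.
// Assumes the Neo4j Java driver 1.x; connection details are placeholders.
import org.neo4j.driver.v1.{AuthTokens, GraphDatabase, Values}
import scala.collection.JavaConverters._

object WriteBackInBatches {
  def main(args: Array[String]): Unit = {
    // (internal node id, computed score) pairs produced by the Spark job
    val results: Seq[(Long, Double)] = Seq((0L, 0.85), (1L, 0.15))

    val driver = GraphDatabase.driver("bolt://localhost:7687",
      AuthTokens.basic("neo4j", "secret"))
    val session = driver.session()
    try {
      results.grouped(10000).foreach { batch =>
        val rows = batch.map { case (id, score) =>
          Map[String, AnyRef]("id" -> Long.box(id), "score" -> Double.box(score)).asJava
        }.asJava
        // One UNWIND per batch keeps round trips and transaction sizes bounded
        session.run(
          """UNWIND $rows AS row
            |MATCH (n) WHERE id(n) = row.id
            |SET n.score = row.score""".stripMargin,
          Values.parameters("rows", rows))
      }
    } finally {
      session.close()
      driver.close()
    }
  }
}
```

In a real Spark job this would typically run inside `foreachPartition`, so that each executor opens its own driver and session rather than serializing one from the driver program.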


(Srj295) #3

Hi Michael,

Thank you for the quick reply.
For my use case, I am using Cypher for Apache Spark (CAPS) to read the graph from Neo4j into Spark,
run a custom GraphX Pregel algorithm in Spark to compute the score property of each vertex, and then update the computed values back to Neo4j. I looked at the graph algorithms from Neo4j and also used Neo4j's PageRank algorithm. This is very helpful.
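For illustration, a self-contained GraphX Pregel sketch along these lines. The actual scoring logic is not shown in the thread, so this stand-in simply propagates the maximum score along edges; the graph data and iteration count are made up:

```scala
// Hedged sketch: a minimal GraphX Pregel run that computes a per-vertex "score"
// (here: the maximum score reachable along incoming edges). Placeholder data.
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object PregelScoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("pregel-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy graph: vertex attribute is the initial score
    val vertices = sc.parallelize(Seq((1L, 0.9), (2L, 0.1), (3L, 0.5)))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))
    val graph = Graph(vertices, edges)

    val scored = graph.pregel(initialMsg = 0.0, maxIterations = 5)(
      // Vertex program: keep the best score seen so far
      vprog = (_, attr, msg) => math.max(attr, msg),
      // Send the source's score downstream only if it improves the destination
      sendMsg = t =>
        if (t.srcAttr > t.dstAttr) Iterator((t.dstId, t.srcAttr))
        else Iterator.empty,
      // Combine concurrent messages by taking the maximum
      mergeMsg = math.max
    )

    scored.vertices.collect().foreach(println)
    spark.stop()
  }
}
```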

I am trying to explore CAPS and see how best I can use it for my use case.
How does CAPS handle reading a large graph in the Spark job? When I am reading a large graph from a remote Neo4j instance (Neo4j running on a machine outside the Spark cluster), does it read the entire graph on a single node (the master node) before it runs the Spark transformations and actions?


(Mats Rydberg) #4

Hello SRJ

Are you perhaps the same person as the GitHub user shyamrjoshi who also posted this GitHub issue?

In general you'll reach us (the CAPS team) faster if you post a GitHub issue, as we monitor these daily. Posting here is fine as well, but you'll likely experience a slightly longer delay.

How does CAPS handle reading a large graph in the Spark job? When I am reading a large graph from a remote Neo4j instance (Neo4j running on a machine outside the Spark cluster), does it read the entire graph on a single node (the master node) before it runs the Spark transformations and actions?

It depends on what you are doing with the graph. CAPS will not load anything from the remote Neo4j server eagerly, except that it will call the loaded procedure in order to compute the schema of the Neo4j graph. The data that it loads is defined by what query or operation you do on the Neo4j graph; if you only query for :Person nodes, only :Person nodes will be loaded. If you query for all nodes, all nodes will be loaded.

The loading happens only at Spark action time: when you call .show() or .store() or some other action directly on the result DataFrame.
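As a concrete illustration of that laziness, a rough sketch against the CAPS API as used in the project's examples at the time. The source registration, the `neo4j.graph` name, and the connection config are assumptions that should be checked against the current examples, so treat this as pseudocode-adjacent rather than a working program:

```scala
// Hedged sketch: nothing is pulled from the remote Neo4j server until the
// action at the end, and only the :Person nodes the query asks for are loaded.
import org.opencypher.okapi.api.graph.Namespace
import org.opencypher.spark.api.{CAPSSession, GraphSources}

object LazyNeo4jRead extends App {
  implicit val session: CAPSSession = CAPSSession.local()

  // Register the remote Neo4j instance as a graph source
  session.registerSource(Namespace("neo4j"),
    GraphSources.cypher.neo4j(???)) // Neo4j connection config goes here

  // Defining the query loads nothing yet
  val result = session.cypher(
    "FROM GRAPH neo4j.graph MATCH (p:Person) RETURN p.name")

  // The Spark action is what triggers reading from Neo4j
  result.show
}
```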


(Srj295) #5

Hi Mats,

Yes, that's me. I opened this issue on GitHub.
I will continue this thread on the GitHub issue.

Apologies for any inconvenience caused.

Thanks,
Shyam