Neo4j Spark Connector integration with pyspark

Hi,
I'm trying to read nodes from my local Neo4j database for practice purposes, using pyspark and the Neo4j connector. I've already downloaded the latest version of neo4j-connector-apache-spark (2.12) and integrated it into pyspark as explained in the README of the repo [GitHub - neo4j-contrib/neo4j-spark-connector: Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs].
However when I try to perform a read using:

spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", "bolt://localhost:7687") \
  .option("authentication.basic.username", "neo4j") \
  .option("authentication.basic.password", "psw") \
  .option("labels", "Person") \
  .load() \
  .show()
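For a read like the one above to work, the connector jar has to be visible to the Spark driver when the session starts. A typical way to do that (the jar path below is a placeholder; point it at wherever you downloaded the jar):

```shell
# Launch an interactive pyspark shell with the connector jar on the classpath
# (placeholder path; adjust to your local download location)
pyspark --jars /path/to/neo4j-connector-apache-spark_2.12-4.0.0.jar

# or, equivalently, run a script via spark-submit
spark-submit --jars /path/to/neo4j-connector-apache-spark_2.12-4.0.0.jar sparkneo4jconn.py
```

If the jar is missing from the classpath you would normally get a "Failed to find data source" error rather than a `NoClassDefFoundError`, so in this thread the jar is being found but is incompatible with the running Spark version.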

I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
I think it could be related to the format string "org.neo4j.spark.DataSource", but I don't know how to fix it.

Thanks for your attention,

Justin

My first thought: have you double-checked the Spark version (what you are using versus what the connector expects)?

Capability varies by Spark version, and major updates have breaking changes.
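A quick way to see which Spark build (and which Scala build it was compiled against) is actually on your path:

```shell
# Prints the Spark version banner, including the Scala build,
# e.g. "Using Scala version 2.12.10"
spark-submit --version

# The installed pyspark package version may differ from the system Spark;
# worth checking both
python -c "import pyspark; print(pyspark.__version__)"
```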

First of all thanks for your help Joel.
As you suggested, I checked the Spark version I'm using against the required one:

  • since I'm using pyspark 3.0.1, which runs on Scala 2.12, I use neo4j-connector-apache-spark_2.12-4.0.0.jar, as indicated on github

  • I've even tried installing pyspark 2.4.0, which runs on Scala 2.11, in order to try the other connector build (neo4j-connector-apache-spark_2.11-4.0.0.jar)
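The constraint being tested above is that the jar's Scala suffix must match the Scala build Spark was compiled against. A small illustrative helper (not part of the connector; it only parses the `neo4j-connector-apache-spark_<scala>-<version>.jar` naming pattern used in the repo) makes the check explicit:

```python
# Illustrative compatibility check, not part of the connector.
# A connector jar's Scala suffix must match the Scala build of the
# Spark runtime, otherwise class loading fails at runtime.

def scala_suffix(jar_name: str) -> str:
    """Extract the Scala version from a jar name like
    neo4j-connector-apache-spark_2.12-4.0.0.jar."""
    # The Scala version sits between the last '_' and the next '-'
    return jar_name.rsplit("_", 1)[1].split("-")[0]

def is_compatible(jar_name: str, spark_scala_version: str) -> bool:
    """True when the connector build matches Spark's Scala build."""
    return scala_suffix(jar_name) == spark_scala_version

print(scala_suffix("neo4j-connector-apache-spark_2.12-4.0.0.jar"))  # 2.12
print(is_compatible("neo4j-connector-apache-spark_2.11-4.0.0.jar", "2.12"))  # False
```

Note that a matching Scala suffix is necessary but, as this thread later shows, not sufficient: the connector release must also support the Spark major version itself.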

In both cases I'm still getting the same error, which I report in full:

Traceback (most recent call last):
  File "c:/Users/arman/Desktop/prova/sparkneo4jconn.py", line 14, in <module>
    spark.read.format("org.neo4j.spark.DataSource") \
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\pyspark\sql\readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\pyspark\sql\utils.py", line 128, in deco
    return f(*a, **kw)
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
        at java.base/java.lang.ClassLoader.defineClass1(Native Method)
        at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
        at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
        at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:825)
        at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:723)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:646)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:604)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:168)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:576)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$3(DataSource.scala:653)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:653)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:606)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:168)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 27 more

Sorry to hear that. I can't think of anything else; I'm laser-focused on
"java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport", which strongly suggests an issue with versions and/or classpath. On occasion any given error is a red herring (with COBOL every error is a red herring, but I digress).

I'd pursue this first, to rule it out.

I've been down in the version/pathing abyss (kid friendly word) with spark, and it can be a nightmare.

Likely I have made a mistake in adding the connector to pyspark.
Or am I missing something such as drivers? I suspect I should add JDBC drivers, but I don't know how to do that.

Could you please suggest a guide or tutorial on how to properly set up pyspark to run the Neo4j connector?

Thanks again

@j.armanini did you solve this? Is it still an issue? If so please confirm:

  • where is your spark env? Databricks, AWS, ...?
  • how did you install the connector?
  • the spark connector jar name
  • the spark version

Please let me know and I'll try to help you.

(btw I'm on the team that maintains the spark connector)

@conker84 Actually I filed the same issue on github a month ago and you already helped me.
Not sure whether you remember it, but it was because I was using pyspark 3.0.1, which wasn't supported yet at that moment. Thanks

Oh I see. Btw we're working on Spark 3.0 support and we hope to get it ready soon.

@j.armanini we just merged the PR for Spark 3.0 support; if you want, you can download a preview version from here: