Neo4j Spark Connector integration with pyspark

Hi,
I'm trying to read nodes from my local Neo4j database for practice purposes, using pyspark and the Neo4j connector. I've already downloaded the latest version of neo4j-connector-apache-spark (2.12) and integrated it into pyspark as explained in the README of the repo [GitHub - neo4j-contrib/neo4j-spark-connector: Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs].
However when I try to perform a read using:

spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", "bolt://localhost:7687") \
  .option("authentication.basic.username", "neo4j") \
  .option("authentication.basic.password", "psw") \
  .option("labels", "Person") \
  .load() \
  .show()
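For a read like the one above to work, the connector jar has to be visible to the Spark driver when the session starts. A typical way to do that (the jar path below is a placeholder; point it at wherever you downloaded the jar):

```shell
# Launch an interactive pyspark shell with the connector jar on the classpath
# (placeholder path; adjust to your local download location)
pyspark --jars /path/to/neo4j-connector-apache-spark_2.12-4.0.0.jar

# or, equivalently, run a script via spark-submit
spark-submit --jars /path/to/neo4j-connector-apache-spark_2.12-4.0.0.jar sparkneo4jconn.py
```

If the jar is missing from the classpath you would normally get a "Failed to find data source" error rather than a `NoClassDefFoundError`, so in this thread the jar is being found but is incompatible with the running Spark version.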

I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
I think it could be related to the format string "org.neo4j.spark.DataSource", but I don't know how to fix it.

Thanks for your attention,

Justin

My first thought: have you double-checked the Spark version (what you are using versus what the connector expects)?

Capability varies by Spark version, and major updates have breaking changes.
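A quick way to see which Spark build (and which Scala build it was compiled against) is actually on your path:

```shell
# Prints the Spark version banner, including the Scala build,
# e.g. "Using Scala version 2.12.10"
spark-submit --version

# The installed pyspark package version may differ from the system Spark;
# worth checking both
python -c "import pyspark; print(pyspark.__version__)"
```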

First of all thanks for your help Joel.
As you suggested, I checked the Spark version I'm using against the required one:

  • since I'm using pyspark 3.0.1, which runs on Scala 2.12, I use neo4j-connector-apache-spark_2.12-4.0.0.jar, as indicated on github

  • I've even tried installing pyspark 2.4.0, which runs on Scala 2.11, in order to try the other connector build (neo4j-connector-apache-spark_2.11-4.0.0.jar)
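The constraint being tested above is that the jar's Scala suffix must match the Scala build Spark was compiled against. A small illustrative helper (not part of the connector; it only parses the `neo4j-connector-apache-spark_<scala>-<version>.jar` naming pattern used in the repo) makes the check explicit:

```python
# Illustrative compatibility check, not part of the connector.
# A connector jar's Scala suffix must match the Scala build of the
# Spark runtime, otherwise class loading fails at runtime.

def scala_suffix(jar_name: str) -> str:
    """Extract the Scala version from a jar name like
    neo4j-connector-apache-spark_2.12-4.0.0.jar."""
    # The Scala version sits between the last '_' and the next '-'
    return jar_name.rsplit("_", 1)[1].split("-")[0]

def is_compatible(jar_name: str, spark_scala_version: str) -> bool:
    """True when the connector build matches Spark's Scala build."""
    return scala_suffix(jar_name) == spark_scala_version

print(scala_suffix("neo4j-connector-apache-spark_2.12-4.0.0.jar"))  # 2.12
print(is_compatible("neo4j-connector-apache-spark_2.11-4.0.0.jar", "2.12"))  # False
```

Note that a matching Scala suffix is necessary but, as this thread later shows, not sufficient: the connector release must also support the Spark major version itself.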

In both cases I'm still getting the same error, which I report in full:

Traceback (most recent call last):
  File "c:/Users/arman/Desktop/prova/sparkneo4jconn.py", line 14, in <module>
    spark.read.format("org.neo4j.spark.DataSource") \
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\pyspark\sql\readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\pyspark\sql\utils.py", line 128, in deco
    return f(*a, **kw)
  File "C:\Users\arman\Desktop\prova\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
        at java.base/java.lang.ClassLoader.defineClass1(Native Method)
        at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
        at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
        at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:825)
        at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:723)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:646)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:604)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:168)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:576)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$3(DataSource.scala:653)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:653)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:606)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:168)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 27 more

Sorry to hear that. I can't think of anything else; I'm laser-focused on
"java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport", which strongly suggests an issue with versions and/or classpath. On occasion any given error is a red herring (with COBOL every error is a red herring, but I digress).

I'd pursue this first, to rule it out.

I've been down in the version/pathing abyss (kid friendly word) with spark, and it can be a nightmare.

Likely I have made a mistake in adding the connector to pyspark.
Or am I missing something such as drivers? I suspect I should add JDBC drivers, but I don't know how to do that.

Could you please suggest a guide or tutorial on how to properly set up pyspark to run the Neo4j connector?

Thanks again

@j.armanini did you solve this? Is it still an issue? If so please confirm:

  • where is your spark env? Databricks, AWS, ...?
  • how did you install the connector?
  • the spark connector jar name
  • the spark version

Please let me know and I'll try to help you.

(btw I'm on the team that maintains the spark connector)

@conker84 Actually I filed the same issue on github a month ago and you already helped me.
Not sure whether you remember it, but it was because I was using pyspark 3.0.1, which wasn't supported yet at that moment. Thanks

Oh I see. Btw we're working on Spark 3.0 support and we hope to get it ready soon.

@j.armanini we just merged the PR for Spark 3.0 support; if you want, you can download a preview version from here: