I'm using the 4.4.34 neo4j-community Docker image. While trying to apply limits to Cypher queries through the PySpark connector, I arrived at the following solution, which mentions "query.count".
Incorporating this approach into my own method against the Movies database, I observed the following results.
I pass a query with "4" as the value for "partitions" and "1" for "query.count". Printing the length of each partition, I get 5 partitions with a limit of 1 row per partition, whereas my expected output for this exact query would be 4 partitions with a limit of 1 row per partition.
This behavior can be observed for a number of different values of both "partitions" and "query.count" (more are attached at the end of the page). Below is a table containing a number of tests for different "partitions" and "query.count" values.
| partitions | query.count | expected partitions | expected rows/partition | actual partitions | actual rows/partition |
|---|---|---|---|---|---|
| 4 | 1 | 4 | 1 | 5 | 1 |
| 2 | 1 | 2 | 1 | 3 | 1 |
| 2 | 2 | 2 | 2 | 3 | 1 |
| 5 | 5 | 5 | 5 | 6 | 1 |
Could you kindly explain what's happening and why I'm not getting the expected results?
EDIT: query.count can actually be a long or a query, my bad
Hello, query.count must be a Cypher query that returns an integer (e.g. RETURN 42 or something more dynamic), as documented towards the end of the following section.
Can you change your query.count setting and see what happens?
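For example, something along these lines; this is just an illustration against the Movies dataset, so adjust the count query to whatever your custom query actually matches:

```python
from pyspark.sql import SparkSession

# Assumes the Neo4j Spark connector jar is on the classpath; URL/credentials are placeholders.
spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("query", "MATCH (m:Movie) RETURN m.title AS title")
    .option("partitions", "4")
    # query.count given as a Cypher query that returns a single integer,
    # instead of a plain number.
    .option("query.count", "MATCH (m:Movie) RETURN count(m) AS count")
    .load()
)
```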
Hello, sorry, I initially misread the question a bit.
For custom queries with query.count, it seems the connector purposefully adds an extra partition.
I'm honestly not 100% sure why.
Regarding the rows per partition, it seems query.count is interpreted as a global limit, not a limit per partition.
If you have, say, 3 partitions and want 3 rows per partition, then query.count should be set to 9. I'll see if we can make the documentation clearer.
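To make the arithmetic explicit (this is just the idea, not the connector's actual code):

```python
# query.count acts as a global row budget that gets split across the partitions.
query_count = 9   # total rows to read across the whole job
partitions = 3    # the "partitions" option

rows_per_partition = query_count // partitions
print(rows_per_partition)  # 3 -> each of the 3 partitions handles 3 of the 9 rows
```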
Update: I went ahead and submitted a fix for the extra partition and, hopefully, a docs clarification around custom counts.
Thank you for your reply and for making the modifications!
Can you please let me know where the fix is going to be deployed, and could you also kindly point me to the related documentation for "query.limit"?
FYI: I updated my neo4j-spark jar to the latest version, 5.3.2.
Additionally, I replaced "query.count" in my implementation with "query.limit". However, my output seems to be different from what you mentioned.
I pass a query with "3" as the value for "partitions" and "9" for "query.limit". Printing the length of each "partition" I get the following output - 3 partitions, 2 of the partitions contain 13 rows and the 3rd partition contains 12 rows. As you mentioned in your previous reply the output should be, 3 partitions with 3 rows per partition.
Another example, when using 2 partitions and 2 for query limit. the result I get is 2 partitions with 19 rows in each partition.
Also tested with "query.count" and observed the same behavior as mentioned in my opening message.
(query.count - does not return the count of the nodes as you described, it works as a query limit. Limiting the number of rows per each partition)
Sorry, query.limit was a typo, I meant query.count. Let me update my post above.
Back to the original issue, with:
- a custom query
- query.count set to 9
- partitions set to 3

the current implementation should create 4 partitions (until my fix is in), and each partition's query should yield 3 rows (except the extra one, which may remain unused).
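To spell that out, here's a rough sketch of the per-partition reads I'd expect with those settings; the SKIP/LIMIT wrapping is shorthand for the idea, not the connector's literal generated Cypher:

```python
# Sketch only: how query.count = 9 and partitions = 3 could translate into
# per-partition windows under the current (pre-fix) implementation.
base_query = "MATCH (m:Movie) RETURN m.title AS title"  # placeholder custom query
query_count = 9
partitions = 3

chunk = query_count // partitions  # 3 rows per partition
for i in range(partitions + 1):    # +1 models the extra partition the current code adds
    print(f"partition {i}: {base_query} SKIP {i * chunk} LIMIT {chunk}")
```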
I noticed your fix on GitHub and updated the related JAR file in order to incorporate your changes.
However, the issue continues to persist. Running a query with partitions set to 3 and query.count set to 9, I still get 4 partitions in my output, each containing 3 rows.