I'm using the 4.4.34 neo4j-community Docker image. While trying to apply limits to Cypher queries through the PySpark connector, I arrived at the following solution, which mentions "query.count".
Incorporating this approach into my own method against the Movies database, I observed the following results.
I pass a query with "4" as the value for "partitions" and "1" for "query.count". Printing the length of each partition, I get 5 partitions with a limit of 1 row per partition, whereas my expected output for this exact query would be 4 partitions with a limit of 1 row per partition.
This behavior can be observed for a number of different values of both "partitions" and "query.count" (more are attached at the end of the page). Below is a table containing a number of tests for different "partitions" and "query.count" values.
| partitions | query.count | expected partitions | expected rows/partition | actual partitions | actual rows/partition |
|---|---|---|---|---|---|
| 4 | 1 | 4 | 1 | 5 | 1 |
| 2 | 1 | 2 | 1 | 3 | 1 |
| 2 | 2 | 2 | 2 | 3 | 1 |
| 5 | 5 | 5 | 5 | 6 | 1 |
Could you kindly explain what's happening and why I'm not getting the expected results?
EDIT: query.count can actually be a long or a query, my bad
Hello, query.count must be a Cypher query that returns an integer (e.g. RETURN 42 or something more dynamic), as documented towards the end of the following section.
Can you change your query.count setting and see what happens?
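For example, something along these lines; this is just an illustration against the Movies dataset, so adjust the count query to whatever your custom query actually matches:

```python
from pyspark.sql import SparkSession

# Assumes the Neo4j Spark connector jar is on the classpath; URL/credentials are placeholders.
spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("query", "MATCH (m:Movie) RETURN m.title AS title")
    .option("partitions", "4")
    # query.count given as a Cypher query that returns a single integer,
    # instead of a plain number.
    .option("query.count", "MATCH (m:Movie) RETURN count(m) AS count")
    .load()
)
```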
Hello, sorry, I initially misread the question a bit.
For custom queries with query.count, it seems the connector purposefully adds an extra partition.
I'm honestly not 100% sure why.
Regarding the rows per partition, it seems query.count is interpreted as a global limit, not a limit per partition.
If you have, say, 3 partitions and want 3 rows per partition, then query.count should be set to 9. I'll see if we can make the documentation clearer.
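To make the arithmetic explicit (this is just the idea, not the connector's actual code):

```python
# query.count acts as a global row budget that gets split across the partitions.
query_count = 9   # total rows to read across the whole job
partitions = 3    # the "partitions" option

rows_per_partition = query_count // partitions
print(rows_per_partition)  # 3 -> each of the 3 partitions handles 3 of the 9 rows
```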
Update: I went ahead and submitted a fix for the extra partition and, hopefully, a docs clarification around custom counts.
Thank you for your reply and for making the modifications!
Can you please let me know where the fix is going to be deployed, and could you also kindly point me to the related documentation for "query.limit"?
FYI: I updated my neo4j-spark jar to the latest version, 5.3.2.
Additionally, I replaced "query.count" in my implementation with "query.limit". However, my output seems to be different from what you mentioned.
I pass a query with "3" as the value for "partitions" and "9" for "query.limit". Printing the length of each "partition" I get the following output - 3 partitions, 2 of the partitions contain 13 rows and the 3rd partition contains 12 rows. As you mentioned in your previous reply the output should be, 3 partitions with 3 rows per partition.
Another example, when using 2 partitions and 2 for query limit. the result I get is 2 partitions with 19 rows in each partition.
Also tested with "query.count" and observed the same behavior as mentioned in my opening message.
(query.count - does not return the count of the nodes as you described, it works as a query limit. Limiting the number of rows per each partition)
Sorry, query.limit was a typo, I meant query.count. Let me update my post above.
Back to the original issue, with:
- a custom query
- query.count set to 9
- partitions set to 3

the current implementation should create 4 partitions (until my fix is in), and each partition's query should yield 3 rows (except the extra one, which may remain unused).
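To spell that out, here's a rough sketch of the per-partition reads I'd expect with those settings; the SKIP/LIMIT wrapping is shorthand for the idea, not the connector's literal generated Cypher:

```python
# Sketch only: how query.count = 9 and partitions = 3 could translate into
# per-partition windows under the current (pre-fix) implementation.
base_query = "MATCH (m:Movie) RETURN m.title AS title"  # placeholder custom query
query_count = 9
partitions = 3

chunk = query_count // partitions  # 3 rows per partition
for i in range(partitions + 1):    # +1 models the extra partition the current code adds
    print(f"partition {i}: {base_query} SKIP {i * chunk} LIMIT {chunk}")
```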
I noticed your fix on GitHub and updated the related JAR file in order to incorporate your changes.
However, the issue continues to persist. Running a query with partitions set to 3 and query.count set to 9, I still get 4 partitions in my output, each containing 3 rows.