Returning millions of rows from cypher-shell taking too long

kevinvivk · April 11, 2020, 12:34pm

Hello,

I am using Neo4j Community Edition 3.2.2. I am trying to identify synthetic identity fraud with the following data. I have nodes with 4 major labels: :Person, :Email, :Phone, :Identifier. Every :Person node also has either a :Fraud or a :Non-Fraud label. There are 150 million :Person nodes in all, of which 1.1 million are fraud nodes. The data is confidential. I need to find the number of neighbour nodes for every :Person node. I run the following query in cypher-shell and redirect its output to a CSV file.

profile MATCH (n:Person) return n.fid, size((n)-[]-());

fid is an id property of the :Person node. The query runs endlessly as a shell process. However, if I run the same query on the smaller label :Fraud, it finishes in 15 secs.

profile MATCH (n:Fraud) return n.fid, size((n)-[]-());

The profile of this query shows that it scales linearly in both db hits and rows. So the query on the :Person label should take about 2000 secs (150 million / 1.1 million × 15 secs), which is less than an hour. It therefore seems the problem is due to the size of the output.
Max and initial heap size are both set to 30 GB, and the pagecache size is left commented out, which I assume defaults to 50% of RAM minus max heap size (0.5*500 - 30 = 220 GB). Also, this is not a dedicated Neo4j server. Please correct me if I am wrong anywhere and help me with this problem.
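
As a quick sanity check of the figures above, here is a minimal Python sketch. It assumes run time scales linearly with the number of nodes scanned, and that the machine has 500 GB of RAM (implied by the 0.5*500 term). Whether the default page cache in 3.2.2 really follows that formula is the assumption stated in the post, not something verified here.

```python
# Back-of-the-envelope check of the numbers quoted in the post.
# Assumption: run time grows linearly with the number of nodes scanned.
fraud_nodes = 1.1e6       # :Fraud nodes
person_nodes = 150e6      # :Person nodes
fraud_seconds = 15        # measured run time of the :Fraud query

expected_seconds = fraud_seconds * person_nodes / fraud_nodes
print(f"expected :Person run time: {expected_seconds:.0f} s "
      f"(about {expected_seconds / 60:.0f} min)")

# Page cache guess from the post: 50% of RAM minus max heap (assumed formula).
ram_gb = 500              # implied by the 0.5*500 figure
heap_gb = 30
page_cache_gb = 0.5 * ram_gb - heap_gb
print(f"assumed default page cache: {page_cache_gb:.0f} GB")
```

The first figure comes out at roughly 2045 secs, which matches the "about 2000 secs" estimate.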

Thanks and Regards,
Kevin Kunnapilly

stefan.armbruster · April 11, 2020, 1:03pm

My suspicion is that your query is using the compiled runtime. This implementation is pretty fast but does materialize the results - something you want to avoid when returning 150M rows.

Therefore you need to force a different runtime implementation. On enterprise edition you would choose slotted. On community there's no slotted, so interpreted is the next best choice.

To do so, prefix your statement:

cypher runtime=interpreted MATCH (n:Person) return n.fid, size((n)-[]-());

I expect this will perform much better. Looking forward to hearing your feedback.

On a different note: 3.2.2 is quite outdated, so please consider an upgrade.

kevinvivk · April 12, 2020, 6:55pm

Hi Stefan,

Thank you for your speedy reply. It worked wonderfully and the query ran in under 1100 secs as expected.

Sorry for the extra trouble, but I have two more questions:

1. I have some more complex queries to run for every node, such as counting the people directly connected at level 2 (person-id-person), which would need roughly 50 to 1000 times more db hits than the previous query. Are there conf file properties I could tweak to get faster performance? Or, worst case, if I need a server or cluster with more resources, what would be an ideal amount of RAM and other settings for about 0.5 to 1.5 billion nodes?
2. Does forcing the runtime to interpreted mean that the query will simply keep running until I get the result or kill it? If possible, could you provide some insight into the compiled vs interpreted runtimes?

Thanks and Regards,
Kevin Kunnapilly

stefan.armbruster · April 12, 2020, 7:19pm

The main difference between the compiled and interpreted runtimes is that the former processes the query server-side and collects all the results into in-memory data structures. Only once it has finished are the results streamed to the client. The interpreted runtime can stream results directly.
Note that things have changed in more recent versions. Therefore, retry your statement on an up-to-date release without prefixing a runtime implementation.

If you use a runtime that streams directly, the memory overhead of that query will be close to 0.
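
The difference between collecting results first and streaming them can be illustrated with a small Python analogy. This is only a sketch of the general idea, not how Neo4j implements either runtime; `produce_rows`, `materialized` and `streamed` are hypothetical names, and the row values are made up.

```python
import sys
from typing import Iterator, List, Tuple

Row = Tuple[int, int]

def produce_rows(n: int) -> Iterator[Row]:
    """Stand-in for the database producing one (fid, degree) row at a time."""
    for fid in range(n):
        yield (fid, fid % 5)  # made-up degree value, for illustration only

def materialized(n: int) -> List[Row]:
    """Collect every row in memory first, then hand over the full list."""
    return list(produce_rows(n))

def streamed(n: int) -> Iterator[Row]:
    """Hand each row over as soon as it is produced; nothing is buffered."""
    yield from produce_rows(n)

# Both approaches deliver the same rows; only the peak memory use differs.
rows = materialized(1000)
stream = streamed(1000)
print("list container size:", sys.getsizeof(rows), "bytes")
print("generator object size:", sys.getsizeof(stream), "bytes")
```

The list grows with the number of rows, while the generator stays the same small size however many rows it will eventually yield. That is why a streaming runtime matters when 150M rows are returned.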

kevinvivk · April 12, 2020, 7:27pm

Hi Stefan,

I have switched to 3.5.17 for now; 4.0 wasn't possible on my current server because its Java version is outdated. Will that work?
Also, please advise me on any server-side optimisations, or on how much RAM is advisable for my situation (around 0.5 to 1.5 billion nodes, or about 50 to 1000 times more db hits per query).

Thanks,
Kevin

stefan.armbruster · April 12, 2020, 7:54pm

The compiled runtime has been removed in 4.0 (see "Deprecations, additions, and compatibility" in the Cypher Manual), so you'll still have it in 3.5.

Precise sizing estimates are normally the result of a workshop. You can read up on the basics in the "Performance" section of the Operations Manual.
