I am using Neo4j Community Edition 3.2.2 and trying to identify synthetic identity fraud with the following data. My nodes have four major labels: :Person, :Email, :Phone and :Identifier. Every :Person node also carries either a :Fraud or a :Non-Fraud label. There are 150 million :Person nodes in all, of which 1.1 million are fraud nodes. The data is confidential. I need to find the number of neighbour nodes for every :Person node, so I run the following query in cypher-shell and write its output to a CSV file.
PROFILE MATCH (n:Person) RETURN n.fid, size((n)--());
fid is an id property of the :Person node. The query runs endlessly and never finishes. However, if I run the same query on the smaller :Fraud label, it finishes in 15 seconds.
PROFILE MATCH (n:Fraud) RETURN n.fid, size((n)--());
The profile of this query shows that it scales linearly in both db hits and rows. The query on the :Person label should therefore take about 15 s * (150 million / 1.1 million) ≈ 2000 seconds, which is less than an hour. So the problem seems to be due to the size of the output.
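One way to check the output-size hypothesis would be to aggregate inside the database so that only a single row comes back. This is a minimal sketch that uses the same label and pattern as the queries above; the aliases persons and totalDegree are just illustrative names.

```cypher
// Same traversal as the :Person query, but aggregated to a single row,
// so almost nothing has to be streamed back through cypher-shell.
MATCH (n:Person)
RETURN count(n) AS persons, sum(size((n)--())) AS totalDegree;
```

If this finishes in roughly the estimated time, that would suggest the traversal itself is fine and the time is going into returning 150 million rows.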
Both the initial and the max heap size are set to 30 GB, and the page cache size is left commented out, which I assume defaults to 50% of RAM minus the max heap size (0.5 * 500 - 30 = 220 GB). Also, this is not a dedicated Neo4j server. Please correct me if I am wrong anywhere, and help me with this problem.
Thanks and Regards,