How to fetch millions of rows faster?

Yes, it's on my personal computer, which is running with a 16G heap and a 1G page cache.

Thank you for your time. I am planning to use Python as the client program. Even though I increased the heap, it didn't help with this query.

So I spent two hours working through your query; after all, it is a gigantic full-graph query that touches almost every entity in the graph.

I did a bunch of optimizations:

  • 9G page cache, 8G heap
  • store the delta directly on the node (as a numeric value)
  • use a label to tag active bureaus
  • load the installments with their delta before expanding
  • turn the OPTIONAL MATCH into a pattern comprehension
  • store the statuses in the list, which saves the second expand to bureaus
  • then just filter the list for the status

For the initial updates:

call apoc.periodic.iterate("MATCH (i:Installments_Payment) RETURN i",
"SET i.DaysInstallMinusEntry = toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT)"
,{batchSize:10000,iterateList:true,parallel:true,concurrency:12});

// takes 88s

call apoc.periodic.iterate("MATCH (bureau:Bureau) WHERE bureau.CREDIT_ACTIVE = 'Active' RETURN bureau",
"SET bureau:Active"
,{batchSize:10000,iterateList:true,parallel:true,concurrency:12});
// takes 12s

The runtime on my computer is between 5 and 7 minutes; here is the final result:

from neo4j.v1 import GraphDatabase
import time


driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))

statement = """
MATCH (i:Installments_Payment) 
WITH i, toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry
MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i) 

WITH a, DaysInstallMinusEntry, count(i) AS TotalInstallments
WITH a,DaysInstallMinusEntry,TotalInstallments, [ (a)-[:HAS_BUREAU]->(bureau) | bureau:Active] as bureauStatus

WITH DaysInstallMinusEntry, TotalInstallments, 
     size([status in bureauStatus WHERE status]) AS TotalActiveBureau, 
     size(bureauStatus) as TotalBureau, a

RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments,DaysInstallMinusEntry,TotalBureau, 	
	       toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio 
ORDER BY ActiveBureauByTotalBureauRatio DESC
"""

start = time.time()
rows = 0
value = 0
for record in driver.session().run(statement):
    rows = rows + 1
    value = value + record["ActiveBureauByTotalBureauRatio"]

print("rows {rows}".format(rows=rows))
print("took {time}".format(time=time.time() - start))


rows 5339018
took 450.4561972618103 -> ~7.5 minutes

Here are all the details.


So, the black screen is because the query has actually finished and the Browser is trying to retrieve the result stream... which, being too big, crashes the Browser.
Hence I've wrapped all the results in aggregating expressions (count, sum, avg, etc.) so that the result stream is a single row, as sketched below.
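For illustration, the tail of the big query above could be collapsed the same way; a minimal sketch, reusing the variable names from that query:

// instead of one row per application, return a single summary row
RETURN count(*) AS rows,
       avg(toFloat(TotalActiveBureau)/toFloat(TotalBureau)) AS avgActiveBureauRatio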

Managed to run the query in 8+ minutes (AMD Threadripper 1900X, 8 cores x 3.9GHz, 32GB RAM 3200 CL14, M.2 NVMe). I've noticed it's actually running on a single core, which boosts to 4GHz.

The trick is to run outside of index scans, meaning:
MATCH (a:Application)
WITH collect(a) AS aa
UNWIND aa AS a
MATCH (a:Application)-[:HAS_PREVIOUS]....

I actually wonder how long it takes to output the entire result set.

Neo4j browser is not meant to retrieve millions of rows.

See my code; it takes 7-8 minutes from Python.

You just need to stream and write the data somewhere or analyze it further e.g. with pandas.
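If you'd rather write the stream straight to disk than pull it through a client, APOC can also export a query's results to CSV. A minimal sketch (assumes APOC is installed with apoc.export.file.enabled=true; the file name and the shortened query are placeholders):

CALL apoc.export.csv.query(
  "MATCH (a:Application) RETURN a.SK_ID_CURR AS ApplicationID",  // put the full query here
  "results.csv", {});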

Yes, it is currently a single-processor query; this might change in the future with a newer runtime.


Your "trick" has no effect here.

If you look at the profile you don't save any db-hits.
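To check this yourself, prefix both variants with PROFILE and compare the db-hits per operator; a minimal sketch:

// run both versions under PROFILE and compare the plans
PROFILE
MATCH (a:Application)
WITH collect(a) AS aa
UNWIND aa AS a
RETURN count(a);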

Michael, Erika, what kind of operation is this: "with ... [(a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = 'Active'] as bureauStatus"? I haven't seen it before.

A different approach, with worse timings: 11 min.

MATCH (a:Application) WITH a
CALL apoc.path.subgraphAll(a, {relationshipFilter:'HAS_PREVIOUS|HAS_PREV_INSTALLMENTS|HAS_BUREAU'}) YIELD nodes, relationships
WITH a,
     [(a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = 'Active' AND bureau IN filter(n IN nodes WHERE n:Bureau)] AS bureauStatus,
     filter(n IN nodes WHERE n:Installments_Payment) AS ii
WITH size([status IN bureauStatus WHERE status]) AS TotalActiveBureau, ii, bureauStatus, a
UNWIND ii AS i
WITH size(bureauStatus) AS TotalBureau, a,
     toInteger(i.DAYS_INSTALMENT) - toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments, TotalActiveBureau
RETURN count(a.SK_ID_CURR) AS ApplicationID, avg(TotalInstallments), avg(DaysInstallMinusEntry), avg(TotalBureau),
       avg(toFloat(TotalActiveBureau)/toFloat(TotalBureau)) AS ActiveBureauByTotalBureauRatio
ORDER BY ActiveBureauByTotalBureauRatio DESC

It is a pattern comprehension, almost like a subquery: it can contain patterns with new variables, WHERE clauses, and a result expression, and it yields a list of values.
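A contrived sketch showing all three pieces (the labels, relationship type, and properties here are made up):

// new variable m, a WHERE clause, and a result expression, all in one list
MATCH (p:Person)
RETURN p.name,
       [(p)-[:ACTED_IN]->(m:Movie) WHERE m.released > 2000 | m.title] AS recentTitles

You get one list of titles per person, with no extra rows and no OPTIONAL MATCH needed.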

Hi,
I need to run a simple match query:

MATCH (n:label)-[:has_example]->(m:label2)
WITH m, count(n.property) AS c, collect(n.property) AS c2
WHERE c > 1000
RETURN m.property, c, c2 ORDER BY c DESC

on 45 million nodes ...

My server is not able to run it. How can I make it run, via parallelism or some other way?

Please post a separate Question and then you can link it here.