How to fetch millions of rows faster?

Yes, it's on my personal computer, which is running with a 16G heap and a 1G page cache.

Thank you for your time. I am planning to use Python as the client program. Even though I increased the heap, it didn't help with this query.

So I spent two hours working through your query; after all, it is a gigantic full-graph query that touches almost every entity in the graph.

I did a bunch of optimizations:

  • 9G page cache, 8G heap
  • store the delta directly on the node (as a numeric value)
  • use a label to tag active bureaus
  • load the installments with their delta before expanding
  • turn the OPTIONAL MATCH into a pattern comprehension
  • store the statuses in the list, which saves the second expand to bureaus
  • then just filter the list for the status

For the initial updates:

call apoc.periodic.iterate("MATCH (i:Installments_Payment) RETURN i",
"SET i.DaysInstallMinusEntry = toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT)"
,{batchSize:10000,iterateList:true,parallel:true,concurrency:12});

// takes 88s

call apoc.periodic.iterate("MATCH (bureau:Bureau) WHERE bureau.CREDIT_ACTIVE = 'Active' RETURN bureau",
"SET bureau:Active"
,{batchSize:10000,iterateList:true,parallel:true,concurrency:12});
// takes 12s

The runtime on my computer is between 5 and 7 minutes; here is the final result:

from neo4j.v1 import GraphDatabase
import time


driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))

statement = """
MATCH (i:Installments_Payment) 
WITH i, toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry
MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i) 

WITH a, DaysInstallMinusEntry, count(i) AS TotalInstallments
WITH a,DaysInstallMinusEntry,TotalInstallments, [ (a)-[:HAS_BUREAU]->(bureau) | bureau:Active] as bureauStatus

WITH DaysInstallMinusEntry, TotalInstallments, 
     size([status in bureauStatus WHERE status]) AS TotalActiveBureau, 
     size(bureauStatus) as TotalBureau, a

RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments,DaysInstallMinusEntry,TotalBureau, 	
	       toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio 
ORDER BY ActiveBureauByTotalBureauRatio DESC
"""

start = time.time()
rows = 0
value = 0
for record in driver.session().run(statement):
    rows = rows + 1
    value = value + record["ActiveBureauByTotalBureauRatio"]

print("rows {rows}".format(rows=rows))
print("took {time}".format(time=time.time() - start))


rows 5339018
took 450.4561972618103 -> ~7.5 minutes

Here are all the details.


So, the black screen is because the query has actually finished and the Browser is trying to retrieve the result stream... which, being too big, crashes the Browser.
Hence I've wrapped all the results in aggregating expressions (count, sum, avg, etc.) so that the result stream is a single row, as sketched below.
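For illustration, the tail of the big query above could be collapsed the same way; a minimal sketch, reusing the variable names from that query:

// instead of one row per application, return a single summary row
RETURN count(*) AS rows,
       avg(toFloat(TotalActiveBureau)/toFloat(TotalBureau)) AS avgActiveBureauRatio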

Managed to run the query in 8+ minutes (AMD Threadripper 1900X, 8 cores x 3.9GHz, 32GB RAM 3200 CL14, M.2 NVMe). I've noticed it's actually running on a single core, which boosts to 4GHz.

The trick is to run outside of index scans, meaning:
MATCH (a:Application)
WITH collect(a) AS aa
UNWIND aa AS a
MATCH (a:Application)-[:HAS_PREVIOUS]....

I actually wonder how long it takes to output the entire result set.

Neo4j browser is not meant to retrieve millions of rows.

See my code; it takes 7-8 minutes from Python.

You just need to stream and write the data somewhere or analyze it further e.g. with pandas.
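If you'd rather write the stream straight to disk than pull it through a client, APOC can also export a query's results to CSV. A minimal sketch (assumes APOC is installed with apoc.export.file.enabled=true; the file name and the shortened query are placeholders):

CALL apoc.export.csv.query(
  "MATCH (a:Application) RETURN a.SK_ID_CURR AS ApplicationID",  // put the full query here
  "results.csv", {});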

Yes, it is currently a single-processor query; this might change in the future with a newer runtime.


Your "trick" has no effect here.

If you look at the profile you don't save any db-hits.
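To check this yourself, prefix both variants with PROFILE and compare the db-hits per operator; a minimal sketch:

// run both versions under PROFILE and compare the plans
PROFILE
MATCH (a:Application)
WITH collect(a) AS aa
UNWIND aa AS a
RETURN count(a);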

Michael, Erika, what kind of operation is this: "with ... [(a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = 'Active'] as bureauStatus"? I haven't seen it before.

A different approach, with worse timings: 11 min.

MATCH (a:Application) WITH a
CALL apoc.path.subgraphAll(a, {relationshipFilter:'HAS_PREVIOUS|HAS_PREV_INSTALLMENTS|HAS_BUREAU'}) YIELD nodes, relationships
WITH a,
     [(a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = 'Active' AND bureau IN filter(n IN nodes WHERE n:Bureau)] AS bureauStatus,
     filter(n IN nodes WHERE n:Installments_Payment) AS ii
WITH size([status IN bureauStatus WHERE status]) AS TotalActiveBureau, ii, bureauStatus, a
UNWIND ii AS i
WITH size(bureauStatus) AS TotalBureau, a,
     toInteger(i.DAYS_INSTALMENT) - toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments, TotalActiveBureau
RETURN count(a.SK_ID_CURR) AS ApplicationID, avg(TotalInstallments), avg(DaysInstallMinusEntry), avg(TotalBureau),
       avg(toFloat(TotalActiveBureau)/toFloat(TotalBureau)) AS ActiveBureauByTotalBureauRatio
ORDER BY ActiveBureauByTotalBureauRatio DESC

It is a pattern comprehension, almost like a subquery: it can contain patterns with new variables, WHERE clauses, and a result expression, and it yields a list of values.
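A contrived sketch showing all three pieces (the labels, relationship type, and properties here are made up):

// new variable m, a WHERE clause, and a result expression, all in one list
MATCH (p:Person)
RETURN p.name,
       [(p)-[:ACTED_IN]->(m:Movie) WHERE m.released > 2000 | m.title] AS recentTitles

You get one list of titles per person, with no extra rows and no OPTIONAL MATCH needed.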

Hi,
I need to run a simple match query:

MATCH (n:label)-[:has_example]->(m:label2)
WITH m, count(n.property) AS c, collect(n.property) AS c2
WHERE c > 1000
RETURN m.property, c, c2 ORDER BY c DESC

on 45 million nodes ...

My server is not able to run it. How can I make it run, via parallelism or some other way?

Please post a separate Question and then you can link it here.