I am using the dataset from Kaggle.
As I had received an error prompting me to increase the heap memory size, I assigned a heap size of 16GB and a page cache size of 1GB, as I have a total of 32GB of RAM. I left this query running for an hour but got no result, so I terminated the process.
OPTIONAL MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment)
WITH toInteger(i.DAYS_INSTALMENT) - toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments
OPTIONAL MATCH (a)-[:HAS_BUREAU]->(bureau:Bureau)
WITH size(FILTER(x IN collect(bureau.CREDIT_ACTIVE) WHERE x = "Active")) AS TotalActiveBureau,
     a.SK_ID_CURR AS ApplicationID,
     count(bureau) AS TotalBureau,
     TotalInstallments, DaysInstallMinusEntry
RETURN ApplicationID, TotalInstallments, DaysInstallMinusEntry, TotalBureau,
       toFloat(TotalActiveBureau) / toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio
ORDER BY ActiveBureauByTotalBureauRatio DESC
Editing in progress; it seems the function signature has changed since I last used it.
Hello,
You need to put your query between quotes, e.g. `CALL apoc.cypher.parallel("match (n) return n limit 1")`.
What about PROFILE? Can you post a screenshot or an export? (Sometimes the export does not work.)
I would also save all calculations as new attributes on the nodes rather than computing them inside the query. You could also use virtual nodes if you don't want to alter your graph: do the calculation, create a virtual node for each of your nodes, then run the query on the resulting graph.
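As a minimal sketch of the "precompute and store" idea (the `totalInstallments` property name is illustrative, and this assumes the APOC plugin is installed), the installment count could be persisted once in batches so the analytical query only reads a property:

```cypher
// Precompute once and persist, so later queries just read a property.
// apoc.periodic.iterate batches the updates to keep heap pressure low.
CALL apoc.periodic.iterate(
  "MATCH (a:Application) RETURN a",
  "OPTIONAL MATCH (a)-[:HAS_PREVIOUS]->(:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment)
   WITH a, count(i) AS total
   SET a.totalInstallments = total",
  {batchSize: 10000, parallel: false});
```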
Do you have your database available somewhere for download or testing?
How big is the db in terms of nodes and relationships?
You can use EXPLAIN instead of PROFILE to get an idea of the query complexity without executing it.
Please format your query properly next time. It's not polite to expect others to make sense of a mess of code.
Usually you would build such a query step by step, checking PROFILE and EXPLAIN as you go, and often just returning the total count(*) from a pattern, so that you always have a good understanding of the cost.
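For example, profiling one hop at a time (patterns taken from the query above) keeps the cost visible at each step:

```cypher
// Step 1: how many Application -> Previous_Application pairs exist?
PROFILE
MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)
RETURN count(*);

// Step 2: extend by one more hop and compare the row counts.
PROFILE
MATCH (a:Application)-[:HAS_PREVIOUS]->(:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment)
RETURN count(*);
```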
Tip: turn bureau.CREDIT_ACTIVE = "Active" into a label.
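A minimal sketch of that label conversion (the `ActiveBureau` label name is just an example, not from the thread):

```cypher
// One-off migration: tag active bureau records with a label,
// so the hot query can count by label instead of reading a property.
MATCH (b:Bureau)
WHERE b.CREDIT_ACTIVE = "Active"
SET b:ActiveBureau;

// The ratio numerator then becomes a cheap pattern count:
MATCH (a:Application)
RETURN a.SK_ID_CURR AS ApplicationID,
       size([ (a)-[:HAS_BUREAU]->(b:ActiveBureau) | b ]) AS TotalActiveBureau;
```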
Problem: you missed `a` in the first WITH.
I formatted and simplified it:
- replaced the OPTIONAL MATCH with a pattern comprehension, which computes all statuses as a list of true/false
- saved the second count by just filtering that list
- added `a` to the first `WITH`
- only access properties as late as possible
Here is the statement:
MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment)
WITH a, toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
count(i) AS TotalInstallments
WITH *, [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active" ] AS bureauStatus
WITH DaysInstallMinusEntry, TotalInstallments,
     size([status IN bureauStatus WHERE status]) AS TotalActiveBureau,
     size(bureauStatus) AS TotalBureau, a
RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments, DaysInstallMinusEntry, TotalBureau,
toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio
ORDER BY ActiveBureauByTotalBureauRatio DESC
Hi, thank you for the detailed information. I apologize for the badly formatted query; I am new to complex queries.
I have a total of 52,695,506 nodes and 52,339,251 relationships.
I will also make the database available for download and provide you with the link.
The reason I used OPTIONAL MATCH is that an Application node may or may not have a relationship to a Bureau node, and a Previous_Application node may or may not have one to an Installments_Payment node.
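As a minimal illustration of that difference (a sketch, not from the thread): a plain MATCH drops applications without bureau rows, while OPTIONAL MATCH keeps them with a NULL bureau:

```cypher
// Applications with no HAS_BUREAU relationship still appear,
// with bureauCount = 0, because count(b) ignores NULLs.
MATCH (a:Application)
OPTIONAL MATCH (a)-[:HAS_BUREAU]->(b:Bureau)
RETURN a.SK_ID_CURR AS ApplicationID, count(b) AS bureauCount;
```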
The result of using EXPLAIN on the query I mentioned with my question:
This dataset is not really intended for processing in a graph database, as it is a minimal hierarchy with a lot of non-hierarchy-related data superimposed over and over again, millions of times.
But that's it; there's no connection between these hierarchies, or at least not yet. @michael.hunger - would it be better for memory consumption to break a node with a lot of attributes into multiple nodes, so that the node loaded in memory contains only a minimal set of data?
From what it seems, the model does not fit into RAM.
Secondly, is there any benefit if he links all categorical values instead of keeping them as attributes?
And third, related to 2 and extending it: would a value graph help?
Not sure how many attributes there are.
Sometimes it helps to split out less frequently used attributes,
but we also try to minimize how many records we load, as the property records are separate from the node records.
Many value-graph elements turn into dense nodes, which can be detrimental for updates and for querying across them (as you have to post-filter by the end node of those relationships).
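For reference, a value-graph version of the CREDIT_ACTIVE attribute would look roughly like this (the `CreditStatus` node label and `HAS_STATUS` relationship type are hypothetical names). Every Bureau node sharing a value points at one shared node, which is exactly how that node becomes dense:

```cypher
// Hypothetical value-graph modeling: one shared node per categorical value.
MERGE (active:CreditStatus {value: "Active"});

MATCH (b:Bureau {CREDIT_ACTIVE: "Active"})
MATCH (active:CreditStatus {value: "Active"})
MERGE (b)-[:HAS_STATUS]->(active);
// With millions of Bureau nodes, `active` ends up with millions of
// incoming relationships - a dense node.
```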
Seems to be a bug; if you replace the * with the explicit variables, it works for me:
MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment)
WITH a, toInteger(i.DAYS_INSTALMENT) - toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments
WITH a, DaysInstallMinusEntry, TotalInstallments,
     [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active" ] AS bureauStatus
WITH DaysInstallMinusEntry, TotalInstallments,
     size([status IN bureauStatus WHERE status]) AS TotalActiveBureau,
     size(bureauStatus) AS TotalBureau, a
RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments, DaysInstallMinusEntry, TotalBureau,
       toFloat(TotalActiveBureau) / toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio
ORDER BY ActiveBureauByTotalBureauRatio DESC
Hmmm, not related to your query, but I often also get black screens when running long queries on AWS.
Are you running it on your personal computer?
Secondly, I'm running Michael's query.
Third, is your heap still at 1G? dbms.memory.heap.max_size=1G
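For reference, the 16GB heap / 1GB page cache described in the first post would correspond to roughly these neo4j.conf settings (a sketch; exact property names vary slightly between Neo4j versions):

```
# Settings as described in the first post (32GB total RAM):
dbms.memory.heap.initial_size=16G
dbms.memory.heap.max_size=16G
dbms.memory.pagecache.size=1G
```

Note that with ~52 million nodes the store files almost certainly exceed 1GB, so a larger page cache (at the expense of some heap) may help this kind of whole-graph scan.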