Filtering and Aggregation operation

sagar.chaudhari · November 9, 2023, 7:08am

Hi
As i mentioned in the attachment I have this kind of schema.
I struggling with performance issue on filtering query and count query with 100 millions of node and there relationship.
For count i am using separate query and to get all the nodes under
Transaction with there properties, i am using another query.

Here suppose in this manner

MATCH(txn:Txn)-[r1:NPCI_TRX_PROP{transaction_id:txn.transaction_id}]->(res:ResponseStatus),
(txn)-[r2:NPCI_TRX_PROP{transaction_id:txn.transaction_id}]->(sm:SubMerchant),
(txn)-[r6:NPCI_TRX_PROP{transaction_id:txn.transaction_id}]->(categoryCode:MCC)
RETURN COUNT(txn)

it takes around 45 minutes or sometimes it giver memory issue

glilienfield · November 9, 2023, 10:32am

I have two questions:

What is the purpose of matching each of the paths originating from each transaction when you are returning only the count of the transactions? Is it to ensure that you count only transactions that have each path? I ask because the structure of your query will produce a Cartesian product of the three match results. This would be a concern if any or all of the match statements is returning more than one row. This can impact performance and require memory to cache all the results.

If your intent of the query is to ensure you count a transaction only if it has the three paths then you should use the “exits” predicate to filter base on each of the paths. This will eliminate the Cartesian product and I suspect require less memory since there is not a need to cache the paths, but only determine if they exist.

MATCH(txn:Txn)
WHERE
exists( (txn)-[r1:NPCI_TRX_PROP{transaction_id:txn.transaction_id}]->(res:ResponseStatus) )
and exists( (txn)-[r2:NPCI_TRX_PROP{transaction_id:txn.transaction_id}]->(sm:SubMerchant) ) 
and exists( (txn)-[r6:NPCI_TRX_PROP{transaction_id:txn.transaction_id}]->(categoryCode:MCC) )
RETURN COUNT(txn)

What would be the purpose of the above query? Why don’t you just want the count of the Txn nodes instead?

Why do you need to filter the relationships based on their transaction_id property? It would seem that a given transaction would only have one relationship of type NPCI_TRX_PROP to a ResponseStatus node, since you are constraining the relationship to have its transaction_id equal to the value of the transaction’s transaction_id. If true, the constrain is not necessary. Removing it will improve performance because it will no longer be necessary to check this constraint. The same comment is true for the other two relationships pointing to SubMerchant and MCC nodes. Further, I question the need for the relationship’s transaction_id since it is always equal to the transaction node’s transaction_id. This is just duplicate data that now requires you to be diligent in ensuring the two values are equal.

What is the other query?

sagar.chaudhari · November 9, 2023, 12:01pm

Actually I have
{

"transaction_id":"123ABC321XYZ",
"day":"01",
"month":"11",
"year":"2023",
"submerchant_id":"sub@vpa.com",
"response_code":"1",
"payee":"qaz@gpay.in",
"payer":"zxc@gpay.in"
}
this kind of 500 millions of json and i dumped them in db with above format (shown in above image).
Now on this kind of structure i have to perform single parameter based filter or multi parameter based filter (parameter means keys present in json but in neo4j I have created labels for parameters).
COUNT is required to display total number of records who matched the conditions.

and to achieve pagination kind of functionality.

Please provide any guidance and suggestion or any reference to understand neo4j in better manner. Neo4j (Graph Database) is new for me.

glilienfield · November 9, 2023, 1:01pm

This seems like a use case compatible with a relational database. What is the driver for a graph database?

There is no need to create relationship to date nodes and such unless the relationship brings value. In your case you can store the json values as node properties and search on them using indexes.

What are you trying to achieve?

sagar.chaudhari · November 9, 2023, 1:32pm

I am trying to achieve all the transactions records display on UI in list format and provide search feature based multiple parameters.

and for API creation i am using Node-Js, through node.js i am connecting to Neo4j DB.

To handle 500 million of node

glilienfield · November 9, 2023, 4:54pm

If this is all you are doing, I think you could have used a relational database or time series database.

Anyway. It seems like storing these values as node properties would work. You can add indexes for each you need to search on

sagar.chaudhari · November 10, 2023, 4:26am

Yes, initially i used the node properties and indexing approach.
But to get fast result if I create single node and separate node for there properties with relationship. but it didn't go well.

I will refer time-database but I want to learn neo4j data modelling in better manner if you have any resource please suggest here

Topic		Replies	Views
Too slowly recount nodes and relationships Cypher	3	528	November 29, 2019
Slow subsequent filtering query Neo4j Graph Platform migrated	2	107	June 23, 2022
Slow cypher Cypher	7	513	June 23, 2021
Multiple match takes too long to execute Cypher	4	130	June 28, 2024
A slow running cypher query Cypher performance	8	5190	February 10, 2020

July Summer Fun!

Filtering and Aggregation operation

Related topics