Hello everyone,
I am currently working with a Neo4j database that contains billions of nodes, and my system has only 16GB of physical RAM. I am running into memory issues, especially when running complex queries that involve large-scale node aggregations; the system often returns "Neo.TransientError.General.MemoryPoolOutOfMemoryError".
Here is an example of the query logic I am using:
MATCH (cs:CATALOG_SALES), (c:CUSTOMER), (ca:CUSTOMER_ADDRESS), (d:DATE_DIM)
WHERE cs.cs_bill_customer_sk = c.c_customer_sk
AND c.c_current_addr_sk = ca.ca_address_sk
AND cs.cs_sold_date_sk = d.d_date_sk
AND (ca.ca_zip IN [85669, 86197, 88274, 83405, 86475, 85392, 85460, 80348, 81792]
OR ca.ca_state IN ['CA', 'WA', 'GA']
OR cs.cs_sales_price > 500)
AND d.d_qoy = 2
AND d.d_year = 2001
RETURN ca.ca_zip AS ca_zip, SUM(cs.cs_sales_price) AS total_sales_price
ORDER BY ca_zip
LIMIT 100;
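In case it helps, this is how I have been looking at the plan: a reduced version of the same statement (joins plus the date predicates only), prefixed with EXPLAIN so it is planned but not executed (PROFILE would also execute it and report per-operator rows and db hits):

// Reduced sketch of the query above, EXPLAIN only - nothing runs.
EXPLAIN
MATCH (cs:CATALOG_SALES), (c:CUSTOMER), (ca:CUSTOMER_ADDRESS), (d:DATE_DIM)
WHERE cs.cs_bill_customer_sk = c.c_customer_sk
AND c.c_current_addr_sk = ca.ca_address_sk
AND cs.cs_sold_date_sk = d.d_date_sk
AND d.d_qoy = 2
AND d.d_year = 2001
RETURN count(*) AS matched_rows;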
I understand that this is not the most optimal query approach, but for now I need to build joins on key properties instead of using edges. Under this approach, performance issues become more prominent, and I frequently run into memory limits on similar queries.
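To make clear what I mean by "building joins": the labels are not connected by relationships, so I correlate node sets purely on key properties. A minimal two-label sketch of the pattern (same property names as above):

// Property-based "join": no relationship exists between the labels,
// the correlation happens only through the *_sk key properties.
MATCH (c:CUSTOMER), (ca:CUSTOMER_ADDRESS)
WHERE c.c_current_addr_sk = ca.ca_address_sk
RETURN c.c_customer_sk AS customer, ca.ca_zip AS zip
LIMIT 10;

// The edge-based equivalent would be a traversal like the one below,
// but the HAS_ADDRESS relationship type is hypothetical - it does not
// exist in my current data model.
// MATCH (c:CUSTOMER)-[:HAS_ADDRESS]->(ca:CUSTOMER_ADDRESS)
// RETURN c.c_customer_sk AS customer, ca.ca_zip AS zip LIMIT 10;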
To address this, I have already tried the following optimizations:
- Adjusting memory settings:
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=11G
dbms.memory.pagecache.size=4G
- Using indexes: I've created indexes on the main fields involved in the queries.
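For reference, the index statements were roughly along these lines (Neo4j 4.x CREATE INDEX syntax; the index names are just placeholders I picked):

CREATE INDEX catalog_sales_bill_customer IF NOT EXISTS FOR (cs:CATALOG_SALES) ON (cs.cs_bill_customer_sk);
CREATE INDEX catalog_sales_sold_date IF NOT EXISTS FOR (cs:CATALOG_SALES) ON (cs.cs_sold_date_sk);
CREATE INDEX customer_sk IF NOT EXISTS FOR (c:CUSTOMER) ON (c.c_customer_sk);
CREATE INDEX customer_current_addr IF NOT EXISTS FOR (c:CUSTOMER) ON (c.c_current_addr_sk);
CREATE INDEX customer_address_sk IF NOT EXISTS FOR (ca:CUSTOMER_ADDRESS) ON (ca.ca_address_sk);
CREATE INDEX date_dim_year_qoy IF NOT EXISTS FOR (d:DATE_DIM) ON (d.d_year, d.d_qoy);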
Despite these efforts, I still hit memory bottlenecks in some query scenarios. Given that my system only has 16GB of RAM, I would appreciate any additional optimization strategies or best practices for handling large-scale datasets in this context.
Also, any suggestions regarding data modeling, query optimization, or best practices for efficiently building joins in a graph database would be greatly appreciated! Thank you all in advance!