How to efficiently query over 100 million nodes on a system with 16GB RAM?

Hello everyone,

I am currently working with a Neo4j database that contains billions of nodes, and my system has only 16GB of physical RAM. I'm running into memory issues, especially when executing complex queries that involve large-scale node aggregations. The system often returns "Neo.TransientError.General.MemoryPoolOutOfMemoryError".

Here is an example of the query logic I am using:

MATCH (cs:CATALOG_SALES), (c:CUSTOMER), (ca:CUSTOMER_ADDRESS), (d:DATE_DIM)
WHERE cs.cs_bill_customer_sk = c.c_customer_sk
AND c.c_current_addr_sk = ca.ca_address_sk
AND cs.cs_sold_date_sk = d.d_date_sk
AND (ca.ca_zip IN [85669, 86197, 88274, 83405, 86475, 85392, 85460, 80348, 81792]
OR ca.ca_state IN ['CA', 'WA', 'GA']
OR cs.cs_sales_price > 500)
AND d.d_qoy = 2
AND d.d_year = 2001
RETURN ca.ca_zip AS ca_zip, SUM(cs.cs_sales_price) AS total_sales_price
ORDER BY ca_zip
LIMIT 100;

I understand that this is not the optimal query approach, but for now I need to build joins instead of using edges. Under this approach the performance issues become more pronounced, and I frequently run into memory limits when running similar queries.

To address this, I have already tried the following optimizations:

  1. Adjusting memory settings:
  • dbms.memory.heap.initial_size=4G
  • dbms.memory.heap.max_size=11G
  • dbms.memory.pagecache.size=4G
  2. Using indexes: I've created indexes on the main fields involved in the queries (roughly as sketched below).
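
For reference, the indexes look roughly like this (the index names are arbitrary; the properties are the ones used in the query above):

CREATE INDEX cs_bill_customer_sk IF NOT EXISTS FOR (cs:CATALOG_SALES) ON (cs.cs_bill_customer_sk);
CREATE INDEX cs_sold_date_sk IF NOT EXISTS FOR (cs:CATALOG_SALES) ON (cs.cs_sold_date_sk);
CREATE INDEX c_customer_sk IF NOT EXISTS FOR (c:CUSTOMER) ON (c.c_customer_sk);
CREATE INDEX ca_address_sk IF NOT EXISTS FOR (ca:CUSTOMER_ADDRESS) ON (ca.ca_address_sk);
CREATE INDEX d_date_sk IF NOT EXISTS FOR (d:DATE_DIM) ON (d.d_date_sk);
CREATE INDEX d_year_qoy IF NOT EXISTS FOR (d:DATE_DIM) ON (d.d_year, d.d_qoy);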

Despite these efforts, I still hit memory bottlenecks in some query scenarios. Given that my system only has 16GB of RAM, I would appreciate any additional optimization strategies or best practices for handling large-scale datasets in this context.

Also, any suggestions regarding data modeling, query optimization, or best practices for efficiently building joins in a graph database would be greatly appreciated! Thank you all in advance!

@Ran017

What version of Neo4j?

Have you tried to PROFILE the query to understand how it is being processed?
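
For example (just a sketch: EXPLAIN only produces the plan without executing anything, while PROFILE actually runs the query and adds real row counts and db hits, so on the full dataset PROFILE can hit the same memory error):

// Look for CartesianProduct and NodeByLabelScan operators in the plan;
// those are usually where the row counts (and the memory) explode.
EXPLAIN
MATCH (cs:CATALOG_SALES), (c:CUSTOMER), (ca:CUSTOMER_ADDRESS), (d:DATE_DIM)
WHERE cs.cs_bill_customer_sk = c.c_customer_sk
AND c.c_current_addr_sk = ca.ca_address_sk
AND cs.cs_sold_date_sk = d.d_date_sk
RETURN count(*);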

I understand that this is not the optimal query approach, but for now I need to build joins instead of using edges.

That just seems to defeat any and all benefit a graph database with edges offers.

  1. Adjusting memory settings:
  • dbms.memory.heap.initial_size=4G
  • dbms.memory.heap.max_size=11G
  • dbms.memory.pagecache.size=4G

Yet the documentation suggests that the min/max heap should be set to the same value.

The heap memory size is determined by the parameters server.memory.heap.initial_size and server.memory.heap.max_size. It is recommended to set these two parameters to the same value to avoid unwanted full garbage collection pauses.
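
In neo4j.conf that would look something like the lines below (the numbers are only illustrative for a 16GB machine; the right split between heap and page cache depends on your store size and workload):

# Neo4j 5.x setting names; keep initial and max heap identical
server.memory.heap.initial_size=6g
server.memory.heap.max_size=6g
# give the page cache as much of the remainder as possible, leaving a few GB for the OS
server.memory.pagecache.size=6g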

Not that it's going to solve your memory consumption issues, etc.

I agree with Dana here... Neo4j, a graph database, is optimized for graphs and for capturing information via relationships (edges). What you're doing here looks like an attempt to make Neo4j behave like a relational database, and I'm guessing that when you profile the query, you're going to see that this "isn't a great idea" 🙂

Thank you for your reply.
I am currently using Neo4j version 5.20.0. I have already used PROFILE on small datasets, but now I need to run the query on a much larger dataset.
Following your advice on the memory settings, I have now changed heap.initial_size to 11G as well, so it matches max_size.

Thank you for your input, and I completely understand that this approach may not be ideal for Neo4j. However, due to the specific requirements of my project, I still need to proceed with this method...

That being said, if anyone has any tips or suggestions for optimizing memory usage in this situation, I'd really appreciate your advice!

@Ran017

also you indicate

  • dbms.memory.pagecache.size=4G

how large is your database?

I fully agree with @john.stegeman. You are trying to perform inner joins, and that is not the use case Neo4j is designed for. You should replace your use of primary and foreign keys with relationships, roughly along the lines of the sketch below.
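
For illustration only (the relationship type names BILLED_TO, HAS_ADDRESS and SOLD_ON below are made up; use whatever fits your model), the foreign keys could be materialised as relationships once, and the query then becomes a traversal instead of a join:

// Build the edges in batches so a single transaction never holds millions of changes.
// Run as an auto-commit statement (:auto in Browser); Neo4j 5 batching syntax.
MATCH (cs:CATALOG_SALES)
CALL {
  WITH cs
  MATCH (c:CUSTOMER {c_customer_sk: cs.cs_bill_customer_sk})
  MERGE (cs)-[:BILLED_TO]->(c)
} IN TRANSACTIONS OF 10000 ROWS;

// Repeat the same idea for HAS_ADDRESS (CUSTOMER -> CUSTOMER_ADDRESS)
// and SOLD_ON (CATALOG_SALES -> DATE_DIM).

// The original query, rewritten to traverse relationships instead of joining on key properties:
MATCH (ca:CUSTOMER_ADDRESS)<-[:HAS_ADDRESS]-(c:CUSTOMER)<-[:BILLED_TO]-(cs:CATALOG_SALES)-[:SOLD_ON]->(d:DATE_DIM)
WHERE d.d_qoy = 2 AND d.d_year = 2001
AND (ca.ca_zip IN [85669, 86197, 88274, 83405, 86475, 85392, 85460, 80348, 81792]
     OR ca.ca_state IN ['CA', 'WA', 'GA']
     OR cs.cs_sales_price > 500)
RETURN ca.ca_zip AS ca_zip, SUM(cs.cs_sales_price) AS total_sales_price
ORDER BY ca_zip
LIMIT 100;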