How to efficiently query over 100 million nodes on a system with 16GB RAM?

Hello everyone,

I am currently working with a Neo4j database that contains billions of nodes, and my system has only 16GB of physical RAM. I'm running into memory issues, especially when executing complex queries that involve large-scale node aggregations. The system often returns "Neo.TransientError.General.MemoryPoolOutOfMemoryError".

Here is an example of the query logic I am using:

MATCH (cs:CATALOG_SALES), (c:CUSTOMER), (ca:CUSTOMER_ADDRESS), (d:DATE_DIM)
WHERE cs.cs_bill_customer_sk = c.c_customer_sk
AND c.c_current_addr_sk = ca.ca_address_sk
AND cs.cs_sold_date_sk = d.d_date_sk
AND (ca.ca_zip IN [85669, 86197, 88274, 83405, 86475, 85392, 85460, 80348, 81792]
OR ca.ca_state IN ['CA', 'WA', 'GA']
OR cs.cs_sales_price > 500)
AND d.d_qoy = 2
AND d.d_year = 2001
RETURN ca.ca_zip AS ca_zip, SUM(cs.cs_sales_price) AS total_sales_price
ORDER BY ca_zip
LIMIT 100;

I understand that this is not the optimal query approach, but for now I need to build joins instead of using edges. Under this approach the performance issues become more pronounced, and I frequently run into memory limits when running similar queries.

To address this, I have already tried the following optimizations:

  1. Adjusting memory settings:
  • dbms.memory.heap.initial_size=4G
  • dbms.memory.heap.max_size=11G
  • dbms.memory.pagecache.size=4G
  2. Using indexes: I've created indexes on the main fields involved in the queries (roughly as sketched below).
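
For reference, the indexes look roughly like this (the index names are arbitrary; the properties are the ones used in the query above):

CREATE INDEX cs_bill_customer_sk IF NOT EXISTS FOR (cs:CATALOG_SALES) ON (cs.cs_bill_customer_sk);
CREATE INDEX cs_sold_date_sk IF NOT EXISTS FOR (cs:CATALOG_SALES) ON (cs.cs_sold_date_sk);
CREATE INDEX c_customer_sk IF NOT EXISTS FOR (c:CUSTOMER) ON (c.c_customer_sk);
CREATE INDEX ca_address_sk IF NOT EXISTS FOR (ca:CUSTOMER_ADDRESS) ON (ca.ca_address_sk);
CREATE INDEX d_date_sk IF NOT EXISTS FOR (d:DATE_DIM) ON (d.d_date_sk);
CREATE INDEX d_year_qoy IF NOT EXISTS FOR (d:DATE_DIM) ON (d.d_year, d.d_qoy);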

Despite these efforts, I still hit memory bottlenecks in some query scenarios. Given that my system only has 16GB of RAM, I would appreciate any additional optimization strategies or best practices for handling large-scale datasets in this context.

Also, any suggestions regarding data modeling, query optimization, or best practices for efficiently building joins in a graph database would be greatly appreciated! Thank you all in advance!

@Ran017

What version of Neo4j?

Have you tried to PROFILE the query to understand how it is being processed?
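
For example (just a sketch: EXPLAIN only produces the plan without executing anything, while PROFILE actually runs the query and adds real row counts and db hits, so on the full dataset PROFILE can hit the same memory error):

// Look for CartesianProduct and NodeByLabelScan operators in the plan;
// those are usually where the row counts (and the memory) explode.
EXPLAIN
MATCH (cs:CATALOG_SALES), (c:CUSTOMER), (ca:CUSTOMER_ADDRESS), (d:DATE_DIM)
WHERE cs.cs_bill_customer_sk = c.c_customer_sk
AND c.c_current_addr_sk = ca.ca_address_sk
AND cs.cs_sold_date_sk = d.d_date_sk
RETURN count(*);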

I understand that this is not the optimal query approach, but for now I need to build joins instead of using edges.

That just seems to defeat any and all benefit a graph database with edges offers.

  1. Adjusting memory settings:
  • dbms.memory.heap.initial_size=4G
  • dbms.memory.heap.max_size=11G
  • dbms.memory.pagecache.size=4G

Yet the documentation suggests that the min/max heap should be set to the same value.

The heap memory size is determined by the parameters server.memory.heap.initial_size and server.memory.heap.max_size. It is recommended to set these two parameters to the same value to avoid unwanted full garbage collection pauses.
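
In neo4j.conf that would look something like the lines below (the numbers are only illustrative for a 16GB machine; the right split between heap and page cache depends on your store size and workload):

# Neo4j 5.x setting names; keep initial and max heap identical
server.memory.heap.initial_size=6g
server.memory.heap.max_size=6g
# give the page cache as much of the remainder as possible, leaving a few GB for the OS
server.memory.pagecache.size=6g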

Not that it's going to solve your memory consumption issues, etc.

I agree with Dana here... Neo4j, a graph database, is optimized for graphs and for capturing information via relationships (edges). What you're doing here looks like an attempt to make Neo4j behave like a relational database, and I'm guessing that when you profile the query, you're going to see that this "isn't a great idea" 🙂

Thank you for your reply.
I am currently using Neo4j version 5.20.0. I have already used PROFILE on small datasets, but now I need to run the query on a much larger dataset.
Following your advice on the memory settings, I have now changed heap.initial_size to 11G as well, so it matches max_size.

Thank you for your input, and I completely understand that this approach may not be ideal for Neo4j. However, due to the specific requirements of my project, I still need to proceed with this method...

That being said, if anyone has any tips or suggestions for optimizing memory usage in this situation, I'd really appreciate your advice!

@Ran017

also you indicate

  • dbms.memory.pagecache.size=4G

how large is your database?

I fully agree with @john.stegeman. You are trying to perform inner joins, and that is not the use case Neo4j is designed for. You should replace your use of primary and foreign keys with relationships, roughly along the lines of the sketch below.
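
For illustration only (the relationship type names BILLED_TO, HAS_ADDRESS and SOLD_ON below are made up; use whatever fits your model), the foreign keys could be materialised as relationships once, and the query then becomes a traversal instead of a join:

// Build the edges in batches so a single transaction never holds millions of changes.
// Run as an auto-commit statement (:auto in Browser); Neo4j 5 batching syntax.
MATCH (cs:CATALOG_SALES)
CALL {
  WITH cs
  MATCH (c:CUSTOMER {c_customer_sk: cs.cs_bill_customer_sk})
  MERGE (cs)-[:BILLED_TO]->(c)
} IN TRANSACTIONS OF 10000 ROWS;

// Repeat the same idea for HAS_ADDRESS (CUSTOMER -> CUSTOMER_ADDRESS)
// and SOLD_ON (CATALOG_SALES -> DATE_DIM).

// The original query, rewritten to traverse relationships instead of joining on key properties:
MATCH (ca:CUSTOMER_ADDRESS)<-[:HAS_ADDRESS]-(c:CUSTOMER)<-[:BILLED_TO]-(cs:CATALOG_SALES)-[:SOLD_ON]->(d:DATE_DIM)
WHERE d.d_qoy = 2 AND d.d_year = 2001
AND (ca.ca_zip IN [85669, 86197, 88274, 83405, 86475, 85392, 85460, 80348, 81792]
     OR ca.ca_state IN ['CA', 'WA', 'GA']
     OR cs.cs_sales_price > 500)
RETURN ca.ca_zip AS ca_zip, SUM(cs.cs_sales_price) AS total_sales_price
ORDER BY ca_zip
LIMIT 100;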