200 million nodes: what is slowing down this query and causing Neo.TransientError.General.OutOfMemoryError?

Dear community

I have issues with the performance of the graph and of queries when dealing with many millions of nodes.
The goal will probably be to handle billions of nodes after this PoC.

I have a larger query which does not perform at all since I added many more nodes.
So I boiled the issue down to the problem that I either have too many nodes or too small an instance for the database (a 32 GB AWS EC2 instance with a gp2 drive).

What could I improve to make such a query on 200 million nodes much more performant, or work at all?

MATCH (s:Session)
RETURN 
s.countryoforigin AS countryoforigin,
apoc.temporal.format(s.sessionstarttime,"YYYY-'W'ww") as activity_week, 
count(DISTINCT(s)) AS session_count

Currently I am getting a Neo.TransientError.General.OutOfMemoryError

I am running the query from Neo4j Desktop with remote connection to Neo4j 4.1 database.

Neo4j Browser version: [4.1.2]
Neo4j Server version: [4.1.0]

Here is the plan:
[screenshot of the query plan]

Is the 32 GB on that instance your storage or your RAM? Also, what are your memory allocation settings in your config file?

My RAM is 32 GB. I have since increased it to 60 GB. In the config file I did not change anything regarding memory allocation, so it must be using the default allocation. If I remember correctly, I read somewhere in the 4.1 docs that it automatically chooses a suitable split based on your RAM capacity, something like 2/3 to 1/3 (heap to page cache). Am I wrong about that?

Running the following

neo4j-admin memrec --memory=60g --verbose

gives me

neo4j 4.1.0
VM Name: OpenJDK 64-Bit Server VM
VM Vendor: Ubuntu
VM Version: 11.0.8+10-post-Ubuntu-0ubuntu118.04.1
JIT compiler: HotSpot 64-Bit Tiered Compilers
VM Arguments: [-XX:+UseParallelGC, -Xms10g, -Xmx10g, -Dfile.encoding=UTF-8]
# Memory settings recommendation from neo4j-admin memrec:
#
# Assuming the system is dedicated to running Neo4j and has 60.00GiB of memory,
# we recommend a heap size of around 23000m, and a page cache of around 26g,
# and that about 11800m is left for the operating system, and the native memory
# needed by Lucene and Netty.
#
# Tip: If the indexing storage use is high, e.g. there are many indexes or most
# data indexed, then it might advantageous to leave more memory for the
# operating system.
#
# Tip: Depending on the workload type you may want to increase the amount
# of off-heap memory available for storing transaction state.
# For instance, in case of large write-intensive transactions
# increasing it can lower GC overhead and thus improve performance.
# On the other hand, if vast majority of transactions are small or read-only
# then you can decrease it and increase page cache instead.
#
# Tip: The more concurrent transactions your workload has and the more updates
# they do, the more heap memory you will need. However, don't allocate more
# than 31g of heap, since this will disable pointer compression, also known as
# "compressed oops", in the JVM and make less effective use of the heap.
#
# Tip: Setting the initial and the max heap size to the same value means the
# JVM will never need to change the heap size. Changing the heap size otherwise
# involves a full GC, which is desirable to avoid.
#
# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=23000m
dbms.memory.heap.max_size=23000m
dbms.memory.pagecache.size=26g
#
# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
#dbms.jvm.additional=-XX:+ExitOnOutOfMemoryError
#
# The numbers below have been derived based on your current databases located at: '/var/lib/neo4j/data/databases'.
# They can be used as an input into more detailed memory analysis.
# Total size of lucene indexes in all databases: 0k
# Total size of data and native indexes in all databases: 242700m

I set my configurations to

dbms.memory.heap.initial_size=23000m
dbms.memory.heap.max_size=23000m
dbms.memory.pagecache.size=26g

But I still get the same error. I think it is because the data and native indexes together are more than 242 GB.
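As a sanity check (assuming the server has been restarted since the change), the values the running instance actually uses can be listed from Cypher:

// List the effective memory settings of the running server (requires admin privileges)
CALL dbms.listConfig("dbms.memory") YIELD name, value
RETURN name, value;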

Just want to point out that this statement causes a full scan over every Session node (the graph equivalent of a full table scan), and the count(DISTINCT ...) aggregation has to consider the entire row set.

Honestly, this looks like a SQL query to me. Is this really what you need (i.e. the goal)?

Also, at first glance it looks like count(DISTINCT(s)) will very likely always return 1?
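To make the scan point concrete: prefixing the statement with EXPLAIN (a sketch, not your exact plan) should show something like a NodeByLabelScan over :Session feeding an EagerAggregation, i.e. every Session node and both of its properties are read before a single row can be returned:

EXPLAIN
MATCH (s:Session)
RETURN
s.countryoforigin AS countryoforigin,
apoc.temporal.format(s.sessionstarttime,"YYYY-'W'ww") AS activity_week,
count(DISTINCT s) AS session_count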

@Joel thanks for your reply.

This is my attempt at a minimal example of what is inhibiting performance on my side. I will give you the actual full query I am aiming for further below. But first of all, to answer your last question, here is a slightly more constrained example.

MATCH (s:Session {countryoforigin:"CH"})
RETURN 
s.countryoforigin AS countryoforigin,
apoc.temporal.format(s.sessionstarttime,"YYYY-'W'ww") as activity_week, 
count(DISTINCT(s)) AS session_count

It returns something like this:

╒═════════════════╤═══════════════╤═══════════════╕
│"countryoforigin"│"activity_week"│"session_count"│
╞═════════════════╪═══════════════╪═══════════════╡
│"CH"             │"2020-W24"     │xxxx12         │
├─────────────────┼───────────────┼───────────────┤
│"CH"             │"2020-W18"     │xxxx66         │
├─────────────────┼───────────────┼───────────────┤
│"CH"             │"2020-W22"     │xxxx29         │
├─────────────────┼───────────────┼───────────────┤
│"CH"             │"2020-W23"     │xxxx83         │
├─────────────────┼───────────────┼───────────────┤

I crossed out the real results in the session_count column.

My final goal is to make the following query work with, let us say, up to 1 billion Session nodes.

// Weekly Active User Analysis
MATCH (u:User)-[:IS_IN_COHORT]-(c:CohortWeekly) WHERE c.cohort >= "2020-W01"
MATCH (c)-[:IS_IN_COHORT]-(u)-[g:GAVE_UB_CONSENT]-(d:Device)
WITH u,d,c,count(g) AS total
MATCH (u)-[:IS_IN_COHORT]-(c)
MATCH (c)-[:IS_IN_COHORT]-(u)-[g:GAVE_UB_CONSENT]-(d)
MATCH (u)-[:GAVE_UB_CONSENT]-(d)-[:IS_SESSION_OF_DEVICE]-(s:Session)
RETURN 
s.countryoforigin AS countryoforigin, 
c.cohort AS cohort_week, 
c.total_user_count AS total_user_count,
total AS users_with_ub_consent, 
apoc.temporal.format(s.sessionstarttime,"YYYY-'W'ww") as activity_week, 
count(DISTINCT(u)) AS active_users, 
(toFloat(count(DISTINCT(u.userid)))/total)*100 AS active_rate_percent,
count(s) AS session_count
ORDER BY 
countryoforigin,
cohort_week,
activity_week

But since even the simple query does not run well, I have no chance of getting the larger query to run and finish.

I am basically wondering how to write queries where the aggregation seems too big to fit into memory.

I would look for ways to break this down into achievable units of work.

It feels like there might be base-model changes that would help, but there may also be a case here for a smaller graph containing derived nodes/edges.

Have you considered iteratively calculating the counts (e.g. apoc parallel Cypher on groups) and then storing the count results back into the graph (e.g. as a property on a node, quite likely a new summary-level node type)? I could envision a smaller derived graph, quite possibly a completely separate graph (though it could have edges back into the larger graph, if that was useful for other queries...).
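One rough sketch of that idea, using apoc.periodic.iterate to aggregate one country at a time and persist the counts. The WeeklySessionSummary label and its properties are made-up names for illustration, and an index on :Session(countryoforigin) is assumed so the inner match does not rescan the whole label:

// Driving statement: still touches every Session once, but only keeps the
// distinct country values, so the heap footprint stays small.
// Inner statement: aggregates a single country's sessions per transaction and
// writes the weekly counts onto (hypothetical) summary nodes.
CALL apoc.periodic.iterate(
  "MATCH (s:Session) RETURN DISTINCT s.countryoforigin AS country",
  "MATCH (s:Session {countryoforigin: country})
   WITH country,
        apoc.temporal.format(s.sessionstarttime, \"YYYY-'W'ww\") AS week,
        count(s) AS cnt
   MERGE (summary:WeeklySessionSummary {country: country, week: week})
   SET summary.session_count = cnt",
  {batchSize: 1, parallel: false}
);

Once something like that has run, the weekly overview becomes a query over a few hundred small summary nodes instead of a re-aggregation over 200 million sessions.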

Other thoughts that come to mind: did you consider having

  • countryoforigin node?
  • activity_week node?

If you are going to have queries like that, then you are better off modelling the countries and weeks as nodes. As it is now, you are causing too many page faults because you are scanning almost the entire database and trying to keep it in memory. With countries and weeks as nodes you turn this into a traversal problem and can handle the query with far less memory. A rough sketch of what that could look like is below.
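For what it's worth, here is a sketch of that reshaped model and query. Country, Week and the relationship types are assumed names, not something that exists in the current model:

// Assumed model: every Session points at a shared Country node and a shared Week node
//   (s:Session)-[:FROM_COUNTRY]->(:Country {code: "CH"})
//   (s:Session)-[:IN_WEEK]->(:Week {week: "2020-W24"})
// The grouping keys are now anchor nodes, so the query expands from a handful of
// Country/Week nodes instead of reading two properties off every Session node.
MATCH (c:Country)<-[:FROM_COUNTRY]-(s:Session)-[:IN_WEEK]->(w:Week)
RETURN c.code AS countryoforigin,
       w.week AS activity_week,
       count(s) AS session_count

// Per-week totals on their own can even come straight from the stored
// relationship degree, without expanding any Session nodes at all:
MATCH (w:Week)
RETURN w.week AS activity_week, size((w)<-[:IN_WEEK]-()) AS session_count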

Will all of this really help if I want an overview with aggregated/grouped counts over nearly all nodes?
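If the counts are precomputed into summary nodes as sketched above, then the all-nodes overview itself reduces to reading those few summary nodes rather than re-aggregating 200 million sessions; for example, reusing the hypothetical WeeklySessionSummary label from the earlier sketch:

// Read the precomputed weekly counts; this touches only the summary nodes
MATCH (summary:WeeklySessionSummary)
RETURN summary.country AS countryoforigin,
       summary.week AS activity_week,
       summary.session_count AS session_count
ORDER BY countryoforigin, activity_week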