Diagnosing long-running queries

I have a read query (call it Query Y) that has been running for over 8 hours. My dataset is large, but not so large that a query should take over 8 hours. A similar query (call it Query X) completed in 61 minutes just before I started this one.

The query starts by looking up a Person node with a given id (an INDEX is already applied on the id field). It then locates the friends and friends-of-friends of that Person, and goes on to find all Forum nodes those friends are members of, via a hasMember relationship, such that the joinDate property on the relationship is after a given start date. Lastly, it returns the count of all Post nodes that were created by these friends and are contained in those forums.
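Roughly, the query has the following shape. This is only a sketch: the knows, containerOf, and hasCreator relationship types and the parameter names stand in for my actual schema, while Person, Forum, Post, hasMember, and joinDate are as described above.

// Simplified sketch of Query Y (relationship types other than hasMember are placeholders)
MATCH (p:Person {id: $personId})-[:knows*1..2]-(friend:Person)
WHERE friend <> p
WITH DISTINCT friend
MATCH (forum:Forum)-[m:hasMember]->(friend)
WHERE m.joinDate > $startDate
MATCH (forum)-[:containerOf]->(post:Post)-[:hasCreator]->(friend)
RETURN count(DISTINCT post) AS postCount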

All nodes are indexed on their id field.
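For reference, on 3.5 these single-property indexes would have been created and checked roughly like this (repeated for each label):

// Create the index on Person.id (similarly for Forum and Post) and list all indexes
CREATE INDEX ON :Person(id);
CALL db.indexes();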

Here's the query plan with EXPLAIN -

I could not get the PROFILE results because the query never completed.
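For reference, EXPLAIN only plans the query without executing it, whereas PROFILE has to run the query to completion before it can report actual rows and db hits; both are just prefixes on the query, e.g.:

// EXPLAIN plans without executing; PROFILE executes and records actual rows / db hits
EXPLAIN
MATCH (p:Person {id: $personId})
RETURN p;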

When I looked at the debug.log file, I noticed a lot of stop-the-world (STW) pauses sprinkled throughout the execution of these two long-running queries (X and Y). From what I've read, these pauses occur when the JVM's garbage collector pauses all other application threads.

Here is the relevant portion from the debug.log file -

2020-01-07 15:39:11.648+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=180, gcTime=192, gcCount=1}
2020-01-07 15:39:11.824+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=130, gcTime=0, gcCount=0}
2020-01-07 15:49:57.542+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=189, gcTime=212, gcCount=1}
2020-01-07 15:50:01.690+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=119, gcTime=130, gcCount=1}
2020-01-07 16:01:23.829+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=162, gcTime=173, gcCount=1}
2020-01-07 16:21:25.807+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=130, gcTime=139, gcCount=1}
2020-01-07 16:21:25.995+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=0, gcCount=0}
2020-01-07 16:22:28.916+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=125, gcTime=188, gcCount=1}
2020-01-07 16:22:30.558+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=314, gcTime=0, gcCount=0}
2020-01-07 16:25:31.980+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=135, gcTime=144, gcCount=1}
2020-01-07 16:25:32.231+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=169, gcTime=0, gcCount=0}

The complete set of STW logs while the two queries were running is in this github gist, and the complete debug.log file is here. For reference, the relevant logs would be at the very end, on 2020-01-07 (when I executed these two queries).

What can I do in this case? I had already stopped and restarted both queries X and Y when they were taking too long. While query X completed in 61 minutes the second time, query Y ran for over 8 hours and never completed. For comparison, the other queries I execute take only 1-3 minutes on average on this dataset. My memory settings in the configuration file are heap (initial) 4G, heap (max) 8G, and pagecache 6G, based on the recommendations suggested here. I don't have the exact recommendations from the memrec tool, because I'm on JDK 11 and the tool seems to work only with JDK 8.
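For reference, the relevant lines in my neo4j.conf currently look like this (matching the values above):

# Current memory settings in neo4j.conf
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=6g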

I have 16GB of RAM; should I set the heap size (both initial and max) to maybe 12G, based on the suggested config -

The heap memory size is determined by the parameters dbms.memory.heap.initial_size and dbms.memory.heap.max_size. It is recommended to set these two parameters to the same value. This will help avoid unwanted full garbage collection pauses.

  • neo4j version, desktop version, browser version - 3.5.14, 1.2.3
  • what kind of API / driver do you use - Cypher in Neo4j Desktop

Thanks for the debug.log.

A couple of issues.
We recommend setting dbms.memory.heap.initial_size and dbms.memory.heap.max_size to the same value. From your debug.log I see, for example:

2020-01-04 00:08:44.292+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.pagecache.size=6G
2020-01-04 00:08:44.293+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.heap.max_size=8G
2020-01-04 00:08:44.293+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.jvm.additional=-XX:+UseG1GC

so if anything, let's set them to the same value.

As stated in your initial problem description:

I have 16GB of RAM; should I set the heap size (both initial and max) to maybe 12G, based on the suggested config -

and yes, this is evident from:

2020-01-04 00:08:45.884+0000 INFO [o.n.i.d.DiagnosticsManager] Total Physical memory: 16.00 GB

But the size of your graph is at least some 35GB, as evidenced by:

2020-01-01 23:24:53.604+0000 INFO [o.n.i.d.DiagnosticsManager]   Total size of mapped files: 35.74 GB

so it's going to be difficult to get the entire graph into RAM (i.e. into dbms.memory.pagecache.size), given that the graph is more than twice the total RAM of the instance. Neo4j can still operate when the entire graph does not fit in RAM.
From the logs it appears you have set dbms.memory.pagecache.size as follows:

2020-01-04 00:39:58.522+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.pagecache.size=6G

so if 6G goes to the page cache and we set 8G for both the initial and max heap, we would still have about 2G left over for the OS and other processes. But that means only 6G of the ~35GB graph is in the page cache, i.e. less than 20% of the graph is in RAM. If you had more RAM we could increase the page cache and get more of the graph into memory.
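In neo4j.conf terms, a sketch of what I'm suggesting for a 16GB machine would be:

# Sketch for a 16GB machine: 8G heap (initial = max) + 6G page cache leaves ~2G for the OS
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=6g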

Finally, looking at the debug.log you appear to have the default settings for transaction log retention, which let each log grow to roughly 250MB and keep 7 days' worth of these logs. You will see dozens and dozens of them, as reported in the debug.log:

2020-01-04 00:15:29.964+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.77: 2019-12-26T20:28:22-0500 - 250.02 MB
2020-01-04 00:15:29.964+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.78: 2019-12-26T20:31:00-0500 - 250.03 MB
2020-01-04 00:15:29.965+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.79: 2019-12-26T20:32:46-0500 - 250.02 MB
2020-01-04 00:15:29.965+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.8: 2019-12-26T19:31:14-0500 - 250.03 MB
2020-01-04 00:15:29.965+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.80: 2019-12-26T20:33:25-0500 - 250.01 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.81: 2019-12-26T20:33:53-0500 - 250.01 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.82: 2019-12-26T20:34:19-0500 - 250.02 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.83: 2019-12-26T20:34:44-0500 - 250.03 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.84: 2019-12-26T20:35:11-0500 - 250.02 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.85: 2019-12-26T20:35:35-0500 - 250.03 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.86: 2019-12-26T20:36:02-0500 - 250.01 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.87: 2019-12-26T20:36:29-0500 - 250.04 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.88: 2019-12-26T20:36:55-0500 - 250.02 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.89: 2019-12-26T20:37:33-0500 - 250.03 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.9: 2019-12-26T19:31:37-0500 - 250.01 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.90: 2019-12-26T20:37:58-0500 - 250.01 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.91: 2019-12-26T20:38:28-0500 - 250.02 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.92: 2019-12-26T20:40:04-0500 - 250.00 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.93: 2019-12-26T20:40:29-0500 - 250.04 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.94: 2019-12-26T20:40:56-0500 - 250.04 MB
2020-01-04 00:15:29.969+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.95: 2019-12-26T20:41:21-0500 - 250.03 MB
2020-01-04 00:15:29.969+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.96: 2019-12-26T20:42:59-0500 - 250.03 MB
2020-01-04 00:15:29.969+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.97: 2019-12-26T20:43:26-0500 - 250.00 MB
2020-01-04 00:15:29.970+0000 INFO [o.n.i.d.DiagnosticsManager]   neostore.transaction.db.98: 2019-12-26T20:43:52-0500 - 250.01 MB

Although the transaction log retention settings will not hurt or improve query performance, you might want to reconsider the retention duration. Please see Transaction logs in the Operations Manual for details on configuring retention.
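For example, to keep fewer days of transaction logs you could set something along these lines in neo4j.conf (the exact retention value is up to you):

# Keep only 2 days of transaction logs instead of the default 7 days
dbms.tx_log.rotation.retention_policy=2 days
# Each log file still rotates at roughly the default 250MB
# dbms.tx_log.rotation.size=250M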


Thank you for the detailed response. I really appreciate it.

I set the heap size (both initial and max) to 10G and started the query. It's been over two hours and it is still running, but this time I don't see any stop-the-world pause entries in debug.log. The latest entry in debug.log is -

INFO [o.n.k.i.p.PageCacheWarmer] Page Cache warmup completed. 783341 pages loaded. Duration: 40m 30s 860ms.

Regarding the graph size (total size of mapped files: 35.74 GB) -

My initial dataset (csv files) is around 30GB, which might be why the total size of mapped files is 35.74 GB, but the entire database size in Neo4j (after loading all the nodes and relationships) comes to around 240GB. I'm not sure whether that 240GB number should be used for the pagecache size calculation instead of the 35.74 GB.

Either way, I now understand what the pagecache size actually means in terms of how much of the graph fits in memory, thanks to your explanation. I wasn't quite clear on this from reading the docs.

I'll keep the query running this time, and hope it completes at some point. I will keep you posted.

My initial dataset (csv files) is around 30GB, which might be why the total size of mapped files is 35.74 GB, but the entire database size in Neo4j (after loading all the nodes and relationships) comes to around 240GB.

240GB? If so, that is surprising, and if you only have 16GB of total RAM on the instance, then well under 10% of the graph would be in RAM.
How did you determine that the database size is 240GB?
Does that figure include the transaction log files (i.e. neostore.transaction*)? Those should not be counted as part of the database size, since they are not loaded into the page cache, which is why my last response also suggested changing your transaction log retention parameters.

Also -

INFO [o.n.k.i.p.PageCacheWarmer] Page Cache warmup completed. 783341 pages loaded. Duration: 40m 30s 860ms.

This took 40 minutes, which appears a bit slow. Is your graph on SSD drives or a spinning disk?

The 240GB figure is the size of the database folder corresponding to my project in Neo4j Desktop. I got this folder path from the solution recommended here. It is located under neo4jDatabases/database-xxxx-xxxx-xxxx-xxxx, as shown in this screenshot -

This is also the size that used to be shown in Neo4j Browser itself until a few months ago, as you can see in this screenshot -


But after the last few updates in Neo4j Desktop, this size is no longer shown, so I'm guessing they must have removed it.

Since I'm taking the folder size of the entire database-xxxx folder, I think it does include all the transaction log files, indexes, etc.

In fact, I've been meaning to ask this for quite some time now: all my datasets (csv files) expand to about 7 times their size when fully loaded into Neo4j. For instance, my 1GB dataset occupies roughly 7GB of disk space when imported into Neo4j, a 3GB dataset occupies around 23GB, and so on. I'm not sure why this is the case.

The graph is on a spinning disk.

OK... the spinning disk might account for the slower performance, as well as the fact that not much of the graph is in RAM (i.e. in dbms.memory.pagecache.size).

Database size is not simply determined by the total size of the graph.db file system. I could create 10M nodes every day and retain transaction logs for 30 days, versus a system where I create 10M nodes every day but retain transaction logs for only 3 days. Both end up with the same total number of nodes after 30 days, but the graph.db sizes will be greatly different because of the different transaction log retention.

Also, there's some context as to why I'm running the project on a spinning disk (my personal laptop) instead of an SSD. I've been working on my thesis with Neo4j for the last couple of years (since back when Neo4j was at v2), and during this time I've tried running it several times on AWS EC2 with much more powerful instances, like m3.large, etc. But since I was using the Community Edition then, I came across a lot of issues, like db crashes and hangs/freezes, which would require re-running the same query multiple times, or sometimes starting over with importing the data from scratch.

During this time, I've logged some of these issues in the neo4j github repository, like this and this.

Since I'm just a student, using powerful EC2 instances for all of this would have increased the cost dramatically, especially since there was no way of knowing whether the current operation would complete successfully or not.

In fact, people still report issues like extremely long query times or unresponsiveness while using EC2 instances like m5.2xlarge, as seen in this post.

Regarding this, is there some resource I can read up on to understand how to change the transaction log retention parameters effectively?

Thank you once again.