How do I find and fix unpredictable performance and CPU utilization in Neo4J Enterprise?

I'm seeing HUGE changes in performance for the same query while running on a dedicated CentOS 7 system with adequate RAM, CPU, and disk. I'd like guidance in discovering the cause and how to fix it.

I'm running neo4j-enterprise v4.4.3.1. The platform is a CentOS 7 system running on a dedicated AWS EC2 "t2.large" instance. I start with the same db dump (created with neo4j-admin) and add the same data using the same queries.

When I say "huge changes in performance", I mean that the same insertion job (a sequence of queries) sometimes finishes in 2903 seconds (just over 48 minutes) and sometimes takes 15792 seconds (about 4 hours and 23 minutes). That's a factor of more than FIVE!

This query has been running daily on earlier versions for more than a year, and always runs in about 45-50 minutes. In almost 2 years of experience with Neo4J enterprise, this is the first time I've seen anything like this.

I see no difference in the log files between a fast and slow run. I use Python scripts to run this, so it's being run with the same command line, the same database, and the same data each time.

The RAM utilization seems fixed at 75% (according to top) while Neo4J is running. I'm accustomed to that.

On fast runs, the CPU utilization (according to top) is often between 40% and 100%. On slow runs, it is often 3-4% and never above 10%.

At the moment, it is in the middle of another run and is already slow. I know, from examining the console output, that it is in the middle of a neo4j-intensive query. Here is what the first lines of top show while this query is running:

# top
top - 16:57:17 up  8:01,  3 users,  load average: 1.04, 1.08, 1.18
Tasks: 211 total,   1 running, 210 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.3 sy,  0.0 ni, 50.3 id, 49.0 wa,  0.0 hi,  0.0 si,  0.2 st
KiB Mem :  8007412 total,   131620 free,  6417452 used,  1458340 buff/cache
KiB Swap:  2097148 total,  1488380 free,   608768 used.  1328000 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                        
 6468 root      20   0 8633976   5.6g   4944 S   2.7 73.2  21:53.56 java                                           
   38 root      20   0       0      0      0 S   0.3  0.0   2:05.73 kswapd0                                        
    1 root      20   0  125744   1600    896 S   0.0  0.0   0:02.23 systemd                                        
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd  

Neo4J is the only Java app installed on the system. It's using only 2.7% of the CPU -- no wonder it's glacially slow!
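(Editor's note for anyone hitting the same symptom: the telling number in that top output is the 49.0 "wa" iowait figure, not the low java CPU number. Here is a minimal Python sketch of how that percentage is derived from the kernel's /proc/stat counters; the jiffy counts below are made up to mirror the top line above.)

```python
# A minimal sketch of how the "wa" percentage in top is derived from a
# /proc/stat "cpu" line. Field order per proc(5):
# user nice system idle iowait irq softirq steal ...
def iowait_percent(cpu_line: str) -> float:
    fields = [int(x) for x in cpu_line.split()[1:]]
    return 100.0 * fields[4] / sum(fields)  # fields[4] is iowait

# Made-up jiffy counts mirroring the top line above (50.3 id, 49.0 wa):
sample = "cpu 20 0 30 5030 4900 0 0 20"
print(round(iowait_percent(sample), 1))  # → 49.0
```

A box that is ~49% iowait is disk-bound, which is exactly what the rest of this thread goes on to confirm.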

According to show transactions (in the Neo4J browser), there are two transactions active:

query-2605:

"call apoc.periodic.iterate("MATCH (target:FIPSItem)<-[:FEATURE]-(dp:Datapoint {datasetID: 'nytimes.20220112.us-counties'})
WHERE
  dp.dailyCaseCount IS NOT NULL AND
  dp.dailyDeathCount IS NOT NULL AND
  dp.cumulativeDeathCount IS NOT NULL AND
  dp.cumulativeCaseCount IS NOT NULL
  RETURN target, dp AS dpTarget
", "MATCH (target)-[:ADJOINS]-(neighbor:FIPSItem)<-[FEATURE]-(dpNeighbor: Datapoint {datasetID: 'nytimes.20220112.us-counties', pertainsDate: dpTarget.pertainsDate})
WHERE
  dpNeighbor.dailyCaseCount IS NOT NULL AND
  dpNeighbor.dailyDeathCount IS NOT NULL AND
  dpNeighbor.cumulativeDeathCount IS NOT NULL AND
  dpNeighbor.cumulativeCaseCount IS NOT NULL
WITH
  dpTarget,
  stDevP(dpNeighbor.cumulativeCaseCount - dpTarget.cumulativeCaseCount) AS cumulativeCaseCountEdgeIntensity,
  stDevP(dpNeighbor.cumulativeDeathCount - dpTarget.cumulativeDeathCount) AS cumulativeDeathCountEdgeIntensity,
  stDevP(dpNeighbor.dailyCaseCount - dpTarget.dailyCaseCount) AS dailyCaseCountEdgeIntensity,
  stDevP(dpNeighbor.dailyDeathCount - dpTarget.dailyDeathCount) AS dailyDeathCountEdgeIntensity

  MERGE (analysisResult: AnalysisResult {analysisResultID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906.' + dpTarget.datapointID })
    SET
      analysisResult.datapointID = dpTarget.datapointID,
      analysisResult.cumulativeCaseCountEdgeIntensity = cumulativeCaseCountEdgeIntensity,
      analysisResult.cumulativeDeathCountEdgeIntensity = cumulativeDeathCountEdgeIntensity,
      analysisResult.dailyCaseCountEdgeIntensity = dailyCaseCountEdgeIntensity,
      analysisResult.dailyDeathCountEdgeIntensity = dailyDeathCountEdgeIntensity
    WITH dpTarget, analysisResult
      MATCH (analysisRun: AnalysisRun {analysisRunID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906'})
        MERGE (analysisResult)-[:ANALYSIS_RUN]->(analysisRun)
        MERGE (dpTarget)-[:ANALYSIS_RESULT]->(analysisResult)
", {batchSize:500})"

and

query-2814:

"UNWIND $_batch AS _batch WITH _batch.target AS target,_batch.dpTarget AS dpTarget  MATCH (target)-[:ADJOINS]-(neighbor:FIPSItem)<-[FEATURE]-(dpNeighbor: Datapoint {datasetID: 'nytimes.20220112.us-counties', pertainsDate: dpTarget.pertainsDate})
WHERE
  dpNeighbor.dailyCaseCount IS NOT NULL AND
  dpNeighbor.dailyDeathCount IS NOT NULL AND
  dpNeighbor.cumulativeDeathCount IS NOT NULL AND
  dpNeighbor.cumulativeCaseCount IS NOT NULL
WITH
  dpTarget,
  stDevP(dpNeighbor.cumulativeCaseCount - dpTarget.cumulativeCaseCount) AS cumulativeCaseCountEdgeIntensity,
  stDevP(dpNeighbor.cumulativeDeathCount - dpTarget.cumulativeDeathCount) AS cumulativeDeathCountEdgeIntensity,
  stDevP(dpNeighbor.dailyCaseCount - dpTarget.dailyCaseCount) AS dailyCaseCountEdgeIntensity,
  stDevP(dpNeighbor.dailyDeathCount - dpTarget.dailyDeathCount) AS dailyDeathCountEdgeIntensity

  MERGE (analysisResult: AnalysisResult {analysisResultID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906.' + dpTarget.datapointID })
    SET
      analysisResult.datapointID = dpTarget.datapointID,
      analysisResult.cumulativeCaseCountEdgeIntensity = cumulativeCaseCountEdgeIntensity,
      analysisResult.cumulativeDeathCountEdgeIntensity = cumulativeDeathCountEdgeIntensity,
      analysisResult.dailyCaseCountEdgeIntensity = dailyCaseCountEdgeIntensity,
      analysisResult.dailyDeathCountEdgeIntensity = dailyDeathCountEdgeIntensity
    WITH dpTarget, analysisResult
      MATCH (analysisRun: AnalysisRun {analysisRunID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906'})
        MERGE (analysisResult)-[:ANALYSIS_RUN]->(analysisRun)
        MERGE (dpTarget)-[:ANALYSIS_RESULT]->(analysisResult)
"

There are indices defined on all reference property names, constraints defined on the various IDs (the code generates a unique ID for each instance of a labeled node), and I've defined label and node indices according to the documentation.

I've seen references to "metrics" in the new documentation, but I don't see how to connect those metrics to what I'm seeing. So far as I know, I've not turned on any of that.

I dialed back the batch size of the above query to 500 to see if I was somehow running out of memory.

Why is the CPU utilization so low? Why do these queries sometimes take so long?

I've been chasing this since last week, and I've run out of things to look at. I invite guidance about what to instrument and how to do so.

Is there some combination of neo4j.conf settings that restricts CPU utilization? What could make Neo4J spend so little time working when there is so much to be done?

This is a show-stopper.

My next step -- tomorrow morning -- is to revert to an earlier version of Neo4J.

I run behind enough firewalls (on AWS EC2) that I'm frankly ready to go back to whatever version came before all these "log4j" changes. Neo4J has been solid, reliable, and robust for years.

It may be time to open a ticket on this. I've gone as far as I can on my own.

It seems that something I've done in my recent changes has changed the disk usage of Neo4J.

As a result, AWS is throttling the disk IO. That means, in turn, that the system is spending ~50% of its time waiting for disk IO while these Neo4J queries are running. That's why the CPU utilization is so low, and that's why these queries take so long.

I'm now looking for guidance about how to use the metrics to predict the desired IOP thresholds for AWS.

Hello,
You can build customized dashboards for the metrics you choose on a cloud vendor like AWS. There you can track the volume of requests that were successful.
Thank you,
Sameer Sudhir Gijare

Perhaps this is a feature request.

I now have the metrics for volume utilization from AWS. That gives me a historical record of the aggregate usage for everything on the EC2 instance.

I now know that the problem is that when I run these Neo4J queries (the ones I've spent the last month optimizing), I now hit the IOPS limit for the AWS EBS volumes used by the EC2 instance that runs Neo4J.

I did not see this issue between April 2020 when I started using Neo4J (on the same AWS EC2 instance) and early January 2022. I've been making many changes to my Neo4J queries, so that's the most obvious source. Meanwhile, there have also been several versionID upgrades for Neo4J itself.

Whatever the history, it appears that my current alternatives are:
a) Increase the IOPS limit for the volumes, or
b) Manage or control the disk utilization of Neo4J

I'd like metrics from Neo4J that let me see the disk IO for Neo4J. It would be especially useful if I could somehow see, for example, the disk usage -- bytes loaded, bytes stored, latency, whatever -- for a given Neo4J transaction ID.

The documentation on Neo4J metrics provides essentially a dictionary view of the 150 different metrics collected by Neo4J. It provides little insight into how those 150 metrics relate to my specific question ("What is the disk utilization of query XYZ?"). The documentation offers guidance about how to connect the metrics to two different third-party monitoring tools, but each of those has an apparently lengthy learning curve. It isn't clear to me that any of these metrics are actually relevant to my question.

If I'm going to pursue option A above, then I'd like to know how much I need. I start with 3K IOPS (apparently), and that's not enough. Do I need 4K? 10K? 100K?

I'd like to have something more concrete than trial-and-error.

@tms
relative to IOPS... see Disks, RAM and other tips - Operations Manual

In a perfect world, if the graph sits entirely in memory, i.e. within dbms.memory.pagecache.size, then most queries, specifically READ queries, don't interact with disk at all but get data directly from RAM.
However, for WRITE queries, as yours are (they involve MERGE statements), we write to disk at checkpoint time. A checkpoint is triggered by default every 15 minutes or every 100k transactions, whichever comes first. Both thresholds can be configured; see Configuration settings - Operations Manual and the 5 following parameters.
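For reference, the two defaults above map to these neo4j.conf settings (names per the 4.x Operations Manual; worth verifying against your exact version):

```properties
# Checkpoint at most every 15 minutes (default)
dbms.checkpoint.interval.time=15m
# ...or after 100k transactions, whichever comes first (default)
dbms.checkpoint.interval.tx=100000
```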

This can be understood further from logs/debug.log:

grep -i "checkpoint triggered" debug.log

will report similar to

2022-02-03 19:47:57.039+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] [neo4j/28d4319b] 
Checkpoint triggered by "Scheduled checkpoint for every 15 minutes 
threshold" @ txId: 3 checkpoint completed in 170ms

and specifically: is your "checkpoint completed in XXXms" figure significantly long?
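To pull those durations out programmatically rather than eyeballing the grep output, here is a small Python sketch (the regex assumes the 4.4 log wording quoted above; adjust for your version):

```python
import re

# Extract checkpoint durations (in ms) from debug.log lines like the one
# quoted above, so unusually slow checkpoints stand out.
def checkpoint_ms(lines):
    pat = re.compile(r"checkpoint completed in (\d+)ms")
    return [int(m.group(1)) for line in lines for m in pat.finditer(line)]

sample = ['2022-02-03 19:47:57.039+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] '
          '[neo4j/28d4319b] Checkpoint triggered by "Scheduled checkpoint for '
          'every 15 minutes threshold" @ txId: 3 checkpoint completed in 170ms']
print(checkpoint_ms(sample))  # → [170]
```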

I upgraded my EC2 instances to t3.large/t3.xlarge. This allows "EBS Optimized" volumes. I upgraded my volumes from gp2 to gp3, and temporarily gave them very high IOPS and throughput settings.

I see dramatically shorter total query times. I still see the "wa" time (as reported by sar and top) in the vicinity of 50% while these queries are running.

I want to be sure we don't talk past each other.

When I run a plain Python task, accessing big files (.csv), I see CPU utilizations in excess of 100% (because there are multiple cores) and wait times of 0 or nearly 0.

I'd like some guidance -- if possible -- about how to predict and/or manage disk utilization while Neo4J is active. It's ok if that's hard, I understand.

I can't throw unlimited disk resources (IOPS and bandwidth) at the instance running Neo4J. I have complete control over the write queries. I'm trying to understand just how big this "bag" has to be.

Maybe I can help a little, but not quite in the way you are thinking.

The only reliable way to predict disk usage is by forecasting the number of nodes you'll read/write to. I'm not sure how to do that.

What you can do, is re-frame the problem into reducing disk IO. I think that's what Dana was trying to get at. If you're running into disk IO limits, then your MERGE is reading and writing a lot of nodes. Running this against an in memory graph via GDS will turn that into one set of reads to create the in-mem named graph, and one set of writes when it's all done.

For my money, if you've got Enterprise and you're running into a very expensive artificial VM hardware limit, maybe run it locally and save yourself the headache? $1200 could get you a good rig with 64 GB RAM and 10 TB of storage with no IOPS limits.

You could also try it in Aura.

Heh -- by "very expensive", I meant hundreds of dollars per month rather than tens.

I've already got extremely powerful local hardware, well in excess of the limits you describe.

I guess I have to just accept that there is no way to forecast the number of nodes the query will write to. That's unfortunate, because the data itself is reasonably well constrained.

In this case, an "ingestor" collects data from three sources once a day. Each source has one record per day per US county, going back to early 2020. There are about 3,300 counties and about 700 days (by now). The most frequent operation is an "incremental" ingest that adds 2 days of data -- that's about 7K data records per source. So figure 21K records per day.

I have a reasonable handle on the schema. Each record will add on the order of ten labeled nodes, the same number of labeled relationships, and on the order of 10 properties per labeled node. There are indices on each labeled node, labeled relationship, and property.

So I think I can get to a ballpark number by multiplying those out: 2 days * 3 sources * 3,300 counties is about 20,000 records -- call it 21,000. That gives about 210,000 new labeled nodes, about 210,000 new labeled relationships, and about 2,100,000 new property values.

I'd like guidance about what IOPS and throughput those metrics imply for a given day's incremental run. It appears that AWS will apply limits to each -- so the task remains to understand what those limits should be based on the metrics I've described.
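To make that multiplication concrete, here is a back-of-envelope Python sketch (my numbers, not official guidance). The per-record byte sizes are assumed approximations of the standard Neo4j store format, and the math ignores indexes, the transaction log, and read traffic, so treat the result as a floor rather than a forecast:

```python
# Rough write-volume estimate for one incremental ingest, using the
# counts from the post above. Byte sizes per store record are assumptions.
records = 21_000          # ~7K records per source x 3 sources
nodes = records * 10      # ~10 labeled nodes per record
rels = records * 10       # ~10 labeled relationships per record
props = nodes * 10        # ~10 properties per labeled node

NODE_B, REL_B, PROP_B = 15, 34, 41    # assumed bytes per store record
raw_bytes = nodes * NODE_B + rels * REL_B + props * PROP_B
print(f"{raw_bytes / 1e6:.0f} MB of raw store writes")       # → 96 MB

# Writes hit disk in 8 KB pages, so the page count gives a crude lower
# bound on the write IOPS needed to finish within a target window.
pages = raw_bytes // 8192 + 1
print(f"~{pages} page writes; to finish in 60 s, ~{pages // 60} write IOPS")
```

Even this floor-level estimate (~100 MB, tens of thousands of page writes per run) shows why a gp2 volume's baseline IOPS can become the bottleneck.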

@tms out of curiosity how have you defined conf/neo4j.conf and parameters

dbms.memory.pagecache.size
dbms.memory.heap.init_size
dbms.memory.heap.max_size

How large is your graph? On startup we write a line similar to

[neo4j/961926cc]   Total size of mapped files: 51.00GB

for example. This value is roughly how large you should configure dbms.memory.pagecache.size.

If you run

egrep "dbms.memory|Total Physical memory|Total size of mapped" debug.log

what does this return?

From neo4j.conf:

dbms.memory.pagecache.size=8m
dbms.memory.heap.initial_size=2000m
dbms.memory.heap.max_size=2000m

Now that you mention it, I reconfigured the AWS EC2 instance after I ran memrec to get those values. Here is the new memrec output:

dbms.memory.heap.initial_size=5000m
dbms.memory.heap.max_size=5000m
dbms.memory.pagecache.size=6600m
dbms.jvm.additional=-XX:+ExitOnOutOfMemoryError

I've just updated neo4j.conf with these new parameters. I'll update this thread in a few days when I've had a chance to gather updated information. I'll post the results of the egrep at that time.

I appreciate your help, and have a great weekend!