Checkpoint trigger killing performance

(Massung) #1

I have a lot of output Spark data that I'm attempting to load via LOAD CSV, and everything appears to go well until a checkpoint is hit, at which point the server performance not only tanks, but it refuses to improve.

Here's some example log output both from my app, which is loading the part files generated by Spark and the debug.log file from Neo4j:

11:59:15 DEBUG Uploading 200 part files...
11:59:20 DEBUG ...46307 nodes & 185228 relationships created (1/200)
11:59:25 DEBUG ...46183 nodes & 184732 relationships created (2/200)
11:59:30 DEBUG ...46363 nodes & 185452 relationships created (3/200)
11:59:35 DEBUG ...46642 nodes & 186568 relationships created (4/200)
11:59:39 DEBUG ...46632 nodes & 186528 relationships created (5/200)

<.. snipped another 75 part files, all similar ..>

12:05:19 DEBUG ...46360 nodes & 185440 relationships created (80/200)
12:05:24 DEBUG ...46662 nodes & 186648 relationships created (81/200)
12:05:29 DEBUG ...46623 nodes & 186492 relationships created (82/200)
12:05:33 DEBUG ...46525 nodes & 186100 relationships created (83/200)
12:05:38 DEBUG ...47153 nodes & 188612 relationships created (84/200)
12:05:42 DEBUG ...46397 nodes & 185588 relationships created (85/200)

<.. Notice the time jump at this point and from here on ..>

12:10:53 DEBUG ...46599 nodes & 186396 relationships created (86/200)
12:11:23 DEBUG ...46255 nodes & 185020 relationships created (87/200)
12:11:52 DEBUG ...46752 nodes & 187004 relationships created (88/200)
12:12:24 DEBUG ...46622 nodes & 186488 relationships created (89/200)
12:13:00 DEBUG ...46863 nodes & 187452 relationships created (90/200)
12:13:26 DEBUG ...46954 nodes & 187816 relationships created (91/200)
12:13:54 DEBUG ...46624 nodes & 186492 relationships created (92/200)
12:14:22 DEBUG ...47058 nodes & 188232 relationships created (93/200)
12:14:48 DEBUG ...46600 nodes & 186400 relationships created (94/200)
12:15:16 DEBUG ...46621 nodes & 186484 relationships created (95/200)
12:15:51 DEBUG ...46261 nodes & 185040 relationships created (96/200)
12:16:19 DEBUG ...46848 nodes & 187392 relationships created (97/200)
12:16:53 DEBUG ...46785 nodes & 187140 relationships created (98/200)
12:17:43 DEBUG ...46810 nodes & 187236 relationships created (99/200)

<.. Performance continues to get worse and worse ..>

And here's the debug log (note: time zone differences mean the debug log is +600 hours):

2019-02-26 18:04:16.625+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=116, gcTime=112, gcCount=1}
2019-02-26 18:04:44.788+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=104, gcTime=115, gcCount=1}
2019-02-26 18:04:48.809+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.transaction.db.3252] version=3252, last transaction in previous log=664880
2019-02-26 18:05:07.448+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=122, gcTime=127, gcCount=1}
2019-02-26 18:05:21.127+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=111, gcTime=113, gcCount=1}
2019-02-26 18:05:25.886+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.transaction.db.3253] version=3253, last transaction in previous log=664888
2019-02-26 18:05:52.986+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 664891 checkpoint started...
2019-02-26 18:05:53.448+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 664891 to [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.counts.db.b], from [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.counts.db.a].
2019-02-26 18:10:45.076+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 664891 checkpoint completed in 4m 52s 89ms
2019-02-26 18:10:47.428+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 2857-2863, last checkpoint was made in version 3253
2019-02-26 18:12:56.884+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.transaction.db.3254] version=3254, last transaction in previous log=664896

Note how at 18:05:52 in the debug.log file is when the checkpoint occurs, and at the same time in my output the part files go from taking 4-5 seconds per to 5 minutes (!!) for the next one (okay, I can chalk that up to the checkpoint), and from then on start taking 20, 30, 40+ seconds per part file. If I let it keep running in the hopes that the performance will eventually improve, it absolutely does not (I've let it go for a solid 8 hours and it never finishes all 200 part files). This is horrid as it only took ~6 minutes to upload the first 85 part files.

In addition, any other work thrown at the graph (simple queries) can take a long time to return, and I end up having to stop and restart the server in order to get things reasonable again.

This is using Neo4j 3.5.1 community, running on an EC2 instance w/ 16 cores and 64 GB of RAM.

One final note: before the checkpoint hits, using htop I can see that all 16 cores are regularly being utilized. After the checkpoint, all concurrency appears to completely halt.

What is going on here? Is this a problem with Neo4j or is there some configuration setting I should be looking at?

Thanks for any insights.

(Stefan Armbruster) #2

I guess you should run checkpointing more frequently. Default is every 900s, maybe reduce that to 60s to see more, but less invasive checkpointing. There's a good knowledge base article at https://neo4j.com/developer/kb/checkpointing-and-log-pruning-interactions/

Note that enterprise edition has better strategies for checkpointing. In EE you can also limit IOPS being used for checkpointing https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.checkpoint.iops.limit

You can also check if you can increase IOPS for your volume in AWS dashboard.

(Massung) #3

I appreciate the comment, but this isn't an IOPS issue from everything I can see. I have a CloudWatch dashboard setup and write throughput never break 1 MB/s and the permitted throughput to the EFS is 105 MB/s.

When the checkpoint hits, the CPU usage drops from 10% down to < 1% and pretty much just sits there. Assuming that the checkpoint takes a bit to process (let's be extremely liberal and say 5 minutes), the read/write throughput combined never breaks 100 K/s from then on.

Progress is still being made, but it's just abysmally slow and there is no apparent reason for it. It's not attempting to read/write anything (of significant size), the CPU isn't busy doing anything, RAM isn't close to fully utilized, and it's downloading less from S3 than it was previously.

That said, after watching everything for the past 3 hours more closely, the following exceptions are now being thrown periodically in the debug log:

2019-02-26 21:37:41.892+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0x58433a9d, L:/10.0.0.122:7687 - R:/69.173.127.108:48705] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
        at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
        at io.netty.channel.unix.Errors.ioResult(Errors.java:146)
        at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:178)
        at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:343)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:785)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:421)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:321)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
        at java.lang.Thread.run(Thread.java:748)

I didn't notice these before, but I was just grabbing the end of the log for the post after killing it when it hit the issue. This time I've been watching it with tail -f and let it go longer just to confirm I wasn't being throughput throttled by Amazon.

I assume those are my connections via Scala issuing the query and then it taking too long to reply with the result, but I'm not 100% sure on that.

Update

As I was putting together a couple thumbnails of graphs to show the server flat-lining, life started breathing back into it. But this was close to 30 minutes after flat-lining. It's still much slower than the 4-5 seconds per part file (averaging 30 seconds per), but that's better than the minutes it previously was.

I want to be clear that I'm not looking for someone to diagnose hardware/IOPS issues.

I'm trying to say that - given all the data I have - it doesn't look to be an infrastructure problem as opposed to a Neo4j issue: it just stopped doing what I thought it should be doing. And I have no way (that I know of) to inspect and see what that is doing. It wasn't writing to disk, reading, downloading, swapping RAM, or churning away on the CPU doing some big operation. It just appeared to stop.

I'd very much like to understand what was going on so either I - or the Neo4j team - can fix it. Having the DB appear to just "stop" for 30+ minutes for no apparent reason is no good.

(Andrew Bowman) #4

It would help to know more about your LOAD CSV query. Do you have indexes in place on the relevant labels/properties that you're using for the load? If not, that would easily explain the escalating time to process subsequent loads, as well as the read timeouts.

(Massung) #5

Here's the query in all it's (simple) glory:

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM '$part' AS r
FIELDTERMINATOR '\t'

// lookup the analysis node previously created with $id
MATCH (q:Analysis) WHERE ID(q)=$id

// die if the phenotype or ancestry doesn't exist (uniquely indexed by name)
MATCH (p:Phenotype {name: r.phenotype})
MATCH (a:Ancestry {name: r.ancestry})

// create the variant node if it doesn't exist (uniquely indexed by name)
MERGE (v:Variant {name: r.varId})
ON CREATE SET
  v.chromosome=r.chromosome,
  v.position=toInteger(r.position),
  v.reference=r.reference,
  v.alt=r.alt

// create the frequency result node
CREATE (n:Frequency {
  eaf: toFloat(r.eaf),
  maf: toFloat(r.maf)
})

// create the relationship to the analysis node
CREATE (q)-[:PRODUCED]->(n)

// create the relationship to the trait, ancestry, and variant
CREATE (p)-[:HAS_FREQUENCY]->(n)
CREATE (v)-[:HAS_FREQUENCY]->(n)
CREATE (a)-[:HAS_FREQUENCY]->(n)

The indexes are properly created. The nodes being looked up are already indexed (by name):

> call db.indexes() yield description, properties where properties[0] contains "name" return description, properties

╒═══════════════════════════╤════════════╕
│"description"              │"properties"│
╞═══════════════════════════╪════════════╡
│"INDEX ON :Ancestry(name)" │["name"]    │
├───────────────────────────┼────────────┤
│"INDEX ON :Gene(name)"     │["name"]    │
├───────────────────────────┼────────────┤
│"INDEX ON :Phenotype(name)"│["name"]    │
├───────────────────────────┼────────────┤
│"INDEX ON :Variant(name)"  │["name"]    │
└───────────────────────────┴────────────┘
(Andrew Bowman) #6

Firstly, try upgrading to 3.5.3. There have been several bugs fixed from the earlier releases, we'll want to rule out anything that's already been fixed.

Next, let's see if dialing back the batch size to 10k can work better. Per row you're performing a max of 3 MATCHes, a MERGE (plus properties), a node CREATE, and a CREATE of 4 relationships. 6 new nodes/rels being created, * your batch size, = 300k items created per batch when your batch size is 50k, but if we dial it to 10k then you have 60k items created per batch, which may be more manageable (depending on your available memory and configuration).

You may also want to EXPLAIN your query and add the query plan (with all elements expanded), there could be a clue in there as to what's going on.

(Massung) #7

Andrew,

I'll try the upgrade and see what happens (post to follow later). I don't see anything in the release notes that would explain this, but can't hurt to check.

As for the batch size, it used to be at the default of 1K with the same issues. I increased it to 50K hoping it could do a bigger disk write at one time as opposed to lots of small writes; again, not knowing the internals or being able to see what's happening, I was grasping at straws. But, I'll reduce it for the test.

Attached is the explained plan (looks pretty clean to me).

(Massung) #8

After upgrading from 3.5.1 to 3.5.3, the same set of part files took only 25 minutes to fully upload.

During that time period there were 3 checkpoints that triggered and the CPU and IOPS profiles looked exactly as I would have expected them to.

It would be nice to know what - exactly - from 3.5.2 or 3.5.3 fixed this issue (otherwise I'm left wondering if it's just a one-off success from the server having been restarted), but so far I'm very pleased with the results. I'm going to let it run over night uploading more of the data and see how it does.

Update

It's definitely doing better than it was, but still has periods of complete flat-lining. It appears to recover, but each checkpoint appears to - on average - be a 15 to 20 minute affair before concurrency and "processing" picks back up.

I would fully expect that it would be reading, loading, and processing as much in RAM as possible up to the checkpoint, which would mean little-to-no disk access and good CPU usage. When the checkpoint hits, write IOPS should jump and CPU should lower until the checkpoint is complete. And that seems to play out. But, again, once it appears that the checkpoint has completed (signaled by writes dropping back to near 0 MB/s), the processing of the LOAD CSV (and any other queries) fail to pick back up again, and this is where the concern lay.

Again, 3.5.3 seems to improve things, it's just not where I would expect it to be. And I would still very much like to know what what addressed (in 3.5.2 or 3.5.3) that improved things. Mostly because I need to prove to myself that this isn't going to pop up again as a major roadblock in the future once we've started loading 10x the amount of data.

Thanks!

(Tonny008) #9

3.5.4 still has this problem. I use periodic.iterate Load CSV to daily update neo4j. the cypher is very simple. as the loaded data increase the checkpoint time increase, it could take hours per checkpoint. and then the server become unresponsive.