High CPU usage, AWS AMI Causal Cluster

I have been using the AWS community Neo4j AMI for a while now, and we finally decided it was time to move to a proper cluster. We applied for the startup program and got accepted.

I proceeded to launch the causal cluster AMI on AWS and then imported the data by doing a restore.
This is where I hit my first stumbling block (I am no Linux expert).
The general procedure I followed was (run the following on all 3 servers):

sudo neo4j stop
sudo neo4j-admin unbind
sudo neo4j-admin restore --from=/var/restore/neo4j/graph.db --database=graph.db --force

Except I could not stop the Neo4j service until I used "sudo systemctl stop neo4j". I continued with the restore and then struggled to start the service. I assumed I would need to use "sudo systemctl start neo4j", but that did not work, whereas "sudo neo4j start" did.

The cluster has been up for just over 2 days now, but I am seeing the CPU usage of all instances sitting at around 75% permanently. This concerns me because of the way AWS bills for usage.

I noticed that the high CPU usage started at the exact moment I imported the data and switched my application across to use the cluster.

Compared to the CPU usage of the single-instance community DB, it seems like the enterprise causal cluster has some huge CPU overhead, unless I have done something horribly wrong (likely the case).

Have you looked at the logs to see if there is any strange repetitive activity going on? I would start with /logs/debug.log. There are a couple of procedures you can run as well to check the health of the cluster.
call dbms.cluster.overview();
That will show you all the members of the cluster, their role and status. It should stay pretty consistent. If you have a member that is constantly dropping and rejoining you could have a connectivity issue between the servers.
call dbms.listTransactions();
That will show you if you have any rogue transactions running. I'm not sure why switching to a cluster would do that, but it's something to check.
I've never used restore to seed a cluster. My typical process is to stop the servers, unbind the existing database, and delete the /data/databases/graph.db directory, then copy the desired graph.db directory into /data/databases. This may be a database restored with neo4j-admin restore, but only run the restore once and copy the resulting directory to each server in the cluster. Make sure that the user that will run Neo4j is the owner of all the files in all the directories under graph.db. Once all the members of the cluster have the exact same copy of graph.db, go ahead and start each one using neo4j start. I would avoid using sudo; instead, create a neo4j user and make sure that it has the permissions it needs to run.
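The ownership requirement above is easy to check before starting anything. Here is a minimal sketch, assuming a Linux host with find available. The function name not_owned_by and the path in the usage comment are illustrative only, so substitute wherever graph.db actually lives on your servers:

```shell
#!/usr/bin/env bash
# List every file or directory under a store directory that is NOT owned by
# the expected user (a user name or a numeric uid). Empty output means the
# ownership is consistent.
not_owned_by() {
  local dir="$1" owner="$2"
  find "$dir" ! -user "$owner" -print
}

# Hypothetical use on one cluster member, before starting Neo4j:
#   not_owned_by /var/lib/neo4j/data/databases/graph.db neo4j
# If anything is listed, hand the directory to the service user
# (the neo4j:neo4j user and group are an assumption about your install):
#   sudo chown -R neo4j:neo4j /var/lib/neo4j/data/databases/graph.db
```

Running the check on every member after copying graph.db gives a quick yes/no answer before the service is started.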
I found this blog useful. http://byteus.tech/neo4j-causal-cluster-backup-and-restore/
HTH

The cluster overview has not changed since I started the cluster up. I also don't see any transactions apart from the actual "call dbms.listTransactions();" itself, which I assume shows up in its own list of transactions.

Checking the debug log, I see this printed out roughly every 4 seconds:

2019-12-03 12:55:51.531+0000 WARN [o.n.i.p.PageCache] The dbms.memory.pagecache.size setting has not been configured. It is recommended that this setting is always explicitly configured, to ensure the system has a balanced configuration. Until then, a computed heuristic value of 6011582464 bytes will be used instead. Run `neo4j-admin memrec` for memory configuration suggestions.

Is there some missing configuration?

With regard to sudo vs. a neo4j user, I am not sure what to do there because I am using the AWS AMI that is distributed by Neo4j. Would this AMI not have set up a user by default already?
If Neo4j had been started by a specific user, would "sudo neo4j stop" not stop that Neo4j, and could I have inadvertently started two competing Neo4j services by running "sudo neo4j start"?

I think it is definitely two instances running.

How do I know if I can safely kill one and allow the other to take over?
Also, I assume that I should leave the one that is running under the 'Neo4j' user and kill the one running as root. Which one is actually serving the DB at the moment?

I managed to fix it.

I stopped all instances, both those running as root and those running as neo4j, and ran the unbind command on all of them.
Then I figured out that the neo4j user did not have permissions on the graph.db folder. I assigned those and then the service started up.

What was actually causing the high CPU usage was that the instance running as the neo4j user was failing to start and kept trying to restart over and over. It could not start because it did not have the required folder permissions.

In my ignorance I had started an instance as root on all the servers, which did have file access, allowing the DB to run and function, but there was still a misbehaving one running as neo4j.

None of the Neo4j log files showed this at all. I ended up using "journalctl -u neo4j" to see why the service was not starting.
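For anyone who lands in the same situation, a quick way to spot the "two instances" problem is to count the Neo4j JVMs per owning user. This is a rough sketch, not something from the Neo4j docs: the function name neo4j_owners is made up, and it assumes the server shows up in ps as a java process with "neo4j" somewhere on its command line.

```shell
#!/usr/bin/env bash
# Count Neo4j JVMs per owning user.
# Reads "USER COMMAND..." lines on stdin (the shape of `ps -eo user=,args=`).
neo4j_owners() {
  # Look only at the command part, so a user merely *named* neo4j does not match.
  # The [j] and [n] bracket trick stops this awk process from matching itself.
  awk '{ user = $1; $1 = ""; cmd = tolower($0) }
       cmd ~ /[j]ava/ && cmd ~ /[n]eo4j/ { count[user]++ }
       END { for (u in count) print u, count[u] }' | sort
}

# Typical use on a cluster member:
#   ps -eo user=,args= | neo4j_owners     # more than one owner listed means competing instances
#   systemctl status neo4j                # is the managed service running or restart-looping?
#   journalctl -u neo4j                   # why the service is failing to start
```

If both root and neo4j appear in the output, one instance was started by hand and the other belongs to the service, which is the situation described above.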

Docs on the Neo4j cloud VMs can be found here: https://neo4j.com/developer/neo4j-cloud-vms/

I believe this covers the use of journalctl to get the logs, and several other facets that are important to this thread.

To form the cluster, did you deploy the marketplace cluster configuration, or did you stand up 3 machines with your own custom cluster configuration?

I used the deploy from the marketplace, with basically no changes at all.
Thanks for your help, you definitely led me in the correct direction.

I am still not 100% sure of the exact commands I should be running to start and stop the Neo4j instances.
With AWS you log in as the ubuntu user, which already has root access (and no password); you authenticate with the key via ssh.

If I understand the documentation correctly, I should be running start, stop, etc. like this:

sudo systemctl stop neo4j

and then the admin commands probably like this

sudo -u neo4j neo4j-admin unbind

The correct commands to restart neo4j are going to be sudo systemctl restart neo4j (or start/stop).

The neo4j-admin unbind command is a very different thing - this is used for removing local cluster state from an individual machine. You should do this if you have a cluster formation problem, but you should not do that on a regular basis, since if you're connecting to the same cluster members over time, you want that state to ensure that all machines have a consistent data set.
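As a rough illustration of how the commands discussed in this thread could fit together for a one-off reseed of a single member, here is a dry-run sketch. The run and restore_member helpers are hypothetical names, and running neo4j-admin as the neo4j user via sudo -u is the assumption raised earlier in the thread rather than something confirmed here. By default it only prints the commands:

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 only once the printed plan looks right for your machine.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"      # show what would be executed
  else
    "$@"           # actually execute it
  fi
}

# restore_member BACKUP_DIR
# One member's reseed, using the service manager for start/stop and the
# neo4j user (assumed) for the admin commands, so files stay owned by it.
restore_member() {
  local backup="$1"
  run sudo systemctl stop neo4j
  run sudo -u neo4j neo4j-admin unbind
  run sudo -u neo4j neo4j-admin restore --from="$backup" --database=graph.db --force
  run sudo systemctl start neo4j
}

# Example (prints four commands, changes nothing):
#   restore_member /var/restore/neo4j/graph.db
```

The point of the sketch is the split: systemctl handles the service lifecycle, and nothing is started directly with "neo4j start" under a different user. As noted above, unbind belongs in a reseed like this, not in routine restarts.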

Thanks,

When running the neo4j-admin commands, does it matter what user I run them as?
I noticed that when just running "neo4j start" the user is important, because it will try to start up an instance running under that user, which is how I got myself into this trouble in the first place :/