Occasionally getting ServiceUnavailable: Cannot acquire connection

We are running the enterprise edition on an AWS Ubuntu EC2 server. It is a single instance, not a cluster, and has been running for months with no problems. We have recently introduced a system health check which fires every 30 seconds, and together with our normal operations it appears to be fine, except that occasionally, at random times, 2 of the health checks fail in succession, returning a neo4j.exceptions.ServiceUnavailable: Cannot acquire connection to Address(host='[our db]', port=7687).

The next minute, everything is fine again.

I've checked the logs and there are no corresponding errors for the same time, the server remains up, and normal service is immediately resumed for the next check.

We are making the request from another AWS EC2 instance running Django/python with the neo4j driver.

How does your health check work?

@david_allen it's a health check on a load balancer that calls one of our URL endpoints which makes a call to our graph to return a piece of data. It runs every 30 seconds and is absolutely fine for most of the time, as are all the other calls we make to the graph the same way. But occasionally we are getting this error and it concerns me in case it is something that would hamper our scalability.

I'm not understanding. The connection error is on port 7687 which is usually bolt but the way you're describing it, it's hitting some kind of URL which implies HTTP/HTTPS on port 7474 or 7473.

Also, where is the health check deployed? Is it possible that high latency and/or network issues are causing recoverable errors in making a one-time connection to Neo4j?

For general health checks, I'd recommend these endpoints and not a bolt/7687 based approach. Are you using any of these?
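On the recoverable-errors question above: if the check does stay bolt-based, one way to make a single query more tolerant of brief network problems is to run it inside a transaction function, which the driver retries on recoverable failures before giving up. A minimal sketch, assuming the 1.x Python driver; the NEO_URI / NEO_UID / NEO_PWD settings are placeholders and the function names are purely illustrative:

from neo4j.v1 import GraphDatabase
from neo4j.exceptions import ServiceUnavailable

# NEO_URI / NEO_UID / NEO_PWD are placeholders for your own settings
driver = GraphDatabase.driver(NEO_URI, auth=(NEO_UID, NEO_PWD))

def ping(tx):
    # trivial read used purely as a liveness probe
    return tx.run("RETURN 1 AS ok").single()["ok"]

def graph_is_healthy():
    try:
        with driver.session() as session:
            # read_transaction re-runs the unit of work if it hits a
            # recoverable error, before finally raising
            return session.read_transaction(ping) == 1
    except ServiceUnavailable:
        return False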

Thanks for answering, David. The health check is part of our AWS infrastructure: it is the load balancer for our hosted application server. The ELB pings a URL on our application server, which uses the same connection details to make a call to the graph, and it uses bolt, as all our application servers do to make neo4j calls. My concern is why, very sporadically, I'm getting 2 of these calls failing together, maybe only once every 2 days, which means it has successfully executed over 5000 times in between. I'm sure there are better ways to do a health check directly against the neo4j instance, but I'm worried that this error may be caused by open connections, load, or something else that would cause issues as we scale. We are not running a cluster.

I'm not really sure why this would be happening, which is why I'm trying to gather more information about how the health check operates and what your other options for a health check are. The thing is, if your health check is failing but you're reporting that the database is still functional and available at the same time, then this argues that the health check is not doing its job properly, which is why I was seeking to understand it better.

The next thing I would recommend is to try and look at the contents of debug.log and see if you can time isolate when this is happening, and what the machine is chattering to itself in the logs when you encounter this.

@david_allen thanks for this, I really appreciate you taking the time to try to help. I just want to be clear that the health check is an internal application-server-to-graph query, as opposed to a neo4j function. The point is that it's a query from our application server that happens every 30 seconds and fails only around once a day. When I look in the logs near the time this has happened, I see nothing within minutes of when the call is made, and I know that it has fired a number of times successfully within that period.

@david_allen would you happen to know why I would get that error sometimes? Literally, I know that the call worked twice in the moment before and twice in the moment after.

As I said, I'm not entirely sure why this is happening. The details of how you've implemented your health check matter, so it's hard to go further without an investigation of that. If you're an enterprise customer I would be happy to get you connected to a field engineer, or you can submit a support ticket to dive deeper into what's going on.

We are using the enterprise edition through the startup programme.

This is the most relevant bit you've provided so far.

We have others who might be able to help on this forum, but we cannot proceed without more detail:

  1. Some specifics about how you've implemented your health check. The details provided so far aren't sufficient, and I'm not following them, because they mention port 7687 (which is bolt) while you said that an ELB pings a URL (which implies HTTP, on port 7474)
  2. Some detail about how you know that the database is functioning normally while your health check fails. This is critical to understand -- because if you're right that the database is functioning normally (you have said debug.log didn't contain anything, so that seems right to me) -- then a normal debug.log together with a failing health check strongly suggests that the health check is not implemented correctly.

@david_allen I think I mentioned that the health check is our own health check. It is part of our application load balancer, which calls a URL endpoint at our application server (a Django server running on an AWS EC2 instance), which calls a method within our application server that makes a call over bolt to the neo4j instance on a separate server, the same as all our application server methods do. It is not a neo4j health check.

I know that the application server is able to make the connections because they happen 2 per minute successfully most of the time.

Also, I can use our application (a website accessible via a browser) which calls endpoints on our application server that in turn use the same connection settings to make a bolt connection to the graph and return data. This appears to be working as well.

I only mention "health check" because that is what the load balancer at AWS refers to the URL entered that checks every 30 seconds for a valid response at the application server and if it doesn't get one it spins up another instance of our application server. The method called connects in the same way as all our application server to graph server connections do using bolt.

which calls a method within our application server that makes a call over bolt to the neo4j instance on a separate server

Makes which call via which bolt connection? Have you verified how your code is making that connection, and that your connection pooling / driver settings are correct?

I know that the application server is able to make the connections because they happen 2 per minute successfully most of the time

Most of the time -- not all of the time? What happens the other times? When the application server makes these connections, presumably they're doing different cypher queries, possibly via a different connection pool?

I only mention "health check" because that is what the load balancer at AWS refers to the URL entered that checks every 30 seconds for a valid response at the application server and if it doesn't get one it spins up another instance of our application server

Right, got it. This implies that it's the app server failure which causes the load balancer/health check failure and gets your app server spun up again by the ELB, I suppose. Presumably the root cause of that is the cypher connection error.

Without the app code (which I understand you might be reluctant to share on a public forum) you might not get to the bottom of this. But things to look at --

  1. Audit the way your app server uses the driver code: be sure you understand how connection pooling in the driver works, that you're using it appropriately, and that you're not creating additional driver object instances
  2. You may want to investigate Neo4j Metrics (check our docs; you can expose them via Prometheus or via CSV files) -- neo4j itself will give you stats on ongoing connections to the database that you can use to look into whether the connection is failing on the app server side (never arriving at the database) or failing on the neo4j side. A config sketch follows below.
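If you go the metrics route, the relevant settings live in neo4j.conf. A rough sketch for a 3.x enterprise install (setting names can vary between versions, so verify them against the metrics chapter of the operations manual):

# neo4j.conf -- enable the built-in metrics output
metrics.enabled=true
# write metrics as rotating CSV files under the metrics/ directory
metrics.csv.enabled=true
metrics.csv.interval=3s
# ...or expose them to a Prometheus scraper instead
metrics.prometheus.enabled=true
metrics.prometheus.endpoint=0.0.0.0:2004

The bolt connection counters in that output are the ones to watch around the times the health check fails.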

My best guess is that something about the way your app server is written is not using the driver appropriately, but I can't tell without diving into the source.

We are using the neo4j python library; this is how we get a connection and execute a cypher query:

from neo4j.v1 import GraphDatabase
from rest_framework.response import Response  # assumed import: Response looks like Django REST framework's

# The driver (and with it a fresh connection pool) is constructed right
# alongside the query rather than being reused -- see the pooling
# discussion further down.
neo = GraphDatabase.driver(NEO_URI, auth=(NEO_UID, NEO_PWD))

with neo.session() as session:

    res = session.run(
        "MATCH (x)-[co:CLIENT_OF]->(p:Persona {bjid:{pid}}) "
        "WHERE x.bjid={uid} "
        "RETURN co.avatar_base64 AS avatar ",
        {
            'uid': uid,
            'pid': persona_bjid
        })

    for match in res:
        return Response({"status": "OK", "img": match["avatar"] if match["avatar"] else ""})

The problem is not related to the ELB spinning up a new instance of the app server; the health check doesn't fail enough times to make this happen, which is how I know that most of the time it works. When it fails, it fails twice, which means two attempts within 1 minute (30 seconds apart), but this only happens at most once a day, and sometimes it's fine for days.

I don't believe we've got any different connection pooling or method for any of the other cypher queries that get executed successfully.

My concern was that we may be hitting some limit (open connections, connections not closing properly, or long-running queries), but I see nothing else to suggest this in the logs and I'm out of ideas.

I really appreciate your help but I'm pulling my hair out.

Please have a look at the python driver docs and pay close attention to the connection pooling descriptions.

https://neo4j.com/docs/api/python-driver/current/driver.html

If every time you do a check against your Neo4j database you are creating a new driver, this has the effect of creating a whole pool of connections for each check, and you're likely spamming the database with connections you're not using. At some point the server may be getting overloaded with unused connections. Best practice is to reuse driver objects and not recreate them every time.
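For example (a sketch only, reusing the NEO_URI / NEO_UID / NEO_PWD placeholders from the snippet above, with a purely illustrative graph.py module name): the driver is created once at import time and every Django view calls through it, rather than each request constructing its own driver.

# graph.py -- hypothetical module that owns the single shared driver
from neo4j.v1 import GraphDatabase

# Created once when the process starts; the driver manages its own
# connection pool, so every request reuses the same pool of connections.
driver = GraphDatabase.driver(NEO_URI, auth=(NEO_UID, NEO_PWD))

def get_avatar(uid, persona_bjid):
    with driver.session() as session:
        res = session.run(
            "MATCH (x)-[co:CLIENT_OF]->(p:Persona {bjid:{pid}}) "
            "WHERE x.bjid={uid} "
            "RETURN co.avatar_base64 AS avatar ",
            {'uid': uid, 'pid': persona_bjid})
        for match in res:
            return match["avatar"] or ""
        return ""

The view then wraps the result in a Response as before; the only structural change from the earlier snippet is that GraphDatabase.driver(...) is no longer executed per request.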

Thank you @david_allen, that sounds like the sort of thing I need. Is there any way I can see the number of connections/drivers?

Please read the docs. It's specified behind that link; there's a default and it's configurable.
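For reference (argument names as used by the 1.x Python driver; double-check the linked docs for the version you have installed), the pool is sized per driver object and the limits are passed when the driver is built. The values below are illustrative only:

from neo4j.v1 import GraphDatabase

# Placeholders are the same as in the snippets above.
driver = GraphDatabase.driver(
    NEO_URI,
    auth=(NEO_UID, NEO_PWD),
    max_connection_lifetime=3600,        # seconds a pooled connection may be reused
    max_connection_pool_size=50,         # upper bound on connections held by this driver
    connection_acquisition_timeout=60,   # seconds to wait for a free connection from the pool
)

These limits apply per driver object, which is another reason that constructing a new driver per request defeats the pooling.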