Understanding Load Balancing In Neo4j Causal Cluster Enterprise

Hello everyone,
I'm here because my Neo4j Enterprise cluster doesn't behave as I expected.
As I'm working on Google Cloud Platform, I used the community Helm chart from GitHub to easily deploy Neo4j on a private GKE cluster.
NOTE: I just saw that the GitHub repository was updated ~20 days ago and that Neo4j now publishes its own Helm charts; I will try to use them.
Here is the problem. We use Neo4j in the following situation:

  • Lots of modules (from 0 to 500), written in Python, which update or read Neo4j
  • A web front end which can:
    • Retrieve data from our Neo4j cluster
    • Add some data
    • Launch the modules

The Neo4j cluster is a standard minimal one: 3 cores and 1 read replica. All of this is accessed through a Google Cloud internal load balancer.

Here is how the modules use the Neo4j driver:
Connection:

from neo4j import GraphDatabase

class Neo4j:
    def __init__(self):
        url = "neo4j://GCP_LOAD_BALANCER:7687"
        user = USER
        password = PASSWORD

        self.driver = GraphDatabase.driver(url, auth=(
            user, password), max_connection_pool_size=1000)

    def close(self):
        self.driver.close()

Query:

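import logging

from neo4j import exceptions as ne, unit_of_work

@unit_of_work()
def _unit_function(tx, query, params):
    try:
        result = tx.run(query, params)

        return True
    except ne.TransactionError as error:
        logging.error(f"error in _unit_function: {error}")
        raise

def example_function(self, query, params):
    # `neo4j` is an instance of the Neo4j class shown above
    with neo4j.driver.session() as session:
        # read_transaction asks the driver for a READ-mode transaction
        result = session.read_transaction(_unit_function, query, params)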

Sadly, according to the Neo4j and GCP metrics, only one or at most two members of the cluster are being used. The main one is the leader, which accepts both READ and WRITE transactions even though the other members are idle.
I don't understand why connections aren't routed to the least-used member.
WRITEs are fine, since only the leader can accept those transactions, but why do READ transactions go straight to the current node even when there are better options? Queries can then take up to 6 minutes because the leader handles them all while the other members do nothing.

Is there something wrong with the base configuration of neo4j? Server-side routing is enabled, and each member is designated as a SERVER for routing.

Any ideas?

Thanks a lot, and happy new year to everyone :blush:

You should upgrade to the new Helm charts and stop using the deprecated neo4j-helm repository on GitHub. That's my first and most important suggestion.

If you want to stick with the old charts, you need to read the instructions that deal with exposing Neo4j clusters externally through load balancers. There's a lot of information here: External Exposure of Neo4j Clusters - Neo4j-Helm User Guide

In general, based on Neo4j's architecture, with the older helm charts it is not safe to use load balancers in front of them unless you follow the above instructions. This constraint does not apply to the newer helm charts, which is why I'm suggesting you migrate.

Hello David,

Thanks for replying. I will deploy using the new charts to see if it helps :slight_smile:
I will update the post accordingly!

But the Neo4j external exposure guide uses an external load balancer (which has an external IP and redirects internet traffic to our pool/instance). I suppose an internal load balancer (which only redirects traffic to Google Cloud instances) should work too?

Good day!

It depends on what kind of traffic you're load balancing. With the old Helm charts, it doesn't matter whether the LB is internal or external: if you're trying to LB or proxy Bolt traffic on port 7687, you're likely to run into trouble because of how the client protocol works. Using an LB for HTTP traffic would be OK in most cases.

What do you mean by "trouble"? Real trouble like connection errors, or routing/load balancing not working properly?

I am currently trying to install a cluster using the "new" charts, but the old ones were far more practical for Helm newcomers. For some reason the new Neo4j charts fail to form a cluster when I use my custom Kubernetes serviceAccount, while there is no problem if I don't give any serviceAccount name. The documentation says the charts won't create a serviceAccount with the given name, yet they still create one if none is given?

On another note, I modified the internal load balancer to redirect traffic to both the cores and the read replica and observed a better distribution of queries between them (by better I mean that every member, core or replica, is now used at some point). Still, in my opinion the leader is handling too many reads while it is also writing.

To my understanding, the drivers should connect to the least busy member of the cluster? What determines which member is "busy"? Is there a usage threshold or a metric that justifies choosing one member over another?

By "Trouble" what I mean is that the issue is complicated; a load balancer takes traffic from a client and distributes it across the cluster, but this clashes with Neo4j's default "smart client routing" model, where the client needs to choose which machine in the cluster the query goes to. When you put a LoadBalancer in front of a cluster without enabling server-side routing, the "trouble" you get into is client-side errors. The client tries to route to a particular member, the LB undoes that, and you get errors.
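
As a rough illustration of that clash, here is a toy Python model. It is not the Bolt protocol or the real driver: the member names, the MEMBERS and ROUTING_TABLE dicts, and every helper function are invented for this sketch, and server-side routing is deliberately left out. The client picks a member by role, and a round-robin load balancer then either is absent or overrides that choice.

```python
import itertools

# Hypothetical 3-core + 1-replica cluster; names and roles are made up.
MEMBERS = {
    "core-0": "LEADER",
    "core-1": "FOLLOWER",
    "core-2": "FOLLOWER",
    "replica-0": "READ_REPLICA",
}

# What a routing-aware client believes after fetching a routing table.
ROUTING_TABLE = {
    "WRITE": ["core-0"],
    "READ": ["core-1", "core-2", "replica-0"],
}


def execute(member, mode):
    """Pretend to run a query: in this model only the leader accepts writes."""
    if mode == "WRITE" and MEMBERS[member] != "LEADER":
        return f"error: {member} is not the leader"
    return f"ok: {mode} on {member}"


def make_picker():
    """One round-robin iterator per access mode, as a stand-in for the client's choice."""
    return {mode: itertools.cycle(names) for mode, names in ROUTING_TABLE.items()}


def send_direct(mode, picker):
    """Client-side routing: the member the client chose receives the query."""
    return execute(next(picker[mode]), mode)


def send_via_lb(mode, picker, lb_cycle):
    """Same client, but a load balancer in the middle ignores the client's choice."""
    next(picker[mode])        # the client still picks a member...
    actual = next(lb_cycle)   # ...but the LB forwards to whichever member is next
    return execute(actual, mode)


picker = make_picker()
direct = [send_direct("WRITE", picker) for _ in range(3)]

picker = make_picker()
lb = itertools.cycle(MEMBERS)  # round-robin over every member, regardless of role
balanced = [send_via_lb("WRITE", picker, lb) for _ in range(4)]

for line in direct + balanced:
    print(line)
```

In this model the three direct writes all reach the leader, while three of the four load-balanced writes land on a non-leader and fail. With server-side routing enabled, the receiving member would be expected to forward the query instead of rejecting it, which this sketch does not attempt to show.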

I was trying to keep it on the simple side, but the fuller answer is this: if you read this article, it'll tell you how querying clusters in Neo4j actually works at the client layer, and explain why load balancers are actively harmful. Querying Neo4j Clusters. How Neo4j clusters and smart query… | by David Allen | Neo4j Developer Blog | Medium

Now, this doesn't mean that you can't use an LB with Neo4j. It just means that if you want to expose traffic externally through an LB, you must either follow the linked instructions I gave earlier in the thread (hard) or upgrade to the newer Helm chart (easier).