Neo4j Enterprise 4.0 on GCP doesn't work out of the box

Hi there, I'm launching Neo4j Enterprise on Google Cloud Platform. At the time of this writing, the current version of Neo4j is 4.0.2.

That being said, when I launch a cluster in GCP the cluster is created successfully. Great. :+1:

Then when I go to the IP address associated with the cluster to log in and confirm the cluster is operational, a host of issues immediately arise.

  1. Websocket connection issue
  2. Authentication issue

How has this affected us?
I've lost days debugging these issues. I don't have days to lose: I'm in a time-sensitive project.

How should it be different?
I expect Neo4j Enterprise to work, right out of the box. You deploy a cluster. Upon successful cluster creation, you go to the designated cluster entry point, auth in, and bam, you're in and everything works.
Instead, the current experience is that after successful cluster creation you cannot auth in, and you lose days trying to figure things out, leaving you frustrated enough to write in this forum. lol :laughing:

Steps I've taken to resolve the issue:

  1. The first issue is the websocket issue. I noticed the error says your browser's security does not allow for this. I noticed the SSL cert was bad, so I figured they were connected. I followed @david.allen's wonderful blog post, but it's written for v3, not v4, so the core settings @david.allen suggests are no longer relevant. Please, @david.allen, consider writing another blog post for v4. It'd save the community hours of work, I'm sure of it. After hours of hacking at neo4j's settings (trying to translate what David was saying for v3 to v4), I was able to get the single Neo4j VM to serve the SSL cert as the default cert. I could successfully go to our neo4j sub-domain (which now had a valid cert) but found that the websocket issue persisted, FML, lol. So apparently the SSL cert is not related to the fact that Neo4j has a websocket issue. I did find that if I opened port 7474 in the GCP Firewall and then went there, I could auth in without a websocket issue. So HTTP + auth=false => no websocket issue. HTTPS + auth=false => websocket issue still...hey...at least HTTP was kinda working, right? Turns out auth errors / write errors began at this point (second issue described below).

Summary: it looks like the websocket issue isn't related to a valid SSL cert being present; it's more like it has to do with a security setting in neo4j.conf. I'm not sure which security setting it's related to. You can disable security and hit HTTP to get around the issue, but that's a security flaw, not a solution for an enterprise setting.

  2. I can now auth into the database with HTTP (I'll take the mini-victory), but these errors then arise:
  • After successfully logging in:
    • Cannot write to database, only a leader can write, this is a follower. (note: every VM in the cluster says this. So you cannot write at all? :laughing: Who the heck is the leader -- I just need to write to the DB damn it, lol)
    • Eventually, it says cannot write / read, because you've tried to authenticate too many times from this client...making me unable to run any commands (note: auth=false, so why is this popping up?)
  • Eventually, when I go back to the Browser interface for the DB I see:
    The client is unauthorized due to authentication failure (even though auth=false and it previously just allowed me in with the same user / pw). Now I can no longer auth into the DB, FML again.

I'm about to destroy this cluster and repeat this insane set of steps to try to resolve these issues. I'd sincerely appreciate help from anyone who knows how to solve any of these issues.

To the fine folks @ Neo4j:
Look, I'm no noob...I've built two graphs with billions of nodes each, operating at web scale, but I gotta say, this experience seems whack, even to a senior software engineer. lol

I expect more from Neo4j Enterprise. I expect it to work out of the box. I can't tell you how many times I've torn down this cluster and recreated a new one, trying to resolve these little bugs and just get v4 to work. I, and I'm sure the community agrees, expect there to be no websocket issue. If there is a prerequisite step that your engineers are aware of that's needed to resolve the issue, then a patch should be released or the solution should be baked into the deployment script (e.g., if the websocket issue was caused by missing SSL certs, then when making a cluster, drag and drop your SSL cert here to configure your cluster).

Btw, v3.5 worked beautifully right out of the box. No issues. It's just 4.0 that has these issues. I'm losing days on this...I'm about to lose another day... :tomato:

Can you answer these questions:

  1. How do you solve the websocket connection issue for neo4j 4.0? It's preventing developers from authing into the DB :cry:
  2. Are there additional steps required to make a GCP Cloud Deployment work? Why doesn't the initial username / initial password work after some time? Are you supposed to change it?
  3. RE: Follower vs leader issue: who am I supposed to be asking to do a write?

Thanks for being as patient with this topic as I've been with the issue :laughing:

  • neo4j version: 4.0.2, browser version: 4.0.5

Let's try to tackle your three questions:

How do you solve the websocket connection issue for neo4j 4.0?

You are right, these websocket errors are generally related to self-signed SSL certificates (SSCs). The browser usually throws up a scare warning about these, but then also the Neo4j driver under the covers really doesn't like to accept them by default, for security reasons. That's what I think is happening: you get to the browser page, and then fail to log in with the websocket error because the driver that Browser is using won't accept self-signed certificates.

The solution to this one is going to be to get valid signed certificates, e.g. with LetsEncrypt. Instructions are here, but yes beware -- I haven't updated this blog post for 4.0 settings, so you might have to cross-check differences in the settings. https://medium.com/neo4j/getting-certificates-for-neo4j-with-letsencrypt-a8d05c415bbd

Are there additional steps required to make a GCP Cloud Deployment work? Why doesn't the initial username / initial password work after some time? Are you supposed to change it?

It should work out of the box. I'll take up the issue with our drivers team and browser team to see if we can get defaults published that will work with SSCs. The trouble here is that in a self-deployed cloud scenario, you have to start with SSCs.

The initial username and password do work, and you don't have to change them (we automatically generate a secure password for you) - I think you're not getting as far as the username/password check, because of the SSC issue.

RE: Follower vs leader issue: who am I supposed to be asking to do a write?

Short answer: to do a write you have to talk to the leader. Longer answer: Neo4j uses a smart client routing approach, and has a cluster architecture where different machines in the cluster adopt different "roles" with respect to the data. This means that your client has to know where to route writes, and that's always to the leader.

Really long answer if you want to know how the guts work: https://medium.com/neo4j/querying-neo4j-clusters-7d6fde75b5b4
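
To make the "talk to the leader" point concrete, here is a minimal Python sketch of the lookup a routing-aware client has to do. The sample rows are hypothetical and only shaped like what I understand `CALL dbms.cluster.overview()` to return on a 4.x cluster (one row per member, with a map of database name to role); treat that shape, the host names, and the `find_leader` helper as assumptions to check against your own cluster.

```python
from typing import Optional

# Hypothetical rows, shaped like what I understand `CALL dbms.cluster.overview()`
# to return on a 4.x cluster (an assumption; verify against your version).
SAMPLE_OVERVIEW = [
    {"addresses": ["bolt://core-1:7687"], "databases": {"neo4j": "FOLLOWER", "system": "LEADER"}},
    {"addresses": ["bolt://core-2:7687"], "databases": {"neo4j": "LEADER", "system": "FOLLOWER"}},
    {"addresses": ["bolt://core-3:7687"], "databases": {"neo4j": "FOLLOWER", "system": "FOLLOWER"}},
]


def find_leader(rows, database: str = "neo4j") -> Optional[str]:
    """Return the first advertised address of the member leading `database`, or None."""
    for row in rows:
        if row["databases"].get(database) == "LEADER":
            return row["addresses"][0]
    return None


print(find_leader(SAMPLE_OVERVIEW))            # bolt://core-2:7687
print(find_leader(SAMPLE_OVERVIEW, "system"))  # bolt://core-1:7687
```

Note that the role is per database in this sketch, so different members can lead different databases. In practice you would not normally do this by hand: a driver opened with a routing URI (the `neo4j://` scheme in the 4.0 drivers, as far as I know) discovers the leader and sends write transactions there for you, whereas pointing a client at one fixed member can land you on a follower.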

Cross-post from another thread discussing the same thing: Websocket connection failed - possible certificate chain issue

I want to add some follow-up and fix instructions. Here's a config snippet with some relevant things:

     dbms.ssl.policy.bolt.enabled=true
     dbms.ssl.policy.bolt.client_auth=NONE
     dbms.ssl.policy.https.enabled=true
     dbms.ssl.policy.https.client_auth=NONE
     dbms.connector.https.enabled=true
     dbms.connector.bolt.tls_level=REQUIRED
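
If you want to confirm that a given neo4j.conf really carries these six settings, a small stdlib-only Python check like the one below may help. It is only a sketch: `parse_conf` and `mismatches` are hypothetical helper names, the parser handles plain `key=value` lines and `#` comments only, and keeping the last occurrence of a repeated key is my assumption, not documented Neo4j behaviour.

```python
# Settings from the snippet above; the values are compared as plain strings.
EXPECTED = {
    "dbms.ssl.policy.bolt.enabled": "true",
    "dbms.ssl.policy.bolt.client_auth": "NONE",
    "dbms.ssl.policy.https.enabled": "true",
    "dbms.ssl.policy.https.client_auth": "NONE",
    "dbms.connector.https.enabled": "true",
    "dbms.connector.bolt.tls_level": "REQUIRED",
}


def parse_conf(text: str) -> dict:
    """Parse key=value lines, skipping blanks and # comments (last occurrence wins)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def mismatches(text: str, expected: dict = EXPECTED) -> dict:
    """Map each expected key whose value differs to the value found (None if absent)."""
    actual = parse_conf(text)
    return {key: actual.get(key) for key, want in expected.items() if actual.get(key) != want}
```

Usage would be something like `mismatches(open(path_to_conf).read())`, where `path_to_conf` is wherever your image keeps neo4j.conf; an empty dict means all six settings match.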

Two things to get this working with SSCs, that is without getting signed certs.

  • First, make sure to set client_auth to NONE as in the example above. The product default is asking the client to pass certs, which is gumming up the process. We'll look to fix this in the next cloud release.
  • Second - different browsers handle these policy issues differently. In Chrome, when you "Trust" the HTTPS cert, it does not trust the cert on port 7687. This is relevant because Neo4j Browser makes a connection on the bolt port. So you have to convince Chrome to accept the cert on port 7687 as well.

To accomplish this, first make sure you've disabled client auth. Then, visit https://myhost:7687 -- we don't really care what this page has (in fact the page will be broken because HTTPS isn't bolt) -- but it will prompt Chrome to get you to accept the cert on this port. Once that's done, you should be able to log in with an SSC using HTTPS. This time the login will succeed, because Chrome trusts the same cert on port 7687. Browser will make that connection, and it will work.
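
If you are unsure whether the HTTPS port and the bolt port are actually presenting the same certificate, a rough Python check using only the standard library is sketched below. The helper names are hypothetical, port 7473 is my assumption for the HTTPS connector (use whatever port you configured), and the fetch deliberately skips chain validation so that it works with an SSC.

```python
import hashlib
import ssl


def fingerprint(pem_cert: str) -> str:
    """SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def port_fingerprints(host: str, ports=(7473, 7687)) -> dict:
    """Fetch the certificate served on each port and return {port: fingerprint}.

    ssl.get_server_certificate does not validate the chain when no CA file
    is supplied, so self-signed certificates are returned as well.
    """
    return {port: fingerprint(ssl.get_server_certificate((host, port))) for port in ports}


def same_cert(fingerprints: dict) -> bool:
    """True when every port presented an identical certificate."""
    return len(set(fingerprints.values())) == 1
```

Calling `same_cert(port_fingerprints("myhost"))` against your own host tells you whether both ports serve one cert. Even when they do, my understanding is that the browser still tracks the trust exception per port, which is why visiting the bolt port once is needed.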

Hope this helps.

@david.allen you rock for answering this! I haven't tried the snippets you provided yet.

Instead, I created a new cluster, and cannot auth into it with the initial user / password.

Screenshot of the successful deploy on GCP.

try to login at:

(note: this cluster is for demonstration purposes, so it's okay to share the creds in this circumstance)

you should see this unauthorized error:

PS: the websocket issue seems to have gone away with Chrome 81? Looks like Firefox still has it. I'll confirm on the next cluster I create whether the websocket issue went away with the Chrome update :laughing:

@david.allen can you go to the URL and confirm whether you're able to auth in with these creds? Maybe a bug with the current deploy script? If you can auth in, please let me know too.

Thank you. In the meantime, I'll spin up another cluster :smiley:

(using latest image released today 4/16/2020: neo4j version is 4.0.3)

Sorry - for various security sensitivity reasons, I'm not willing to log into machines that folks stand up. Please have a look at the previous guidance I posted in detail -- and please do avoid posting login credentials (even if temporary).

:+1:

Cool, you don't have to use my cluster to replicate the bug. Just launch a new GCP cluster right now and observe the results, following the steps I outlined above.

It looks like the neo4j team deployed v4.0.3 to GCP today (4/16/20) at 12:31 PM. New clusters are not allowing developers to log in. To validate this is an ongoing issue, I've created 2 more clusters besides the test cluster I shared. Every new cluster is reporting the initial username / initial password is invalid.


(screenshot of another newly launched cluster, that says the initial user / initial password are incorrect)

Please take 5 mins to launch a new cluster on GCP, go to the URL provided, submit that cluster's initial username and password, and observe the results. Please let me know if you're able to replicate the issue on your end.

PS to other people searching for bug fixes:
Just to point out: this is not an issue in v3.5, nor did I observe this most recent issue with v4.0.2. This is occurring on the v4.0.3 image on GCP released today. So if anyone wants something that works, I cannot recommend 4.0, due to these numerous issues you're seeing discussed in this post, but I can say version 3.5 worked out of the box. I may drop hope for v4.0 and revert to v3.5.

What's a developer got to do to get horizontal scaling to work on a graph DB? :broken_heart: :laughing: :broken_heart:

@david.allen, my teammate and fellow developer figured out the issue.

All new clusters are setting username to neo4j and password to neo4j...instead of the initial cluster password. Hence if you use the initial password, you get an auth error. Yet if you use neo4j as the password, you're in the door.

This should be considered a security vulnerability. Please release a patch and fix.

This issue should only affect new clusters created at the time of this writing.

Now that we've resolved that issue, moving on to fix the other issues. More to come :slight_smile:

Investigating this today with help from my colleague CC: @bledi.feshti1

Thanks for hanging in there ;) your linux username is permanently in our VMs...literally..and for legendary purposes too :slight_smile:

Also, websocket issue went away with Chrome 81. Still exists for FF, IE, etc.

We can now successfully auth in and populate the cluster. One last question...is horizontal scaling enabled by default or is there a switch we need to flip?

Thanks again :smiley:

@NawarA A patched 4.0.3 is now available and you can log in using the initial password displayed.
Regarding your question, if you want to scale reads you have to deploy more Read Replicas.
If you want to scale writes, adding more hardware to the leader can help; writes cannot be scaled horizontally.
