Driver hanging and eventually failing with empty read buffer

Hi @AndyHeap-NeoTech and @charlotte.skardon,

We're currently having an issue after updating from the 1.x neo4j-dotnet-driver to the 4.x driver. For the most part everything works fine, but we hit a problem when we tried to run an export via an async session that writes 16M+ records to a file. Writing that many records to a file is a requirement that I can't work around, unfortunately.

Typically, the driver hangs indefinitely. It does manage to write the first 1.5MB to disk, but then doesn't continue, usually.
I was able to get a run to export 13M of the 16M rows before it failed, after 10hrs+, with the empty read buffer / session timeout error.

I came across this post that sounds almost identical. The error is the same at any rate, when it does error out.

We're running driver v 4.1.1 and pointing to neo4j-enterprise v3.5.17 on a 3 server cluster.

Prior to the upgrade of the dotnet driver, the export ran in approx 20 minutes, so there's definitely something going on with it. Initially I was only trying to find out whether the fetchSize might be incompatible with v3.5.x, or whether there was something other than setting fetchSize to Infinite to try.

Any help would be appreciated!

Also, should I have a support ticket created to track this? I wasn't sure if the support contract covers the drivers.

Thanks!
-Mike French

Good Morning Mike,

This does look like it could be related. I have a couple of questions and suggestions to try.

  1. Could you try running on the 4.0 driver and let us know if the problem persists? There was an internal change between 4.0 and 4.1 and this will help to check if that is involved in some way.
  2. What does the query look like? You don't have to supply the actual query if you don't want to, but a general idea would be useful if I have to start thinking about duplicating this and making a test data set.
  3. Are you using transactions and/or transaction functions, or are you using the auto-commit functionality, e.g. session.RunAsync("my query");?

Thanks
Andy

Hey @AndyHeap-NeoTech thanks for responding,

A little more info:

We skipped from 1.7.2 directly to 4.1, so I can't say right off if 4.0 had the issue. I can set up another test, but with 4.1.1 it takes over 10hrs before it fails, so that might have to be an over-the-weekend thing.

I will say, I threw together a console app with driver 1.7.2 (because this was blocking a deliverable) and it ran in the expected 15-20min range.

Code (file-writing bits and business logic removed; error handling is outside this block):

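var session = _driver.AsyncSession();

try
{
  // Stream the records inside a read transaction function.
  await session.ReadTransactionAsync(async tx =>
  {
    // Named "parameters" because "params" is a reserved C# keyword.
    var cursor = await tx.RunAsync(query, parameters);

    // Pull one record at a time and hand it to the file writer.
    while (await cursor.FetchAsync())
    {
      Output(cursor.Current);
    }
  });
}
finally
{
  await session.CloseAsync();
}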

Query Pattern (pretty simple one just a lot of data):

MATCH (n)<-[:REL_1]-(root:MyLabel_2021_1_1)-[:REL_2]->(w)
RETURN 
   root.prop1, 
   root.prop2, 
   w.prop3,
   root.prop4,
   n.prop7,
   CASE WHEN toFloat(n.prop7) > 0 
      THEN (root.num / toFloat(n.prop7) * n.prop8) ELSE 0 END AS calc1

The return is 18 props, mostly from root.
Each root has exactly one REL_1 and one REL_2.
There are anywhere from 15M to 41M root nodes depending on the label (we're using label suffixes to keep temporal queries performant).
The planner always uses the root label as intended.

The code for 1.7.2 is about the same.

I did try setting the fetchSize to infinite and upping the Session and Transaction timeouts to 10ish minutes. It still took 10hrs and never completed (didn't return a specific exception though).

Seeing the other post this morning, I wonder if the routing table update may have caused the Exception?

Anyway, best of luck! I've been digging through the driver on github to see if I can figure out the issue as well but haven't stumbled on anything yet (still getting familiar with the code).

On a related note: I really like the way you wrapped the throwing extensions and call them from a helper class. Throw.If..... That's such a great idea for readability!

Thanks,
Mike

Hi Mike,

The reason I suggested trying 4.0 is that there was a low-level reworking of the code that reads from the network stream for 4.1.1. Using 4.0 would help to identify or eliminate that as the culprit, and potentially get you up and running with a newer version of the driver without the need for your console app.

I've added this to my to do list and will investigate more thoroughly, once a couple of high priority items are out of the way.

Thanks
Andy