Missing or double reads

Hello all,

I'm experiencing random missing or double reads with my queries and would be glad to get a definite answer on whether or not this is expected. I have already read the dedicated section in the docs.

My data model looks like so:

I created it with the following query:

MERGE(c:Container{})
WITH c
UNWIND [1,2,3] as id
MERGE(c)-[r1:Owns]->(e1:ConstituentA{id: id})
MERGE(c)-[r2:Owns]->(e2:ConstituentB{id: id})
RETURN c, e1, e2, r1,r2

But that's only demonstration data. The real, problematic business logic involves two distinct steps:

  • ConstituentA nodes are created first, in one transaction each, using a query that is very similar to the one above
  • Once they are all created (all transactions are complete and committed), I run dozens of concurrent requests that fall into two categories:
    • some create ConstituentB nodes, in the same way ConstituentA were created
    • others look up ConstituentA nodes using a query that resembles the one below:
OPTIONAL MATCH(c:Container)
UNWIND [{
    label: "ConstituentA", id: 1
}, {
    label: "ConstituentA", id: 10
}] AS constituentDef
OPTIONAL MATCH(c)-[:Owns]->(constituent{id:constituentDef.id})
WHERE
    (constituentDef.label = 'ConstituentA' AND constituent:ConstituentA)
WITH c, constituent IS NOT NULL as constituentExists
RETURN c IS NOT NULL AS containerExists, collect(constituentExists) as constituentsExist

In other words, there are dozens of transactions creating ConstituentB nodes linked to the common Container node with Owns relationships while multiple other transactions try to find ConstituentA nodes by traversing Owns relationships starting on the same Container node.

A typical correct result based on my demo data above would be:

╒═══════════════╤═════════════════╕
│containerExists│constituentsExist│
╞═══════════════╪═════════════════╡
│true           │[true, false]    │
└───────────────┴─────────────────┘

But on some occasions (maybe 1 out of 20), a ConstituentA with the expected id cannot be found, or I get two occurrences of it, which leaves constituentsExist with one item too many.

To give a rough sense of the numbers involved: the typical use case involves the creation of 50 ConstituentB nodes while looking up 50 ConstituentA nodes in 50 distinct transactions.

My question is: is this a case where the concurrent matching and creation of Owns relationships on the same node (Container, here) can lead to missing or double reads? The doc is not very clear on that point and since the nodes being created do not hold the same label as the ones being looked up, I was not expecting the issue to occur.

Any help very much appreciated! Thanks for reading thus far :slight_smile:

I think the problem is in this section:

OPTIONAL MATCH(c)-[:Owns]->(constituent{id:constituentDef.id})
WHERE
    (constituentDef.label = 'ConstituentA' AND constituent:ConstituentA)

constituentDef.label is a property of the node called “label”, not the node’s labels. Label constraints are applied using the node:label syntax in the MATCH pattern.

So the above should be written as:

OPTIONAL MATCH(c)-[:Owns]->(constituent:ConstituentA {id:constituentDef.id})

(I’m not sure about the WHERE constituent:ConstituentA - I’ve never used that pattern for node label constraint, I’m surprised it’s valid. It’s probably not doing what you hope, and hence the weird results.)

Not quite: constituentDef is a map variable coming from the UNWIND above, not to be confused with constituent which is the node. Probably not my best idea to have used so close a name.

I’m not sure about the WHERE constituent:ConstituentA

To be honest, that's not how I implemented it initially either. I used the traditional pattern with the label. But that's how it is rewritten in the execution plan. And as far as I can tell, that works identically.

Ahh, I see. Always more to learn about Cypher. :grinning_face_with_smiling_eyes:

If I’ve understood correctly:

  1. The A nodes are all created.
  2. Then in parallel, the B nodes are created and the A nodes are queried.
  3. Your A nodes queries have unexpected results.

I can’t think of any explanation; most likely it’s a subtle bug in the queries. Are you sure it only happens during the parallel activity of querying As and writing Bs? Can you reproduce the problem either without writing Bs, or after writing a bunch of them?

If you can narrow it down to a static graph and a query that gives the wrong result, it’d be much easier to look for the problem in the query.

Try this: Code in Bold:

OPTIONAL MATCH(c)-[:Owns]->(constituent{id:constituentDef.id})
WHERE
    (constituentDef.label = 'ConstituentA' AND constituent:ConstituentA)
//return count(constituent) as cnt
with distinct constituent, c
WITH c, constituent IS NOT NULL as constituentExists
RETURN c IS NOT NULL AS containerExists, collect(constituentExists) as constituentsExist

from this creation, are you sure you don't actually get multiple ConstituentA nodes with the same id? Do you have a uniqueness constraint to ensure that?

The MERGE on a pattern checks whether the full pattern exists and, if not, creates everything in it that is not already bound. Here c is bound, so it is reused, but both r1 and e1 would be created even if a matching e1 node already exists, as long as no r1 relationship connects it to c.
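As a minimal sketch of how to avoid that, the node can be merged on its own before the relationship, so that only the relationship is created when the node already exists. This assumes ids are meant to be unique across the whole graph rather than per Container, which the original post does not state:

```cypher
MERGE (c:Container)
WITH c
UNWIND [1, 2, 3] AS id
// Bind (or create) the node by itself first. Assumption: id is globally unique.
MERGE (e1:ConstituentA {id: id})
// e1 and c are both bound now, so this MERGE can only create the relationship.
MERGE (c)-[r1:Owns]->(e1)
RETURN c, e1, r1
```

If ids are only unique within a Container, this rewrite would wrongly share nodes between containers, so the pattern form from the first post would have to stay.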

Double reads might at least be explained if the data is duplicated (though that wouldn't cover the missing ones) and if it's always the same ids that are duplicated.
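A sketch of how this could be checked and then enforced, as two separate statements. The constraint name constituent_a_id_unique is made up for illustration, and the syntax assumes Neo4j 5:

```cypher
// Lists every id carried by more than one ConstituentA node.
MATCH (a:ConstituentA)
WITH a.id AS id, count(*) AS occurrences
WHERE occurrences > 1
RETURN id, occurrences
ORDER BY occurrences DESC;

// Hypothetical constraint name; rejects future duplicates of ConstituentA.id.
CREATE CONSTRAINT constituent_a_id_unique IF NOT EXISTS
FOR (a:ConstituentA) REQUIRE a.id IS UNIQUE;
```

If the first query returns no rows, duplication is ruled out for that label. Creating the constraint fails while duplicates exist, so the check has to come first.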


Yeah, it's the same thing as putting it in the MATCH clause :D


This could be a good idea, to make sure you don't see the same node twice for some reason. Otherwise I agree with David that it would be good if we could get it down to a reproducible smaller static example to be able to investigate what's going on.

Thanks all. Let me try to answer all of your questions.

@david.pond yes, you've summarized the context pretty accurately.

Are you sure it only happens during the parallel activity of querying As and writing Bs

I could only reproduce the problem in that context, yes. The backend code (interacting with the DB) has been used for months without a single occurrence of that issue. I saw it only when running a specific client program against the backend, which does more intensive operations (that client is intended to be an example application).

Can you reproduce the problem either without writing Bs, or after writing a bunch of them?

No, I could not. In hundreds of runs, the query reading ConstituentA nodes always behaved correctly if no writes occurred concurrently. I also confirmed that running the reading query twice sequentially (in separate transactions) shows that when one fails, the other does not. To me, that confirms that there is a timing / synchronization issue between transactions.

If you can narrow it down to a static graph and a query that gives the wrong result

Agreed. I already tried to write a simple C# program reproducing the issue, with no luck thus far. If it is a timing issue, simplifying the code may very well make the issue harder to reproduce. I'll try again though!

@ameyasoft

Try this: Code in Bold:

I am not sure how the DISTINCT keyword would help with the missing reads, but I'll try that and will let you know, thanks.

@therese.magnusson

from this creation, are you sure you don't actually get multiple ConstituentA nodes with the same id

Yes, the query that creates nodes guarantees id uniqueness. I don't have any native constraint guarding against that (it would be possible for some nodes but not all) but I have checked in the DB that there are no id collisions.

Double reads might at least be explained if the data is duplicated

I agree with your analysis but unfortunately, I already confirmed it is not as simple as an id duplication (I wish it was :slight_smile:)

if we could get it down to a reproducible smaller static example

Yes, 100% agreed. I'm so frustrated that I have not been able to put a simple reproducer together yet!

All, thanks again for your time. Even though you have not cleared up the mystery yet, your answers already confirm that there is no obvious mistake on my part. I will keep digging and will focus on trying to reproduce the problem on a small program that I can share here.

The one point none of you really touched on is the possibility that Owns relationships being created and traversed simultaneously could put the query in a situation described here. After all, if relationships are indexed by label, iterating over relationships of the same label may be impacted by insertions within that index occurring at the same time, don't you think? But if that's a possible explanation, I would be terribly disappointed that Neo4j cannot handle such a simple scenario reliably.

The way this query is written, it will be executed for all Container nodes. I assume that is your intent, and you want to know if these two ConstituentA nodes exist for each Container node.

One issue I see is with your “collect” logic. Neo4j groups the records to collect based on the uniqueness of the grouping variables. These are the variables in the ‘with’ or ‘return’ statement that are outside of the ‘collect’ statement. In your case, you are grouping on ‘containerExists’ values, which are only ‘true’ or ‘false’. As such, you should get only two rows from the query with the collected array containing all the constituentExists results for Container nodes found and Container nodes not found. Here is an example of the behavior:
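A small stand-alone sketch of that grouping, using made-up literal rows instead of real Container nodes (the container names 'c1' and 'c2' are purely illustrative):

```cypher
UNWIND [
  {container: 'c1', constituentExists: true},
  {container: 'c1', constituentExists: false},
  {container: 'c2', constituentExists: true},
  {container: null, constituentExists: false}
] AS row
// The only grouping key is the boolean below, so 'c1' and 'c2' end up in the same group.
RETURN row.container IS NOT NULL AS containerExists,
       collect(row.constituentExists) AS constituentsExist
```

This returns one row per distinct value of containerExists: the `true` row collects the three values coming from both 'c1' and 'c2', and the `false` row collects the single remaining value. Adding row.container to the RETURN would make it part of the grouping key and split the results per container.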

Here is a refactored query that removes that behavior and is a little easier to read. You may want to return a unique identifier for the Container node, so you know how the data is related.

WITH [
  {label: "ConstituentA", id: 1}, 
  {label: "ConstituentA", id: 10}
] AS constituentDefs
OPTIONAL match(n:Container) 
RETURN n is not null as ContainerExists,
COLLECT{
  UNWIND constituentDefs as constituentDef
  RETURN EXISTS((n)-[:Owns]->(:$(constituentDef.label){id:constituentDef.id}))
} as ConstituentsExists

Note: The link you provided to the index behavior was interesting. I did not know that, and now am suspicious of the accuracy of index scans in a highly concurrent environment. I would figure the index update would be part of the write transaction, so the data and index are consistent.

I am not sure that would explain your situation though, as you are getting your constituentA node by a path expansion from the Container node. This would not use an index. An index would come into play if you had some property constraints on your match for the Container node. Your query has none, so it will do a scan on the label.

The way this query is written, it will be executed for all Container nodes. I assume that is your intent, and you want to know if these two ConstituentA nodes exist for each Container node.

You are right. But there is only one Container node in the sample data, as you can read in the query I used to populate the DB. That was my intention to keep the queries as simple as possible. My real-life data model has multiple of those container nodes but the first match comes with additional constraints that guarantee there is only one node matched.

I did not know that the collect() function behaves differently from the COLLECT subquery, which kind of makes sense now that you have said it. However that does not explain the inconsistent results I get since I am 100% sure there is only one Container node matched by my query.

I would figure the index update would be part of the write transaction, so the data and index are consistent.

They surely are. But a write lock acquired by one transaction does not prevent other transactions from reading data, so they can see a sudden reordering of the index while they are iterating over it. Which is VERY scary.

I too am now very paranoid about when it is safe to trust the index in a read-only transaction.

Anyways, thanks for taking the time to investigate and provide an interesting point about collect() vs COLLECT {}. I have other use cases of collect() that I'll give a quick reread just for the sake of safety :sweat_smile:

You're welcome.

I use the subqueries a lot: exists, collect, and count. They are expressions, so they are very flexible.

I am not sure that would explain your situation though, as you are getting your constituentA node by a path expansion from the Container node. This would not use an index.

I think there is an expand option where you find both endpoints of the relationship and then expand between them, in which case you could use an index for the constituentA nodes :thinking: But I'm currently not 100% sure of it :person_shrugging:

Quick update before wrapping up another day of desperate debugging: I noticed my version of the .NET driver was very old (5.23.0, mid 2024). Upgrading to the latest 6.0.0 improved overall performance by roughly 50% but I still encounter inaccurate query results from time to time. Maybe less often but I am not sure.

I added the "WITH DISTINCT" suggested by @ameyasoft, which did not get rid of the problem. That did change the behavior a bit, still: after migrating to the latest driver, I noticed that I was seeing double reads only and no longer missed reads. Adding WITH DISTINCT made the missed reads happen again.

It really looks to me like this is a nasty timing issue. When changing the implementation, I most likely alter the timing ever so slightly, which explains why the symptoms vary. The only thing that does not vary is that I cannot depend on the queries because they fail every 30 attempts or so.

It’s useful when debugging to know how neo4j has planned your query. If you could post the text output from EXPLAIN that might give us some clues; more information on that here
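As a minimal sketch, prefixing the read query from the first post with EXPLAIN returns the plan without executing the query. Replacing EXPLAIN with PROFILE would run it and also report the rows flowing through each operator:

```cypher
EXPLAIN
OPTIONAL MATCH (c:Container)
UNWIND [
  {label: "ConstituentA", id: 1},
  {label: "ConstituentA", id: 10}
] AS constituentDef
OPTIONAL MATCH (c)-[:Owns]->(constituent {id: constituentDef.id})
WHERE constituentDef.label = 'ConstituentA' AND constituent:ConstituentA
WITH c, constituent IS NOT NULL AS constituentExists
RETURN c IS NOT NULL AS containerExists, collect(constituentExists) AS constituentsExist
```

The operator names in the output would show whether ConstituentA nodes are reached by expanding from the Container node or through a scan or index seek.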

Please share a screenshot of double reads and some other nodes around it. Just a wild chase!!

There is nothing more than the graph visible on my first post. The missing or double reads are not for a specific node. They happen randomly and are obviously triggered by a bad timing in the concurrency.

I have finally managed to reproduce the issue with a simple C# program. I am currently cleaning that up so that I can share it here. You will then be able to see for yourself

I'll share the execution plans as well. I already checked them before posting here but that did not help much. Neo4j's documentation is too shallow on what operators really do, especially on what they do with indexes. And not knowing that is a problem :slightly_frowning_face:

Stay tuned!

Thank you very much.

We’re very open to this feedback, if you can give me more context about what you’d like to see in the documentation for these operators then we can work with the docs team to make that happen

Hello everyone,

Sorry for the dry spell. It took me a while to prepare a reproducer that is easy to execute, but here you go:

Neo4jReproducer.zip.txt (12.1 KB) (zip file disguised as a txt file :detective:)

It contains a C# program with a ready to use Docker Compose project. All instructions are in the README.md.

Note that the bug is still not reproducible on demand. Make sure to rerun the program if you do not reproduce the issue, or use the --attempts option (or the REPRODUCER_ATTEMPTS environment variable) to let the program repeat the scenario multiple times until it fails. I usually manage to reproduce it within 20 attempts.

The compose setup embeds an instance of the Aspire Dashboard that collects logs and traces. Very handy to visualize what the program does and where it failed.

Let me know if you managed to reproduce the problem or if you have any questions :slight_smile:

Basically, I'd like to know what indexes an operator is likely to use and if it is safe to mutate that index while this operator is being executed. From the documentation, I understood that it is never safe to do so but from this thread I am no longer so sure about that.

Just a question: when creating ConstituentB nodes from the selected ConstituentA nodes, are you creating the ConstituentB nodes with ids matching those of ConstituentA? Let me know. Thanks