Feature request: Prevent accidental introduction of new variables without label

I would like to propose a feature request to prevent accidental introduction of new variables without label. It has hurt several times already, here an example:

We had a query where the last both lines were copied from an almost similar query from another part of our code. Accidentally, it was overseen that here “p” is used as variable for Position and in that other query it was “position”.

MATCH (p:Position)
WHERE p.uuid = “dummy”
OPTIONAL MATCH (:User)-[existingRole:HAS_ROLE {responsible:true}]->(position)
SET existingRole.responsible = false

Unfortunately, the result of that query is, that the responsible property was set to false on EVERY HAS_ROLE relation between any user and any other node where it was true before (not even Position, we also have HAS_ROLE relations on companies and others…) – about 150 relations affected, instead of that one that was intended. Same could have happened due to a simple typo.

Regarding the hundrets of queries in our code, and my experience using Neo4j for 9 years, it is really exceptional for us to introduce new variables without a label definition. I guess we have 2 or 3 queries were we do this by intent. And it is a really dangerous behaviour as on big cyphers, it’s so easy not to recognize it and the scope is always "to all". It’s the only pattern that really had damaged our productive data several times (luckily able to fix it).

My proposal is: Introduce a setting to prevent the introduction of new variables without a label definition unless a pattern like (variable:null) or (variable:) is used. In that way, the writer would admit his intent to really want to use a new unlabelled variable, in other cases, an error thrown prevents accidental data damage.

Users who do not like this do not need to use this setting, for all others it would be a massive security advantage.

By the way, the requested feature is exactly the behavior that every new developer in our team who started with Neo4j would have expected. They were all surprised that there was no error but instead sometimes massive data loss in our test database.

Best,
Reiner

To note, it is extremely common in Cypher to not need to provide a label to variables used in a pattern, and we've never had a request like this or a need for this kind of restriction over the years.

Note that when a label is present in a pattern, that affords a greater chance for the query planner to consider that node as an anchor node (whether using a label scan or an index lookup), as a potential starting point into the graph, and a place to expand from (or perform a NodeHashJoin in concert with anchoring and expansion from other nodes). More potential anchor nodes introduces greater chances for the planner to get this wrong, or to use an inefficient plan by anchoring to certain nodes and performing inefficient expansions when trying to solve the pattern.

The bigger problem with the proposal is that if that feature existed and was used for this query it still would not have prevented the query from incorrectly modifying the graph. It may have reduced the amount of data incorrectly modified (only matching to paths where position was a :Position labeled node), but it still would have resulted in incorrect and undesired behavior.


Instead of your proposed change, which would not have prevented erroneous changes to your data, maybe there's something we could consider for the OPTIONAL MATCH pattern.

In most cases (though not all), OPTIONAL MATCH is meant to extend from an existing variable in scope. The problem you ran into is that you used a new position variable, instead of the p variable you wanted, and no other part of the pattern uses an existing variable (nothing was connected to anything you did prior).

It could be useful to have some kind of warning in place to warn or error if an OPTIONAL MATCH pattern is used when it doesn't reuse an in-scope variable (when it's not extending from anything you matched on previously). That would have stopped you at the right point, and prevented inadvertently performing incorrect changes.

That said, it wouldn't make sense to extend that kind of behavior to regular MATCH patterns, as these are regularly performing new lookups of paths that are not connected to previously matched nodes or relationships.

Hi Andrew, thanks for your quick reply. Yes maybe I couldn't make it clear enough. Your statement "when it doesn't reuse an in-scope variable" is exactly what I meant. But we would also apply it to MATCH statements (except the first one in a query) as we often also use subsequent MATCH patterns to extend from an existing variable where we know or expect that those paths exists. E.g. in my example, when we know each position has HAS_ROLE relations we would have used MATCH instead of OPTIONAL MATCH but still had the same risk.

When implemented with a switch in settings, users could choose if they like to prevent. And as said, a pattern like (variable:) could be use to tell that it's intended.

But a warning wouldn't help, as we don't see those warnings in our C# environment using Neo4jClient.

Meanwhile, we found a neat workaround: In Neo4jClient, we added an OnOperationCompleted method to the GraphClient that checks each query with RegEx in debug/test mode for such unwanted variable usage. So we are at least notified during the tests. Unfortunately, this only happens after query execution, but at least before release for production.