Hi there,
I am very new to Neo4j, coming from an Oracle background. I have an application with two instances running, both connecting to a single Neo4j database. There is a scheduled job, timed to go off at the same time every day, that goes to the database and queries for work it needs to do.
Needless to say, I want to be sure only one of these instances will carry out this work. The work involves querying the database and updating a property on a node, plus some other tasks that are non-DB related.
In Oracle, I would do a SELECT FOR UPDATE to lock the rows that will be updated, so the other instance would not be able to update those rows. Is there something similar in Neo4j?
Regards
There isn't an exact equivalent, but Neo4j is an ACID database. If you do a CREATE or MERGE operation to modify data in the database, the database takes locks so that no one else can modify that data at the same time. If someone did try, they'd fail, because the data is locked for the duration of that transaction.
In Neo4j there's also the idea of causal chaining: every transaction can return a "bookmark", and clients can make queries as of some bookmark state to ensure they've read everything up to that point.
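If it helps to see the mechanics, here's a minimal sketch using the Neo4j Java driver (4.x API; the connection details and the Counter label are placeholders of mine, not anything from this thread). The first session's committed write produces a bookmark, and a second session opened with that bookmark is guaranteed to read state at least as new as that write:

import org.neo4j.driver.*;

public class BookmarkSketch {
    public static void main(String[] args) {
        // Placeholder connection details
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));

        Bookmark bookmark;
        try (Session session = driver.session()) {
            // The committed write advances this session's bookmark
            session.writeTransaction(tx -> tx.run(
                    "MERGE (c:Counter {name: 'jobs'}) " +
                    "SET c.value = coalesce(c.value, 0) + 1").consume());
            bookmark = session.lastBookmark();
        }

        // A session opened with that bookmark cannot read state older than the write
        try (Session session = driver.session(
                SessionConfig.builder().withBookmarks(bookmark).build())) {
            long value = session.readTransaction(tx -> tx.run(
                    "MATCH (c:Counter {name: 'jobs'}) RETURN c.value AS v")
                    .single().get("v").asLong());
            System.out.println("read-your-write value: " + value);
        }

        driver.close();
    }
}

If your two application instances shared that bookmark (however they coordinate), each would at least see the other's writes. Note this buys read-your-writes visibility, not mutual exclusion.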
Not sure that is my solution; I've been trying so many different ways, but nothing seems to work.
What I have is two instances of my application running.
Each one runs a Cypher query on a scheduled job to grab data from the DB and send it off to a service. The problem is that each instance grabs the data and sends it off, so the service receives duplicate data.
I have a flag that is set to true when the data is fetched from the DB.
So my query is along the lines of:
MATCH (n)
WHERE n.sentData = false
SET n.sentData = true
RETURN n
I was hoping that the SET in the MATCH query would lock the data, but that's not the case: the two instances grab the data at the same time.
In Oracle, for example, I would do a SELECT FOR UPDATE; one instance would grab the lock during the select, and when doing the update that instance would succeed while the other one would throw an exception.
Neo4j is causally consistent, which means that if you cause a change you can easily read your own writes. Full description is here: Introduction - Operations Manual
The way this is done is via bookmarks. So the reason your two separate applications are probably not reading one another's writes is that they are not coordinated around a common bookmark. When you modify the node and set the flag, it's probably locked for writing, but not locked for reading.
The best solution I can offer is to coordinate the bookmarks.
The problem is that the lock is taken in the SET, but that happens after your read of the property, so it allows a race condition.
I'd advise a double-checked locking pattern.
MATCH to your nodes as before with your WHERE clause (but please use labels, and possibly an index lookup; you don't want to do all-nodes scans). Then lock the nodes: you can use apoc.lock.nodes() if you're using APOC Procedures, or otherwise add and remove a label to initiate a lock. Then double-check the condition by repeating the WHERE clause; this time it happens after you've locked the node, so you can be absolutely certain the condition can't change. Then set the data.
Something like:
MATCH (n)
WHERE n.sentData = false
CALL apoc.lock.nodes([n]) // now you have mutual exclusion
WITH n
WHERE n.sentData = false // double check
SET n.sentData = true
RETURN n
The reason to do the check twice is that locking nodes you don't intend to do anything with can be problematic, and you don't want to lock all the nodes in your database (or even a moderate subset) when you have no idea whether they need to be updated or not. So you check for those that need an update, lock those, make sure no other transaction has gotten to them first, and then handle the ones that still need the update.
I don't believe we block except for write locks in Cypher (though one exception might be schema locks). You would still need to double check after the lock.
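In case you're not running APOC, here's a sketch of the "add and remove a label" variant mentioned above, run through the Java driver (the _Lock label name is invented here; and use a real label on the MATCH in practice, as noted above). The trick is that any label write takes the node's write lock, and locks are held until the transaction commits:

import java.util.List;
import org.neo4j.driver.*;

public class LabelLockSketch {
    // Double-checked locking via a label write instead of apoc.lock.nodes().
    static List<Record> claimUnsent(Driver driver) {
        try (Session session = driver.session()) {
            return session.writeTransaction(tx -> tx.run(
                    "MATCH (n) WHERE n.sentData = false " +
                    "SET n:_Lock " +      // acquires the node's write lock
                    "REMOVE n:_Lock " +   // label itself isn't needed; the lock stays until commit
                    "WITH n WHERE n.sentData = false " +  // double check under the lock
                    "SET n.sentData = true " +
                    "RETURN n").list());
        }
    }
}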
This is very interesting. I was assuming that Neo4j used optimistic concurrency control, so that a transaction would not succeed if the underlying data changed since it was read. It looks like my assumption is wrong, based on the recommendation to use double-checked locking. Does this mean this pattern has to be followed for all updates that are based on query criteria, to ensure that the results still meet the query criteria once locked?
If you're anticipating concurrent writes that could result in race conditions, you should probably lock and double check to be safe.
Keep in mind also that we don't know the criteria by which you are selecting a node, and whether that's just one step in a path to get to the nodes of interest, or some criteria that must hold true throughout the update.
Consider that a pattern does not have to place a limit on the nodes in the pattern, or on the criteria applied to the nodes in the pattern on the way to the nodes you want to update. Consider that procedures can be used for pathfinding to find nodes as well, and may not be limited by distance. Consider also that we can use as many MATCHes or procedure calls as we need in a query before we reach a point where we need to write data. If the requirement was to lock all nodes touched in evaluating a MATCH pattern (or all the MATCH patterns or similar we've executed up until that point) until they were filtered out or until the query committed, our throughput would be terrible due to massive lock contention, and we'd have deadlocks all the time.
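One practical footnote, offered as my understanding of the driver rather than anything specific to this thread: when writers do collide on locks, Neo4j's deadlock detection terminates one transaction with a transient error, and the Java driver's transaction functions (writeTransaction, as opposed to an auto-commit session.run) retry transient failures automatically. A sketch reusing the locking query from earlier:

import org.neo4j.driver.*;

public class RetrySketch {
    // Transaction functions are retried by the driver on transient errors,
    // which covers deadlock-detection terminations; an auto-commit
    // session.run() would surface the error to the caller instead.
    static void markUnsentAsSent(Driver driver) {
        try (Session session = driver.session()) {
            session.writeTransaction(tx -> tx.run(
                    "MATCH (n) WHERE n.sentData = false " +
                    "CALL apoc.lock.nodes([n]) " +
                    "WITH n WHERE n.sentData = false " +
                    "SET n.sentData = true").consume());
        }
    }
}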
This makes a lot of sense. I guess I was thinking of something like Hibernate or SND, where there is a version that can be used to detect a change in the entity when it is persisted on commit. I realize this type of facility is not natively available. Your suggestion makes sense if there is a high expectation of concurrent reads and updates.