Triggers -- matching vs filtering on trigger criteria

(Neo4j) #1

I read David Allen's blog post: Streaming Graph Loading with Neo4j and APOC Triggers and made some comments there. He eventually directed me here.

The premise in his post is that incoming IOT data records gets inserted directly into Neo4j. When an IOT record arrives it gets inserted into Neo4j as a node that is never intended to be stored as such. Instead a trigger recognises it by its signature label then extracts the data and enriches the graph with nodes and relationships derived from it, and discards the original node.

Incoming IOT record inserts might look like these:

CREATE (:FriendRecord { p1: "David", p2: "Mark" });
CREATE (:FriendRecord { p1: "David", p2: "Susan" });
CREATE (:FriendRecord { p1: "Bob", p2: "Susan" });

I added a WHERE clause to David's example and here it is:

CALL apoc.trigger.add('loadFriendRecords',
   UNWIND apoc.trigger.nodesByLabel({assignedLabels}, 'FriendRecord') AS iotnode
   WITH iotnode 
     WHERE iotnode.p1 = 'David'
       AND (:Person {name:iotnode.p1})-[:FRIENDS]-(:Person {name:'Mark'})
   MERGE (p1:Person { name: iotnode.p1 })
   MERGE (p2:Person { name: iotnode.p2 })
   MERGE (p1)-[f:FRIENDS]->(p2)
   DETACH DELETE iotnode
   RETURN p1,f,p2
 { phase: 'after' })


1.) In this transaction we take an incoming event stream data record, stuff its contents into a newly created node--basically a temporary node--use it to create some proper graph objects, and then delete the temporary node. All of this happens in miliseconds, inside of a transaction. Do these intermediate nodes ever actually get written to disk or do the exist only in memory?

2.) If I will be using a specific node property only for the purpose of being matched by a trigger, and will always immediately dispose of the node, is there any reason to index that property? Would it make the trigger any more efficient?

3.) The method apoc.trigger.nodesByLabel appears to be plural for 'nodes'. Will this function ever return more than a single node?

4.) In the example above it looks like there is initially 'match' action for each individual new node of type 'FriendRecord', then in the WHERE clause it's a filtering action. Is there any benefit to making more of the filtering happen before the WHERE it even possible?

5.) Are there more examples outside the documentation showing useage of triggers?

(M. David Allen) #2

Some answers.

First -- on the "temporary node" strategy, with this blog post I wanted to show how you could implement a certain pattern. You don't have to have temporary nodes. The data could come in, in the right format from the beginning, but often this won't be an option so the temporary node strategy shows how you can make it work with anything that shows up, effectively doing the ETL bit in cypher.

Question 1 --- yes, it's going to get written to disk. Neo4j guarantees durability of transactions that succeed. That means (in part) that a transaction can't succeed until it's confirmed written to disk. In this way, if you were to yank the power cable out of the machine, you couldn't lose data about a transaction that worked. Were Neo4j not to behave this way, there'd be scenarios where a TX could succeed and you'd still lose data which is very bad.

Question 2 -- indexes are to speed lookups. So if in your transformation cypher you don't need to look something up by a property value (on the temp node) then there's no need for an index, in fact it's unnecessary extra overhead.

Question 3 -- yes. In the case where someone elsewhere in the DB committed a TX with lots of nodes all sharing a label, you'd get more than one at a time. If the system that's writing the temp nodes into your neo4j is batching them, then getting more than one at a time would be the norm.

Question 4 -- you can have as many filters setup as you want ahead of time. In general the more you filter what youre processing the better you'll perform just because the DB has to look at less / do less work. So by using the WITH clause in cypher you can filter things down ** before ** passing on to a later query processing step. For example:

MATCH (f:Foo)
WHERE f.x = 1
MATCH (f)-[:whatever]->(b:Bar)

This is sort of akin to hinting to the cypher execution engine that the f.x filtering should happen before the relationship traversal. Now, usually that would happen anyway, and there wouldn't be an advantage. But sometimes if the way the query gets planned isn't optimal and you know better, you can modify the query in this kind of a way to help cypher find the most efficient path. Unless you're dealing with lots of data, this may not matter, and in this context, this trick I'm pointing out is unlikely to provide a big win. 9 times out of 10 it's best to express your cypher query in a way that's clear, clean and logical to you, and let cypher worry about the best way to execute it.

Question 5 - not sure off hand, but if you search around for APOC and triggers you're bound to find some, particularly probably on stackoverflow.

(Neo4j) #3

Thanks David for nailing every question! The sample script in answer 4 is really helpful. I know that 'premature optimisation is the worst evil' but as a longtime database person I can't not think about the damage I'm causing with a run-away query. And so if I can help the execution engine avoid that last traversal step unless it's useful I will.

You've piqued my interest with answer #1. This may seem like special use-case, where the incoming node is immediately dissected and discarded. But I expect it will be quite common since triggers require a CRUD action. You basically have to start by creating some node, in order to get a trigger to fine. If the end output of the transaction does not include writing that 'intermediate' node to disk, it would be ideal if it actually never writes it to disk. In my own logic that doesn't even violate acid compliance. So far, in my experiments, I have not found any intermediate nodes floating around during my hacking around on triggers. Which makes me think (and hope) that in the all-or-nothing assessment that happens with every transaction, that these intermediate nodes get rolled back upon failure. I suppose it does not change how i will code, but it could be a consideration in the engineering of Neo4j for efficiency.

Thanks for your help.

(M. David Allen) #4

So on question 1 -- there are a lot of other options that don't require an intermediate node written to disk. As I was mentioning earlier that blog post was about showing what's possible with pure cypher when you can't control the outside environment. But off of the top of my head, there are other options.

  1. If the incoming message is coming from kafka, you can use neo4j-streams to process it, which uses a different cypher approach which may be easier and doesn't involve intermediate nodes.
  2. You can structure the message to be friendly as an input to begin with so that intermediate nodes aren't necessary.
  3. You can use an intermediate ETL layer (Pentaho Kettle, Google Cloud Dataflow, Amazon Kinesis, there are many) to transform data and insert into neo4j

The "intermediate node" approach takes a very restricted set of assumptions and shows how to work with them, but if you relax some of those assumptions much more is possible.

(Neo4j) #5

Again, thanks David. In a 24 hour period have answered more of my Neo4j questions Stack Overflow ever has. Hurah!
And now I have some homework to do. :slight_smile: