What is the most efficient way to delete a pair of relationships where all of the non-date properties are the same except for the date time property? I would like to keep the relationship with the older date between the two, and delete the relationship with the newer date.
I currently have this query, but it's not really working because the apoc.coll.max() function returns the older date twice. When I schedule the query below on a daily basis, I see that only the relationships with the most recent dates remain which is not what I want.
//First find pairs of duplicated relationships where all of the properties are the same except the date.
MATCH(m:Machine)-[r1:ACTIVITY]->(p:Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE (r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id) AND r1.date_activity <> r2.date_activity
//With function to collect the pairs of dates and get the most recent date
//which is what I want to delete
WITH m,
apoc.coll.max([r1.date_activity, r2.date_activity]) AS most_recent_date,
p,
r2 AS Machine_Activity
//Now this finds the relationship that needs to be deleted
//This does not seem to be working as this deletes the older relationships which is not what I want
// I want to delete the relationships with the most recent dates
MATCH(m)-[r:ACTIVITY]->(p)
WHERE r.date_activity = most_recent_date, r.work_id=Machine_Activity.workd_id, r.storage_id=Machine_Activity.storage_id, r.device_id=Machine_Activity.device_id
DELETE r
Can someone please give me some suggestions on improving this query?
Are you using actual date types for the date_activity properties? I found the max function works with actual dates:
with date('2024-01-01') as r1, date('2023-01-01') as r2
return apoc.coll.max([r1, r2])
I tested this with actual date types.
MATCH(m:Machine)-[r1:ACTIVITY]->(p:Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE (r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id) AND r1.date_activity <> r2.date_activity
WITH {r:r1, date:r1.date_activity} as r1_data, {r:r2, date:r2.date_activity} as r2_data
WITH CASE WHEN r1_data.date > r2_data.date THEN r1_data.r ELSE r2_data.r END as r_to_delete
return r_to_delete
I like the approach of using case statements. It's a lot shorter. hmm I think I will go with this. I'll test over a couple of daily updates before I mark this as the solution.
I was wondering I could impose a restriction on the WHERE statement to have cypher always check that the date on R2 is greater than the date on R1. I was thinking of something like this:
// Reason why I don't specify the Node Name below on the incoming relationships on
// Generic Process is because the machine has activity
// on 4 different Processes with different Node labels but equal property names
MATCH(m:Machine)-[r1:ACTIVITY]->(Generic_Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE (r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id) AND r1.date_activity < r2.date_activity
WITH
m.machine_id AS machine_id,
Generic_Process.process_id AS process_id,
r2 AS Duplicated_Record
MATCH(:Machine {machine_id: machine_id}-[r:ACTIVITY]->(p {process_id: process_id})
WHERE
r.date_activity = Duplicated_Record.date_activity, r.work_id=Duplicated_Record.workd_id, r.storage_id=Duplicated_Record.storage_id, r.device_id=Duplicated_Record.device_id
But your approach seems to be a whole lot quicker!
And how does cypher know between r1 and r2 which r to assign to r_to_delete? I'm guessing that's based on the previous CASE statement but I can't see how either r1_data or r2_data gets assign to r_to_delete after the CASE WHEN statement because you're comparing dates on those CASE statements.
You made a good observation in your post about adding a condition to the 'where' clause to return the path where r1.date_activity >= r2.data_activity. This should work because you will get two paths in the result, one path as m-r1->p<-r2-m and one path as m-r2->p<-r1-m. Instead of figuring out after which data_activity is greater and deleting it, we will just filter to get the path where r1 is larger and delete it. In fact, the solution with the CASE statement results in two rows with the same relationship, which then calls delete twice.
To answer your question, notice that the value returned from the CASE statement is rx_data.r, where the property 'r' from the rx_data map is the relationship. That is how it knows which relationship to delete.
My version may also be faster because I removed the second match. It didn't seem necessary since we already have the actual relationship to delete.
Try this version:
MATCH(m:Machine)-[r1:ACTIVITY]->(p:Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE r1.date_activity >= r2.date_activity AND r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id
DELETE r1