Deleting Duplicated Relationships Based on Date

carib · February 14, 2023, 6:10pm

Hello,

What is the most efficient way to delete a pair of relationships where all of the non-date properties are the same except for the date time property? I would like to keep the relationship with the older date between the two, and delete the relationship with the newer date.

I currently have this query, but it's not really working because the apoc.coll.max() function returns the older date twice. When I schedule the query below on a daily basis, I see that only the relationships with the most recent dates remain which is not what I want.

//First find pairs of duplicated relationships where all of the properties are the same except the date.
MATCH(m:Machine)-[r1:ACTIVITY]->(p:Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE (r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id) AND   r1.date_activity <> r2.date_activity
//With function to collect the pairs of dates and get the most recent date
//which is what I want to delete
WITH m,
    apoc.coll.max([r1.date_activity, r2.date_activity]) AS most_recent_date,
    p,
r2 AS Machine_Activity
//Now this finds the relationship that needs to be deleted
//This does not seem to be working as this deletes the older relationships which is not what I want
// I want to delete the relationships with the most recent dates
MATCH(m)-[r:ACTIVITY]->(p)
WHERE r.date_activity = most_recent_date, r.work_id=Machine_Activity.workd_id, r.storage_id=Machine_Activity.storage_id, r.device_id=Machine_Activity.device_id
DELETE r

Can someone please give me some suggestions on improving this query?

Thank you

glilienfield · February 14, 2023, 8:23pm

Are you using actual date types for the date_activity properties? I found the max function works with actual dates:

with date('2024-01-01') as r1, date('2023-01-01') as r2
return apoc.coll.max([r1, r2])

I tested this with actual date types.

MATCH(m:Machine)-[r1:ACTIVITY]->(p:Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE (r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id) AND   r1.date_activity <> r2.date_activity
WITH {r:r1, date:r1.date_activity} as r1_data, {r:r2, date:r2.date_activity} as r2_data
WITH CASE WHEN r1_data.date > r2_data.date THEN r1_data.r ELSE r2_data.r END as r_to_delete
return r_to_delete

carib · February 16, 2023, 8:14pm

I like the approach of using case statements. It's a lot shorter. hmm I think I will go with this. I'll test over a couple of daily updates before I mark this as the solution.

I was wondering I could impose a restriction on the WHERE statement to have cypher always check that the date on R2 is greater than the date on R1. I was thinking of something like this:

// Reason why I don't specify the Node Name below on the incoming relationships on
// Generic Process is because the machine has activity
// on 4 different Processes with different Node labels but equal property names
MATCH(m:Machine)-[r1:ACTIVITY]->(Generic_Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE (r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id) AND   r1.date_activity < r2.date_activity
WITH
            m.machine_id AS machine_id,
            Generic_Process.process_id AS process_id,
            r2 AS Duplicated_Record
MATCH(:Machine {machine_id: machine_id}-[r:ACTIVITY]->(p {process_id: process_id})
WHERE
r.date_activity = Duplicated_Record.date_activity, r.work_id=Duplicated_Record.workd_id, r.storage_id=Duplicated_Record.storage_id, r.device_id=Duplicated_Record.device_id

But your approach seems to be a whole lot quicker!

carib · February 16, 2023, 8:20pm

And how does cypher know between r1 and r2 which r to assign to r_to_delete? I'm guessing that's based on the previous CASE statement but I can't see how either r1_data or r2_data gets assign to r_to_delete after the CASE WHEN statement because you're comparing dates on those CASE statements.

glilienfield · February 16, 2023, 9:03pm

You made a good observation in your post about adding a condition to the 'where' clause to return the path where r1.date_activity >= r2.data_activity. This should work because you will get two paths in the result, one path as m-r1->p<-r2-m and one path as m-r2->p<-r1-m. Instead of figuring out after which data_activity is greater and deleting it, we will just filter to get the path where r1 is larger and delete it. In fact, the solution with the CASE statement results in two rows with the same relationship, which then calls delete twice.

To answer your question, notice that the value returned from the CASE statement is rx_data.r, where the property 'r' from the rx_data map is the relationship. That is how it knows which relationship to delete.

My version may also be faster because I removed the second match. It didn't seem necessary since we already have the actual relationship to delete.

Try this version:

MATCH(m:Machine)-[r1:ACTIVITY]->(p:Process)<-[r2:ACTIVITY]-(m:Machine)
WHERE r1.date_activity >= r2.date_activity AND r1.work_id=r2.work_id AND r1.storage_id=r2.storage_id AND r1.device_id=r2.device_id
DELETE r1

Topic		Replies	Views
Deleting older relationships Cypher performance	10	799	June 14, 2020
Optimization of Delete quey Cypher apoc , performance	1	192	November 6, 2023
Delete duplicate relations Cypher	9	5970	October 6, 2021
Cannot delete node<id>, because it still has relationships. To delete this node, you must first delete its relationships Cypher apoc	5	2417	August 30, 2021
Delete a realtionship if exists and create new one Cypher cypher , graphql	2	1462	December 10, 2021

Get Certified in June!

Deleting Duplicated Relationships Based on Date

Related topics