Deleting / Merging Duplicate Connected Nodes

exb · August 28, 2022, 6:16pm

Hello everyone,

I have some data in the form:

node:Account
{
  "address": "mark.com"
}

relationship:FUNCTIONCALL
{
  "date": "Tue Jan 18 2022 15:40:45 GMT+0000 (Greenwich Mean Time)",
  "amount": 1,
  "id": "unique_id",
  "outcome": "Success",
  "timestamp": 1642520445191736600.0
}

The idea is that an account should always have a unique address property, and a FUNCTIONCALL relationship between two nodes will always have a unique id property.

To create an account I use MERGE, by declaring a variable with a unique name, and setting its properties:

MERGE (encrypted_mark_address: Account {address: 'mark.com'})

MERGE (encrypted_ted_address: Account {address: 'ted.com'})

I then create a relationship between the two, again using MERGE:

MATCH (a1: Account {address: 'mark.com'}) 
MATCH (a2: Account {address: 'ted.com'}) 
MERGE(a1)-[unique_id:FUNCTIONCALL {id:'unique_id'}]->(a2) 
ON CREATE SET 
   unique_id.date="Tue Jan 18 2022 15:40:45 GMT+0000 (Greenwich Mean Time)", 
   unique_id.timestamp=1642520445191736600, 
   unique_id.amount=1, 
   unique_id.outcome="Success"

However, after running my code I end up with duplicate data, as in the following example (real data).

Screenshot 2022-08-28 at 13.24.45.png

The three central nodes are the same, and so are the 5 surrounding nodes.
Ideally, I should have this instead:

Screenshot 2022-08-28 at 13.29.06.png

My questions are the following:

How do I remove / merge the duplicates without messing up the data?
How do I change my code to create the data, such that this duplicate situation does not happen again?

Thanks in advance!

glilienfield · August 28, 2022, 8:23pm

Your approach seems correct. Can you provide the full query / driver code and some of that data?

exb · August 28, 2022, 10:27pm

These are essentially the full queries that I am running to create the data. They are called from JavaScript, and the relationship has a few additional properties, but what I included should be representative of the actual query. The same applies to the data; I will attach what you asked for below!

JS code to create the nodes:

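async function insert_address(address) {
  const session = driver.session()
  try {
    // fma(address) produces the Cypher variable name used for this node
    const res = await session.run('MERGE (' + fma(address) + ': Account {address: $address}) RETURN ' + fma(address), {
      address: address
    })
  }
  catch (err) {
    console.error(err)
  }
  finally {
    // Always release the session, even if the query throws
    await session.close()
  }
}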

JS relationship creation code:

async function insert_relationship(address1, address2, amount, txId, date, timestamp, outcome) {
  const session = driver.session()
  // fmtx(txId) produces the Cypher variable name used for this relationship
  const rel = fmtx(txId)
  try {
    const res = await session.run(
      'MATCH (a1: Account {address: $address1}) ' +
      'MATCH (a2: Account {address: $address2}) ' +
      'MERGE (a1)-[' + rel + ':FUNCTIONCALL {id: $txId}]->(a2) ' +
      'ON CREATE SET ' + rel + '.date=$date, ' + rel + '.timestamp=$timestamp, ' +
      rel + '.amount=$amount, ' + rel + '.outcome=$outcome ' +
      'RETURN ' + rel, {
      address1: address1,
      address2: address2,
      txId: txId,
      amount: amount,
      date: date,
      timestamp: timestamp,
      outcome: outcome
    })
  }
  catch (err) {
    console.error(err)
  }
  finally {
    // Always release the session, even if the query throws
    await session.close()
  }
}
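A side note on the two functions above: fma and fmtx appear to be used only to generate unique Cypher variable names. Cypher variables are scoped to a single query, so a fixed name such as r works just as well and lets the whole statement be one constant string with parameters. A minimal sketch, assuming the same parameter names as above; buildRelationshipQuery is a hypothetical helper for illustration, not part of the Neo4j driver:

```javascript
// Constant query text: the relationship variable is always `r`,
// and every value is passed as a parameter.
const RELATIONSHIP_QUERY = [
  'MATCH (a1:Account {address: $address1})',
  'MATCH (a2:Account {address: $address2})',
  'MERGE (a1)-[r:FUNCTIONCALL {id: $txId}]->(a2)',
  'ON CREATE SET r.date = $date, r.timestamp = $timestamp,',
  '              r.amount = $amount, r.outcome = $outcome',
  'RETURN r'
].join('\n')

// Hypothetical helper: returns the text and parameters for session.run().
function buildRelationshipQuery(address1, address2, amount, txId, date, timestamp, outcome) {
  return {
    text: RELATIONSHIP_QUERY,
    params: { address1, address2, amount, txId, date, timestamp, outcome }
  }
}

// Usage with the placeholder values from the first post.
const q = buildRelationshipQuery('mark.com', 'ted.com', 1, 'unique_id',
  'Tue Jan 18 2022 15:40:45 GMT+0000 (Greenwich Mean Time)', 1642520445191736600, 'Success')
console.log(q.text)
```

The result would be passed to the driver as session.run(q.text, q.params). This does not by itself explain the duplicates, but it removes string concatenation from the query and makes the statement easier to inspect.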

Actual data:

Query:

MATCH (n:Account {address: "mark.com"}) 
RETURN n

Result:

{
  "identity": 2402,
  "labels": [
    "Account"
  ],
  "properties": {
    "address": "mark.com"
  }
}

Sample relationship query:

MATCH (n:Account {address: "mark.com"})-[r]->(b:Account) 
WITH DISTINCT r 
RETURN r 
ORDER BY r.timestamp

Result:

{
  "identity": 11908,
  "start": 2402,
  "end": 1961,
  "type": "FUNCTIONCALL",
  "properties": {
    "date": "Tue Jan 18 2022 15:33:58 GMT+0000 (Greenwich Mean Time)",
    "amount": 5.02,
    "id": "LePwsL6czmCtfFB3VtPzpmmB7MpVNEMUtaBLw32R2cW",
    "outcome": "Success",
    "timestamp": 1642520038850413800.0
  }
}

glilienfield · August 28, 2022, 10:47pm

According to your sample relationships query, 'mark.com' is only related to one other 'Account' node. Is that the full result of the query, or just the first element in a list? If it is the full result, what are all the other nodes/relationships in your screenshot?

If your database consists of only the data in the screenshot, can you provide the result of this simple query:

match(n:Account)-[r]->(m:Account)
return n, r, m

exb · August 29, 2022, 5:44pm

Unfortunately, the data is very large, and does not consist of only the data in the screenshot.
However, even without attaching all the data, I can tell you that the only values that differ between the duplicates are identity, start and end. All the properties are the same.

I will attach a link to the JSON result of the query you asked me to run, restricted to the addresses in the screenshot for nodes n and m.

Subsample of the data (with addresses from the screenshot):
https://jsonblob.com/1013866261813411840
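An exported result like this can be checked for duplicate nodes offline by grouping node identities by address. The sketch below assumes each row is an object with n, r and m keys whose nodes have the identity / labels / properties shape shown in the results above; the actual layout of the linked file may differ. findDuplicateAddresses is a hypothetical helper, and the sample rows are made up for illustration.

```javascript
// Assumption: `rows` is an array of { n, r, m } objects, where n and m look like
// { identity, labels, properties: { address } }.
function findDuplicateAddresses(rows) {
  const idsByAddress = new Map()
  for (const row of rows) {
    for (const node of [row.n, row.m]) {
      const address = node.properties.address
      if (!idsByAddress.has(address)) idsByAddress.set(address, new Set())
      idsByAddress.get(address).add(node.identity)
    }
  }
  // Keep only addresses that map to more than one node identity.
  const duplicates = {}
  for (const [address, ids] of idsByAddress) {
    if (ids.size > 1) duplicates[address] = [...ids].sort((a, b) => a - b)
  }
  return duplicates
}

// Made-up sample rows; the identities and addresses are not from the real data.
const sampleRows = [
  { n: { identity: 1, labels: ['Account'], properties: { address: 'a.com' } },
    r: { identity: 10, start: 1, end: 2, type: 'FUNCTIONCALL', properties: { id: 'tx1' } },
    m: { identity: 2, labels: ['Account'], properties: { address: 'b.com' } } },
  { n: { identity: 3, labels: ['Account'], properties: { address: 'a.com' } },
    r: { identity: 11, start: 3, end: 2, type: 'FUNCTIONCALL', properties: { id: 'tx1' } },
    m: { identity: 2, labels: ['Account'], properties: { address: 'b.com' } } }
]
console.log(findDuplicateAddresses(sampleRows))
```

Any address reported here is stored on more than one Account node, which is the situation MERGE was supposed to prevent.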

glilienfield · August 30, 2022, 4:01am

I don't see the data you posted above in the data file, such as 'mark.com' and the node with id = 2402. In any case, I noticed you have multiple Account nodes with the same address, both for 'xkcdfan.com' and for 'dragonnation.com'. These should have matched during the MERGE in your JS script that creates the nodes. Your query code looks fine. Is it possible you already had more than one of each of these nodes before you executed the script? That would explain the results. If you don't know, can you run your script on a fresh database to see if you still get the duplicates?
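For the two original questions, one possible approach (not confirmed anywhere in this thread) is to merge Account nodes that share an address, delete the duplicate relationships that remain, and then add a uniqueness constraint so that MERGE cannot create duplicates again, even when several sessions run concurrently. The sketch below assumes the APOC plugin is installed (for apoc.refactor.mergeNodes) and Neo4j 4.4 or later (for the FOR ... REQUIRE constraint syntax). The constraint name account_address_unique and the helper runCleanup are made up for illustration. Take a backup before running anything like this on real data.

```javascript
// Step 1 (assumes APOC): collapse Account nodes that share an address into one node.
// Relationships of the merged nodes are moved onto the surviving node.
const MERGE_DUPLICATE_NODES = [
  'MATCH (a:Account)',
  'WITH a.address AS address, collect(a) AS nodes',
  'WHERE size(nodes) > 1',
  "CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard'})",
  'YIELD node',
  'RETURN count(node) AS merged'
].join('\n')

// Step 2: between the same pair of accounts, keep one FUNCTIONCALL per id.
const DELETE_DUPLICATE_RELS = [
  'MATCH (a:Account)-[r:FUNCTIONCALL]->(b:Account)',
  'WITH a, b, r.id AS id, collect(r) AS rels',
  'WHERE size(rels) > 1',
  'FOREACH (dup IN tail(rels) | DELETE dup)'
].join('\n')

// Step 3 (assumes Neo4j 4.4+): fails if duplicates still exist, so it runs last.
const CREATE_CONSTRAINT =
  'CREATE CONSTRAINT account_address_unique IF NOT EXISTS ' +
  'FOR (a:Account) REQUIRE a.address IS UNIQUE'

const CLEANUP_STEPS = [MERGE_DUPLICATE_NODES, DELETE_DUPLICATE_RELS, CREATE_CONSTRAINT]

// Hypothetical helper: runs the steps in order on an open driver session.
async function runCleanup(session) {
  for (const query of CLEANUP_STEPS) {
    await session.run(query)
  }
}
```

The order matters: the constraint cannot be created while two nodes still share an address. Step 2 deliberately compares r.id rather than merging all relationships between two accounts, so distinct function calls between the same pair of accounts are kept.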
