
Deleting / Merging Duplicate Connected Nodes

exb
Node

Hello everyone, 

I have some data in the form:

 

node:Account
{
  "address": "mark.com"
}

relationship:FUNCTIONCALL
{
  "date": "Tue Jan 18 2022 15:40:45 GMT+0000 (Greenwich Mean Time)",
  "amount": 1,
  "id": "unique_id",
  "outcome": "Success",
  "timestamp": 1642520445191736600.0
}

 

The idea is that an account should always have a unique address property, and a FUNCTIONCALL relationship between two nodes will always have a unique id property.

To create an account I use MERGE, declaring a variable with a unique name and setting its properties:

 

MERGE (encrypted_mark_address: Account {address: 'mark.com'})
MERGE (encrypted_ted_address: Account {address: 'ted.com'})

 

I then create a relationship between the two, again using MERGE:

 

MATCH (a1: Account {address: 'mark.com'}) 
MATCH (a2: Account {address: 'ted.com'}) 
MERGE(a1)-[unique_id:FUNCTIONCALL {id:'unique_id'}]->(a2) 
ON CREATE SET 
   unique_id.date="Tue Jan 18 2022 15:40:45 GMT+0000 (Greenwich Mean Time)", 
   unique_id.timestamp=1642520445191736600, 
   unique_id.amount=1, 
   unique_id.outcome="Success"

 

However, after running my code for a while I end up with duplicate data, as in the following example (real data).

Screenshot 2022-08-28 at 13.24.45.png

The three central nodes are the same, and so are the 5 surrounding nodes. 
Ideally, I should have this instead:

Screenshot 2022-08-28 at 13.29.06.png


My questions are the following:

  1. How do I remove / merge the duplicates without messing up the data? 
  2. How do I change my code to create the data, such that this duplicate situation does not happen again? 

Thanks in advance!

5 REPLIES

glilienfield
Ninja

Your approach seems correct. Can you provide the full query / driver code and some of that data? 

These are essentially the full queries that I am running to create the data. They are called from JavaScript, and there are a few additional properties on the relationship, but what I included should be representative of the actual query. The same applies to the data; I will attach what you asked for below.

JS code to create the nodes:

async function insert_address(address) {
  const session = driver.session()
  try {
    // fma(address) supplies the Cypher variable name (helper not shown)
    await session.run('MERGE (' + fma(address) + ': Account {address: $address}) RETURN ' + fma(address), {
      address: address
    })
  }
  catch (err) {
    console.error(err)
  }
  finally {
    // close the session whether or not the query succeeded
    await session.close()
  }
};

 

JS relationship creation code:

async function insert_relationship(address1, address2, amount, txId, date, timestamp, outcome) {
  const session = driver.session()
  try {
    // fmtx(txId) supplies the Cypher variable name for the relationship (helper not shown)
    await session.run('MATCH (a1: Account {address: $address1}) MATCH (a2: Account {address: $address2}) MERGE(a1)-['+ fmtx(txId)+':FUNCTIONCALL {id:$txId}]->(a2) ON CREATE SET ' + fmtx(txId)+'.date=$date, '+ fmtx(txId)+'.timestamp=$timestamp, '+ fmtx(txId)+'.amount=$amount, '+ fmtx(txId)+'.outcome=$outcome RETURN '+ fmtx(txId), {
      address1: address1,
      address2: address2,
      txId: txId,
      amount: amount,
      date: date,
      timestamp: timestamp,
      outcome: outcome
    })
  }
  catch (err) {
    console.error(err)
  }
  finally {
    // close the session whether or not the query succeeded
    await session.close()
  }
};
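
A side note on the relationship code above: Cypher variables are scoped to a single query, so the name does not need to be unique per transaction id, and a fixed name such as `r` avoids building the query by string concatenation. The following is only a sketch of that idea, not the poster's code. `buildRelationshipMerge` and `insertRelationship` are hypothetical names, and `session` is assumed to be a neo4j-driver session that the caller opens and closes.

```javascript
// Cypher variables (a1, a2, r) exist only inside one query,
// so the same fixed names can be reused for every call.
const MERGE_RELATIONSHIP = `
MATCH (a1:Account {address: $address1})
MATCH (a2:Account {address: $address2})
MERGE (a1)-[r:FUNCTIONCALL {id: $txId}]->(a2)
ON CREATE SET r.date = $date,
              r.timestamp = $timestamp,
              r.amount = $amount,
              r.outcome = $outcome
RETURN r`;

// Hypothetical helper: pairs the fixed query text with its parameters.
function buildRelationshipMerge(address1, address2, amount, txId, date, timestamp, outcome) {
  return {
    text: MERGE_RELATIONSHIP,
    params: { address1, address2, txId, amount, date, timestamp, outcome },
  };
}

// Assumption: `session` is a neo4j-driver session owned by the caller.
async function insertRelationship(session, ...args) {
  const { text, params } = buildRelationshipMerge(...args);
  return session.run(text, params);
}
```

All values still travel as parameters, so the query text is identical on every call.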

 

Actual data:

Query: 

MATCH (n:Account {address: "mark.com"}) 
RETURN n

Result:

{
  "identity": 2402,
  "labels": [
    "Account"
  ],
  "properties": {
    "address": "mark.com"
  }
}

Sample relationship query:

MATCH (n:Account {address: "mark.com"})-[r]->(b:Account) 
WITH DISTINCT r 
RETURN r 
ORDER BY r.timestamp

 

Result:

{
  "identity": 11908,
  "start": 2402,
  "end": 1961,
  "type": "FUNCTIONCALL",
  "properties": {
    "date": "Tue Jan 18 2022 15:33:58 GMT+0000 (Greenwich Mean Time)",
    "amount": 5.02,
    "id": "LePwsL6czmCtfFB3VtPzpmmB7MpVNEMUtaBLw32R2cW",
    "outcome": "Success",
    "timestamp": 1642520038850413800.0
  }
}

 

According to your sample relationships query, 'mark.com' is only related to one other 'Account' node. Is that the full result of the query, or just the first element in a list?  If it is the full result, what are all the other nodes/relationships in your screenshot? 

If your database consists of only the data in the screenshot, can you provide the data from this simple query:

match(n:Account)-[r]->(m:Account)
return n, r, m

 

Unfortunately, the data set is very large and contains far more than what is in the screenshot.
However, without attaching all of it, I can tell you that the only things that differ between the duplicates are the identity, start and end values. All the properties are the same.

Will attach a link to the JSON result running the query you asked me to run, additionally specifying the addresses in the screenshot for the nodes n and m.  

Subsample of the data (with addresses from the screenshot):
https://jsonblob.com/1013866261813411840
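
To check an export like this for nodes that share an address but differ in identity, something along the following lines could be used. It is a minimal sketch that assumes each exported row has the shape `{ n, r, m }`, with nodes carrying `identity` and `properties.address` as in the results shown earlier. `findDuplicateAccounts` is a hypothetical name.

```javascript
// Groups node identities by address and reports addresses that map to
// more than one identity, i.e. duplicate Account nodes.
// Assumption: rows look like { n: {identity, properties}, r: {...}, m: {identity, properties} }.
function findDuplicateAccounts(rows) {
  const idsByAddress = new Map();
  for (const row of rows) {
    for (const node of [row.n, row.m]) {
      const address = node.properties.address;
      if (!idsByAddress.has(address)) {
        idsByAddress.set(address, new Set());
      }
      idsByAddress.get(address).add(node.identity);
    }
  }
  const duplicates = {};
  for (const [address, ids] of idsByAddress) {
    if (ids.size > 1) {
      duplicates[address] = [...ids].sort((a, b) => a - b);
    }
  }
  return duplicates;
}
```

The result maps each duplicated address to the sorted list of node identities that carry it.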

I don't see the data you posted above in the data file, such as 'mark.com' and the node with id = 2402. In any case, I noticed you have multiple Account nodes with the same address for 'xkcdfan.com' and for 'dragonnation.com'. These should have been matched by the MERGE in your JS script that creates the nodes. Your query code looks fine. Is it possible you already had more than one of each of these nodes before you ran the script? That would explain the results. If you don't know, can you run your script on a fresh database to see whether you still get the duplicates?
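
One way to check for existing duplicates directly in the database, and to guard against new ones, is sketched below. Several things here are assumptions rather than facts from this thread: the constraint name `account_address_unique` is made up, the `FOR ... REQUIRE` syntax assumes Neo4j 4.4 or later, and `duplicateAddresses` is a hypothetical helper that expects the driver's result records. Separately from the pre-existing-data explanation above, MERGE on its own does not stop two concurrent transactions from each creating the same node, whereas a uniqueness constraint does. Whether that is what happened here is not confirmed.

```javascript
// Lists every address held by more than one Account node.
const FIND_DUPLICATE_ADDRESSES = `
MATCH (a:Account)
WITH a.address AS address, count(*) AS copies
WHERE copies > 1
RETURN address, copies
ORDER BY copies DESC`;

// Assumption: Neo4j 4.4+ syntax; the constraint name is hypothetical.
// Creating the constraint fails while duplicate addresses still exist.
const CREATE_UNIQUE_ADDRESS = `
CREATE CONSTRAINT account_address_unique IF NOT EXISTS
FOR (a:Account) REQUIRE a.address IS UNIQUE`;

// Hypothetical helper: extracts the addresses from result records,
// which expose their fields through record.get(key).
function duplicateAddresses(records) {
  return records.map((record) => record.get('address'));
}

// Assumption: `session` is a neo4j-driver session owned by the caller.
async function reportDuplicates(session) {
  const result = await session.run(FIND_DUPLICATE_ADDRESSES);
  return duplicateAddresses(result.records);
}
```

If the report comes back empty, the constraint can be created before the insert script runs again.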
