How to avoid duplicate nodes with "empty space" or "null"

Hello everyone, my name is Hanis and I am new with Neo4j.

Recently I try to build a project with the bank transaction data. Due to duplicating the real life situation, I leave some of the data in my csv blank to create missing data. Currently my code is using MERGE as shown below. However, the problem arise when the "Unknown" in my code to handle missing data make wrong relationship connection.

LOAD CSV WITH HEADERS FROM "file:///Dummy001v3.csv" AS row
WITH row LIMIT 15

MERGE (a:PrimeCIF{cifNum:coalesce(row.PrimeCIFNumbe, "Unknown")})
MERGE (b:PrimeAccount{accNum:coalesce(row.PrimeAccountNumber, "Unknown") })
MERGE (c:PrimeName{name:coalesce(row.PrimeClientName, "Unknown") })
MERGE (d:PrimeIDReference{idRef:coalesce(row.PrimeIDReference, "Unknown") })
MERGE (e:PrimeHAddress{houseAdd:coalesce(row.PrimeHouseAddress, "Unknown") })
MERGE (f:PrimeOAddress{officeAdd:coalesce (row.PrimeOfficeAddress, "Unknown") })
MERGE (g:PrimeCountry{country:coalesce(row.PrimeClientsCountry, "Unknown") })
MERGE (h:PrimePhoneNumber{mobileNum:coalesce(row.PrimeMobileNumber, "Unknown") })
MERGE (i:PrimeHouseNumber{houseNum:coalesce(row.PrimeHouseNumber, "Unknown") })
MERGE (j:PrimeOfficeNumber{officeNum:coalesce(row.PrimeOfficeNumber, "Unknown") })
MERGE (k:PrimeEmail{email:coalesce(row.PrimeEmailAddress, "Unknown") })
MERGE (l:PrimeRisk{riskLvl:coalesce(row.PrimeRiskLevel, "Unknown") })

MERGE (a)-[:RELATED]->(c)<-[:RELATED]-(b)
MERGE (d)-[:RELATED]->(c)<-[:RELATED]-(e)
MERGE (f)-[:RELATED]->(c)<-[:RELATED]-(g)
MERGE (h)-[:RELATED]->(c)<-[:RELATED]-(i)
MERGE (j)-[:RELATED]->(c)<-[:RELATED]-(k)
MERGE (l)-[:RELATED]->(c)

There will be a data repetition (ex: in term of "PrimeName" where they got more than one account) which make using CREATE is not possible (I want to avoid)

Well, the merge is going to result in just one node with the ‘unknown’ value for each of the node types you define. As such, you are going to get relationships from data in different rows pointing to these unknown nodes.

You could use create instead of merge if the data in each row is independent of the other row, thus creating makes sense. The draw back is the script will keep generating the same results if run multiple times. It’s not a problem, but something to be aware of. When running multiple times during testing, you will just need to delete previous data before each test.

I do question your data model though. Do you really want each attribute of a single account to be separate entities? Would a model that has ‘Account’ and ‘People’ entities work for your scenario? You would split the row values among these two entities and relate them with something like a ‘HAS_ACCOUNT’ relationship from a person entity to an account entity. If you want to track multiple people sharing an address, you could break that out to a separate Address entity. I guess the same for email and phone number if looking for groups of people sharing these is important for your analysis. Anyways, it seems you could group the attributes into larger entities that represent your data and facilitate your analysis.

Is it a real possibility that an account would not know the name of the account holder nor the account number. Simulating some of the attributes to have null values seem counter intuitive.

1 Like

Thanks a lot for your input, Gary Lilienfield.

Yes, my aim is to track if there is multiple people sharing the address, email, phone number and etc. I would like to build something that can visualize the data and to find if there is any connection between them.

Using CREATE is not a good idea as in real life, a PERSON can have more then one ACCOUNT at a time.

In your opinion is it okay to just add the "DELETE" command for all the "Unknown" node after done run the existing command? What will be the drawback on doing this?

cypher doesn't have the equivalent of an 'if' operation, so you can't conditionally create only the nodes that are not null. I was able to figure out a solution that leverages the 'call subquery' feature to conditionally create only the non-null nodes and relationships between them. It is a little complex and may not be what you want for your solution. You would have to have around 23 of the 'call' clauses to address each of your node types and the corresponding relationships. I included the code just in case and it is interesting as well.

I don't see a problem deleting the 'unknown' nodes afterwards. Just remember to use DETACH DELETE instead of DELETE, as the 'unknown' nodes will have relationships to other nodes.

load csv with headers from "file:///Book1.csv" as row
with row.A as a, row.B as b, row.C as c
call{
with a
with a
where a is not null
merge(x:A {id: a}) on create set x.id = a
}
call{
with c
with c
where c is not null
merge(x:C {id: c}) on create set x.id = c
}
call{
with b
with b
where b is not null
merge(x:B {id: b}) on create set x.id = b
}
call {
with a, b
with a, b
where a is not null and b is not null
match (x:A {id: a})
match (y:B {id: b})
merge (x)-[:REL]->(y)
}
call {
with c, b
with c, b
where c is not null and b is not null
match (x:C {id: c})
match (y:B {id: b})
merge (x)-[:REL]->(y)
}

Screen Shot 2022-03-28 at 11.05.17 AM

1 Like

The code that you give is interesting. I try it on testing data (with 2 variable only) and it work wonderful. I will try use it using all my variable and see how it react!

Thanks again for your help!