Multiple LOAD CSV operations creating "duplicate" nodes

FourMoBro · July 13, 2020, 10:47pm

New user, about 2 weeks into my Neo4j journey and I have run into an issue I hope the forums can help me out. I tried searching, but no such luck. I am trying to load data into neo4j using a CSV file. I can do the loads just fine. The problem comes when I try subsequent loads of additional nodes and relationships where the data already exists. Instead of appending or reusing the existing nodes, new nodes are being created, and thus my graph is breaking. I thought that by having unique constraints on my data would help, but to no avail. So my code looks like this:

In the first load....

LOAD ... AS row
MERGE (t:AXX {id:row.table_num})
MERGE (d:DXX {id:row.doc_num})
MERGE (t)-[:RXX] -> (d)
ON CREATE SET
t.table_title = row.table_title,
etc
d.doc_title = row.doc_title
etc

Everything works fine and looks good. Now I have another csv with more nodes to create, but the csv file does not have all the "attribute" information for my "t" variable like the first csv file. So when I load it using this....

LOAD ... AS row
MERGE (t:AXX {id:row.tables}) 
MERGE (a:EXX {id:row.sys})
MERGE (t)-[:SXX] -> (a)
ON CREATE SET
a.sname = row.short,
etc

... it ends up creating new nodes for "t" instead of "merging" or "reusing" the existing "t". These new nodes don't have all the attribute information that came in the first csv load. I did read somewhere that MERGE was an "all or nothing" type deal, but how do I get around that? I don't want to add a bunch of cells to the csv file. What am I missing?

kathryn.moore · July 13, 2020, 11:41pm

You could try using the APOC procedures, something like this...
CALL apoc.merge.node(['AXX'], {id: row.tables}) YIELD node
MATCH (t:AXX {id: row:tables})
MERGE (a:EXX {id: row.sys})
MERGE (t)-[:SXX ]->(a)

FourMoBro · July 14, 2020, 1:40am

Would you use the apoc procedure during the LOAD? or after the load?
I am using the community edition and just got that procedure library loaded. I just tried using the syntax during the LOAD CSV, but it did not solve the problem. I wish there was more documentation covering this beyond https://neo4j.com/docs/labs/apoc/current/overview/

kathryn.moore · July 14, 2020, 2:07am

There is this example of importing using APOC procedures using a JSON format - Neo4j: Exporting a subset of data from one database to another | Mark Needham
You could try something similar with apoc.load.csv.

luanne · July 14, 2020, 2:48am

Hi @FourMoBro

What is row.tables? Is it the primary key of your t node?

Is it the same as row.table_num in the previous

MERGE (t:AXX {id:row.table_num})

?

If not, then you will get "duplicate" nodes.
More info about the behaviour of merge is described here Cypher MERGE Explained | GraphAware (disclaimer: I wrote it)

Let me know about the id's, happy to help

FourMoBro · July 14, 2020, 3:38am

I appreciate the timely respones @luanne and @kathryn.moore.

I will look into both of the links you sent. To answer luanne's q, yes, row.tables would be the PK in the second spreadsheet, in which the values should be found already from the first LOAD CSV operation.

I tried to mockup something in google sheets. This is the first load operation with a bunch of attributes on the "table node", and the "doc node". This works just fine.

Now I need some system application info. So the second csv file resembles this:

Now, no matter what I try, it creates new "Table" nodes instead of finding the ones from the first load. I should have 1, A001 node, with relationships to the DOCS( from the first load), and the SYSs (from the second load).

Before loading into my database, I do have a unique constraint:


CREATE CONSTRAINT UniqueTableID ON (t:AXX) ASSERT t.id IS UNIQUE

FourMoBro · July 14, 2020, 3:45am

OK, I figured it out, and I feel like such a fool.
My code was correct to begin with. I just took for granted the quality of the data within the CSV file. My "table_num" column had trailing spaces where the "tables" column did not. So there....

wirselix · August 10, 2020, 1:44pm

Hi @FourMoBro,
space characters at the beginning or end of strings is a common problem in all data processing pipelines. Sometimes also lower case versus upper case creates similar problems. That's why I think the following guideline/rule is important: when using "match" or "merge" always use the unique key constraints and prepare the value with a "trim()" statement and (if possible from semantical perspective) enforce lower case with "tolower()".
Example: "merge(p:person {name:tolower(trim('a string'))}" where "name" is the property that holds the unique key constraints

Topic		Replies	Views
Loading multiple CSV files in Neo4j with each row of each CSV being a node Import / Export load-csv	7	6125	February 24, 2022
Unique Nodes loaded from CSV file Neo4j Graph Platform migrated	2	172	June 9, 2022
Merge Nodes and Relationship Cypher cypher	14	548	July 31, 2020
Migrating Database to Neo4J as a newbie Newbie Questions apoc	12	413	June 1, 2021
Create 1 node for use during CSV loading Import / Export cypher , load-csv	3	1398	September 25, 2018

Multiple LOAD CSV operations creating "duplicate" nodes

Related topics