Multiple LOAD CSV operations creating "duplicate" nodes

New user, about 2 weeks into my Neo4j journey and I have run into an issue I hope the forums can help me out. I tried searching, but no such luck. I am trying to load data into neo4j using a CSV file. I can do the loads just fine. The problem comes when I try subsequent loads of additional nodes and relationships where the data already exists. Instead of appending or reusing the existing nodes, new nodes are being created, and thus my graph is breaking. I thought that by having unique constraints on my data would help, but to no avail. So my code looks like this:

In the first load....

LOAD ... AS row
MERGE (t:AXX {id:row.table_num})
MERGE (d:DXX {id:row.doc_num})
MERGE (t)-[:RXX] -> (d)
ON CREATE SET
t.table_title = row.table_title,
etc
d.doc_title = row.doc_title
etc

Everything works fine and looks good. Now I have another csv with more nodes to create, but the csv file does not have all the "attribute" information for my "t" variable like the first csv file. So when I load it using this....

LOAD ... AS row
MERGE (t:AXX {id:row.tables}) 
MERGE (a:EXX {id:row.sys})
MERGE (t)-[:SXX] -> (a)
ON CREATE SET
a.sname = row.short,
etc

... it ends up creating new nodes for "t" instead of "merging" or "reusing" the existing "t". These new nodes don't have all the attribute information that came in the first csv load. I did read somewhere that MERGE was an "all or nothing" type deal, but how do I get around that? I don't want to add a bunch of cells to the csv file. What am I missing?

You could try using the APOC procedures, something like this...
CALL apoc.merge.node(['AXX'], {id: row.tables}) YIELD node
MATCH (t:AXX {id: row:tables})
MERGE (a:EXX {id: row.sys})
MERGE (t)-[:SXX ]->(a)

Would you use the apoc procedure during the LOAD? or after the load?
I am using the community edition and just got that procedure library loaded. I just tried using the syntax during the LOAD CSV, but it did not solve the problem. I wish there was more documentation covering this beyond https://neo4j.com/docs/labs/apoc/current/overview/

There is this example of importing using APOC procedures using a JSON format - https://markhneedham.com/blog/2020/01/27/neo4j-exporting-subset-database/
You could try something similar with apoc.load.csv.

Hi @FourMoBro

What is row.tables? Is it the primary key of your t node?

Is it the same as row.table_num in the previous

MERGE (t:AXX {id:row.table_num})

?

If not, then you will get "duplicate" nodes.
More info about the behaviour of merge is described here https://graphaware.com/neo4j/2014/07/31/cypher-merge-explained.html (disclaimer: I wrote it)

Let me know about the id's, happy to help

I appreciate the timely respones @luanne and @kathryn.moore.

I will look into both of the links you sent. To answer luanne's q, yes, row.tables would be the PK in the second spreadsheet, in which the values should be found already from the first LOAD CSV operation.

I tried to mockup something in google sheets. This is the first load operation with a bunch of attributes on the "table node", and the "doc node". This works just fine.

Now I need some system application info. So the second csv file resembles this:

Now, no matter what I try, it creates new "Table" nodes instead of finding the ones from the first load. I should have 1, A001 node, with relationships to the DOCS( from the first load), and the SYSs (from the second load).

Before loading into my database, I do have a unique constraint:


CREATE CONSTRAINT UniqueTableID ON (t:AXX) ASSERT t.id IS UNIQUE

OK, I figured it out, and I feel like such a fool.
My code was correct to begin with. I just took for granted the quality of the data within the CSV file. My "table_num" column had trailing spaces where the "tables" column did not. So there....

3 Likes

Hi @FourMoBro,
space characters at the beginning or end of strings is a common problem in all data processing pipelines. Sometimes also lower case versus upper case creates similar problems. That's why I think the following guideline/rule is important: when using "match" or "merge" always use the unique key constraints and prepare the value with a "trim()" statement and (if possible from semantical perspective) enforce lower case with "tolower()".
Example: "merge(p:person {name:tolower(trim('a string'))}" where "name" is the property that holds the unique key constraints