cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! Site migration is underway. Pause, resolving how to handle anonymous content

Migrating Database to Neo4J as a newbie

kathleen
Node Link

Hello!

I am new to Neo4j and attempting to bring over another graph database that I have structured in a database as, Name1, Entity1, Relationship1, Relationship2, Name2, Entity2, Source. As an example, it might say: Julie, Person, has employer, has employee, Company, Costco, www.linkedin.com.

I have been toying with different ways of importing the data and the best I have found is to load a CSV file with each entity type/name. For example, one file for all the Person nodes with their labels and then one with Company nodes.

Then loaded a CSV with the relationships using the following CSV Load in attempts to get the following to get a basic structure of:

Name1, Relationship 1, Name2

Or

Julie has employer Costco

LOAD CSV FROM "http://localhost:11001/project-ba2a314e-1c09-4a9b-b568-3bc04a75ad42/COMPLETE.csv" AS line
MERGE (n1:Name1 {name:line[1]})
MERGE (n2:Name2 {name:line[5]})
WITH n1, n2, line
CALL apoc.merge.relationship(n1,line[1],{},{},n2) YIELD rel
RETURN n1, n2, line

Problem is, even with this basic load I am getting a duplicate node for both Julie and Costco. I am not sure how to correct this issue or if there is an entirely better way to load the data. I have nearly 30,000 nodes structured this way so I would prefer to not have to do any massive data massaging in possible.

1 ACCEPTED SOLUTION

johnmattgrogan
Node Clone

Hi Kathleen.

So a few questions.

  1. in your LOAD CSV statement you aren't including headers. Are there headers in your file? Accessing your rows / lines using header names can help sometimes with debugging.
  2. In your original query you using MERGE to add Entity1 and Entity2, however you are giving them labels :Name1 and :Name2. In your data Costco can show up in either Name1 or Name2 therefore a MERGE will create a node for (n:Name1 {name:'Costco'}) and for (n:Name2 {name:'Costco'}). This will create the duplication for all the values that show up in both Name1 and Name2 columns. Cypher will not let you dynamically set Labels out of the box, but you can use APOC.
LOAD CSV WITH HEADERS FROM 'https:path.to.file' as line
CALL apoc.merge.node([line['Entity1']], {name:line['Name1']}) yield node as n1
CALL apoc.merge.node([line['Entity2']], {name:line['Name2']}) yield node as n2
CALL apoc.merge.relationship(n1,line['Relationship1'],{},{},n2) YIELD rel as r1
CALL apoc.merge.relationship(n2,line['Relationship2'],{},{},n1) YIELD rel as r2
return n1, n2, r1, r2

This should solve the duplication problem.

Lastly, if you don't mind me asking, what's the use case for having each the employee and the employer have a different relationship? Once the nodes are connected the single relationship should suffice, unless having the direction of the relationship is important to your use case.

View solution in original post

12 REPLIES 12

johnmattgrogan
Node Clone

Hi Kathleen,

I it sounds like you are on the right path when trying to split your entities and relationships into seperate .csv files. Speaking from experience, is it possible there are some trailing white spaces or some some other data issues in the .csv files? Do you mind posting what the first few lines of your .csv looks like if it fits here?

Also, you have line[1] as both a name property and being set as the relationshipType in the apoc.merge.relationship function.

Hi John,

Thank you for the reply and you are definitely right, line[1] should be line[3]. I made the adjustment. I also trimmed all the csv docs again and reloaded them. I am having the same issue. I believe the issue has to do with the relationship data csv not recognizing if the entity is a person or company and simply uploading the node type as Name1 and Name2, if that makes sense?

I attached here a sample of the data:

Hi John
You can try algorithm like louvain similarity and decide it by yourself.
Thanking you
Yours faithfully
Sameer SG

Hi Sameer,

I'm not sure I understand your question. Is this related to the possible issues in the data?

Thanks!

-Matt

kathleen
Node Link

[{'\ufeffID': '190004',
'Name1': 'Joe Smith',
'Entity1': 'Person',
'Relationship1': 'has employer',
'Relationship2': 'has employee',
'Name2': 'Costco',
'Entity2': 'Company',
'Source': 'costco website'},
{'\ufeffID': '190005',
'Name1': 'Daisy Duke',
'Entity1': 'Person',
'Relationship1': 'has employer',
'Relationship2': 'has employee',
'Name2': 'Trader Joes',
'Entity2': 'Company',
'Source': 'traderjoes website'},
{'\ufeffID': '190006',
'Name1': 'Costco',
'Entity1': 'Company',
'Relationship1': 'has competitor',
'Relationship2': 'has competitor',
'Name2': 'Walmart',
'Entity2': 'Company',
'Source': 'costco website'},
{'\ufeffID': '190007',
'Name1': 'Timothy Neal',
'Entity1': 'Person',
'Relationship1': 'shops at',
'Relationship2': 'has shopper',
'Name2': 'Costco',
'Entity2': 'Company',
'Source': 'costco website'},
{'\ufeffID': '190008',
'Name1': 'Trader Joes',
'Entity1': 'Company',
'Relationship1': 'has shopper',
'Relationship2': 'shops at',
'Name2': 'Daisy Duke',
'Entity2': 'Person',
'Source': 'traderjoes website'},
{'\ufeffID': '190009',
'Name1': 'Safeway',
'Entity1': 'Company',
'Relationship1': 'has employee',
'Relationship2': 'has employer',
'Name2': 'Timothy Neal',
'Entity2': 'Person',
'Source': 'safeway website'},
{'\ufeffID': '190010',
'Name1': 'Aldi',
'Entity1': 'Company',
'Relationship1': 'has employee',
'Relationship2': 'has employer',
'Name2': 'Bob Smith',
'Entity2': 'Person',
'Source': 'aldi website'},
{'\ufeffID': '190011',
'Name1': 'Lidl',
'Entity1': 'Company',
'Relationship1': 'has employee',
'Relationship2': 'has employer',
'Name2': 'Jordan Julie',
'Entity2': 'Person',
'Source': 'lidl website'},
{'\ufeffID': '190012',
'Name1': 'Grocery Store Association',
'Entity1': 'Company',
'Relationship1': 'has Members',
'Relationship2': 'Member of',
'Name2': 'Aldi',
'Entity2': 'Company',
'Source': 'groceries website'},
{'\ufeffID': '190013',
'Name1': 'Safeway',
'Entity1': 'Company',
'Relationship1': 'Member of',
'Relationship2': 'has Member',
'Name2': 'Grocery Store Association',
'Entity2': 'Company',
'Source': 'safeway website'}]

Great thank you! Let me check a few things and I'll check back with you.

johnmattgrogan
Node Clone

Hi Kathleen.

So a few questions.

  1. in your LOAD CSV statement you aren't including headers. Are there headers in your file? Accessing your rows / lines using header names can help sometimes with debugging.
  2. In your original query you using MERGE to add Entity1 and Entity2, however you are giving them labels :Name1 and :Name2. In your data Costco can show up in either Name1 or Name2 therefore a MERGE will create a node for (n:Name1 {name:'Costco'}) and for (n:Name2 {name:'Costco'}). This will create the duplication for all the values that show up in both Name1 and Name2 columns. Cypher will not let you dynamically set Labels out of the box, but you can use APOC.
LOAD CSV WITH HEADERS FROM 'https:path.to.file' as line
CALL apoc.merge.node([line['Entity1']], {name:line['Name1']}) yield node as n1
CALL apoc.merge.node([line['Entity2']], {name:line['Name2']}) yield node as n2
CALL apoc.merge.relationship(n1,line['Relationship1'],{},{},n2) YIELD rel as r1
CALL apoc.merge.relationship(n2,line['Relationship2'],{},{},n1) YIELD rel as r2
return n1, n2, r1, r2

This should solve the duplication problem.

Lastly, if you don't mind me asking, what's the use case for having each the employee and the employer have a different relationship? Once the nodes are connected the single relationship should suffice, unless having the direction of the relationship is important to your use case.

Hi! Thank you! I am going to take a crack out of this later today and see if I can get it to work. I don't need to have both sides of the relationship in the data. I just have that set up in the database from old system I used to model data.

This worked perfectly! Thank you!

ameyasoft
Graph Maven
If you have already created Name1 and Name2 nodes then you should use 'MATCH' in place of 'MERGE'.  Expecting that Name1 and Name2 have a property, 'entity' that stores 'person' /'company' values. 

Using MERGE statement with only name , the system cannot find Name1 or Name2 node just with name property and hence creates a new nodes for Person and Company. Better approach would have been having nodes with labels 'Person' and 'Company'. Also I prefer to have one relationship between Person and Company as (joe Smith)-[:HAS_EMPLOYER]->(Costco) or (joe Smith)-[:EMPLOYEE_OF]->(Costco) 

LOAD CSV FROM "http://localhost:11001/project-ba2a314e-1c09-4a9b-b568-3bc04a75ad42/COMPLETE.csv 1" AS line
MATCH (n1:Name1 {name:line[1]})
MATCH (n2:Name2 {name:line[5]})
WITH n1, n2, line
CALL apoc.merge.relationship(n1,line[1],{},{},n2) YIELD rel
RETURN n1, n2, line

Thank you for your input! It looks like I was able to fix it using johnmattgrogan's input.

Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

All the sessions of the conference are now available online