How to manage common properties in different rows while importing from CSV

paolodipietro58 · October 22, 2021, 9:46am

Neo4j Server Community edition 4.3.3 Ubuntu 20.04

Hi all,

For a long time I used a query to import a complex CSV file, and it run nicely.
Yesterday night I 'dream' an horrible bug in the query, but I have no idea how to solve it.

The problem resides in the structure of the data, and in the fact that I must maintain the existing DB content:
Each line describe the following info:

- a primary node, each one with:
   - three different labels for the node;
   - one of more relationships with another node (regions);
- one relationship with a complex object

So, in my import query, I do the following:

LOAD CSV WITH HEADERS FROM 'xxx' as row FIELDTERMINATOR ';'

// Select a primary node and change its name
WITH row, split(row.Appellation,",") as appellationNames
    MERGE  (appellation:Appellation { uuid: row.`Appellation UUID` })
	    SET appellation.name = appellationNames

// Replace the primary node labels
WITH row, appellation
   CALL apoc.create.setLabels(appellation,[row.`EU Classification`,row.`Classification`, 'Appellation']) YIELD node

// delete and replace the relationships with the regions
WITH row, appellation, split(row.Region,",") as regions

UNWIND regions as aRegion

	MATCH (region:Region)
		WHERE ToLower(region.name) = trim(ToLower(aRegion))

	WITH row, appellation, region
		MATCH (appellation)-[oldRelationship:IS_PRODUCED_IN]->(region) 
		DETACH DELETE oldRelationship

	WITH row, appellation, region
		MERGE (appellation)-[r:IS_PRODUCED_IN]->(region) 
	
// then finally start working on the core content which differ for each line
	with row, appellation
        <DO A LOT OF OTHER THINGS WITH REMAINING INFO>

Where is the problem?
The input file is ordered, and I have a bunch of sequential rows having the same appellation, labels and regions (because these are common data), but with all the remaining information that differs.

So, if I have a sequence of, just to say, 100 rows with the same initial info, I repeat 100 time the query described above, deleting and recreating the same info for each row, when I would execute just the latest query content <DO A LOT OF THINGS WITH REMAINING INFO>for all the lines except the first.

If I have been able to describe the problem, there someone who see an approach to solve it?

A nice workaround, for me, would be to have the ability to access the previous row ....

Thanks

giuseppe_villan · October 22, 2021, 1:33pm

@paolodipietro58
If I understood well, your problem is that the UNWIND cycle every row, even if appellation, labels and regions are equal.
Therefore, you can aggregate rows based on the same above properties, for example as follows (up to the line 6 is mock data, just to show you how I thought about this one, that is, first 3 row with equal props, and other differents. You have to change it with LOAD CSV WITH HEADERS FROM 'xxx' as row FIELDTERMINATOR ';'):

unwind [{Appellation: "q,w,e", `EU Classification`: 'AAA', `Classification`: "1", Region: "1,2,3", `Appellation UUID`: '123-31312', Other: 123123},
{Appellation: "q,w,e", `EU Classification`: 'AAA', `Classification`: "1", Region: "1,2,3", `Appellation UUID`: '4324-555', Other: 6456565656},
{Appellation: "q,w,e", `EU Classification`: 'AAA', `Classification`: "1", Region: "1,2,3", `Appellation UUID`: '322-554',Other: 'wfhaffsdak'},
{Appellation: "a,s,d", `EU Classification`: 'AAA', `Classification`: "1", Region: "1,2,3", `Appellation UUID`: '342-6545',Other: '6456565656'},
{Appellation: "q,w,e", `EU Classification`: 'ZZZ', `Classification`: "1", Region: "1,2,3", `Appellation UUID`: '23432-3534',Other: 'dsbjkhffdbkjh'}] as row
// --end mock data

with split(row.Appellation,",") as appellationNames, row.`EU Classification` as label, row.`Classification` as classification, split(row.Region,",") as regions, collect(row) as commonRows
MERGE (appellation:Appellation { uuid: commonRows[0].`Appellation UUID` })
SET appellation.name = appellationNames

// Replace the primary node labels
with appellation, label, classification, regions, commonRows
CALL apoc.create.setLabels(appellation,[label,classification, 'Appellation']) YIELD node
WITH node, regions, appellation, commonRows

UNWIND regions as aRegion
 MATCH (region:Region)
 WHERE ToLower(region.name) = trim(ToLower(aRegion))

 WITH commonRows, appellation, region
 MATCH (appellation)-[oldRelationship:IS_PRODUCED_IN]->(region) 
 DETACH DELETE oldRelationship

 WITH commonRows, appellation, region
 MERGE (appellation)-[r:IS_PRODUCED_IN]->(region) 

 WITH commonRows, appellation , region
 // <DO A LOT OF OTHER THINGS WITH REMAINING INFO>

The "core" thing is the line with split(row.Appellation,",") as appellationNames, row.EU Classification as label, row.Classification as classification, split(row.Region,",") as regions, collect(row) as commonRows, where I collect all rows based on appellationNames, label, classification, regions.

Then, you can possibly cycle commonRows in <DO A LOT OF OTHER THINGS WITH REMAINING INFO> part.

In this example I put first Appellation UUID found (based on common properties), otherwise you can change commonRows[0] to commonRows[size(commonRows) - 1]

paolodipietro58 · October 22, 2021, 1:42pm

Hi Giuseppe,

Yes, I understand what you are trying to tell me, but, IMHO, this is a solution that can be used with a standard query.

I don't understand, and you would try to describe, how can I obtain such result starting from LOAD CSV.

giuseppe_villan · October 25, 2021, 12:57pm

Of course. Let's suppose I have the following csv test.csv:

Appellation;EU Classification;Classification;Region;Appellation UUID;Other
q,w,e;AAA;1;1,2,3;123-31312;123123
q,w,e;AAA;1;1,2,3;4324-555;6456565656
q,w,e;AAA;1;1,2,3;322-554;wfhaffsdak
a,s,d;AAA;1;1,2,3;342-6545;6456565656
q,w,e;ZZZ;1;1,2,3;23432-3534;dsbjkhffdbkjh

I could do:

LOAD CSV WITH HEADERS FROM 'file:///test.csv' as row FIELDTERMINATOR ';' 

with split(row.Appellation,",") as appellationNames, row.`EU Classification` as label, row.`Classification` as classification, split(row.Region,",") as regions, collect(row) as commonRows
MERGE (appellation:Appellation { uuid: commonRows[0].`Appellation UUID` })
SET appellation.name = appellationNames

// Replace the primary node labels
with appellation, label, classification, regions, commonRows
CALL apoc.create.setLabels(appellation,[label,classification, 'Appellation']) YIELD node
WITH node, regions, appellation, commonRows

UNWIND regions as aRegion
 MATCH (region:Region)
 WHERE ToLower(region.name) = trim(ToLower(aRegion))

 WITH commonRows, appellation, region
 MATCH (appellation)-[oldRelationship:IS_PRODUCED_IN]->(region) 
 DETACH DELETE oldRelationship

 WITH commonRows, appellation, region
 MERGE (appellation)-[r:IS_PRODUCED_IN]->(region) 

 WITH commonRows, appellation , region
 // <DO A LOT OF OTHER THINGS WITH REMAINING INFO>

Topic		Replies	Views
Merging nodes with common identities from multiple CSV files Neo4j Graph Platform migrated	4	216	November 23, 2022
CSV import issue Import / Export	26	711	June 21, 2023
Importing CSV with multiple columns containing arrays that are related Import / Export apoc , cypher , unwind , import , csv	2	587	January 21, 2021
Data modelling in Neo4j Modeling cypher , import	6	500	December 31, 2020
Importing a csv file with a column that has a string with csv values Import / Export cypher , import , neo4j-desktop	5	683	May 15, 2021

July Summer Fun!

How to manage common properties in different rows while importing from CSV

Related topics