Using Nested `LOAD CSV` with `apoc.periodic.iterate`


(Michael McKenzie) #1

I am suing LOAD CSV and apoc.periodic.iterate to import data. I have one .csv that contains the folder and file names that contain the data to be imported. Below is the code I currently have:

CALL apoc.periodic.iterate("
LOAD CSV WITH HEADERS FROM 'file:///NBIFilesListFiles.csv' AS line1
WITH collect('file:///'+line1.Folder+'/'+line1.File) as Files
UNWIND Files AS file
LOAD CSV WITH HEADERS FROM file AS row RETURN row",
"
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,

			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
",
{batchSize:1000,iterateList:true})

My goal is to use LOAD CSV to load the .csv that contains all the other file names to import and pass that into apoc.periodic.iterate. Is that possible?

Right now what I have in the code above tends to bog down my machine and cause Neo4j Desktop to crash. I have adjusted these settings to try and help

dbms.memory.heap.initial_size=1G
dbms.memory.heap.max_size=2G

Help with `apoc.periodic.iterate`
(Andrew Bowman) #2

If I recall we don't recommend using LOAD CSV within apoc.periodic.iterate(), but you can use apoc.load.csv()


(Michael McKenzie) #3

@andrew.bowman so the recommendation is to use apoc.load.csv() inside of apoc.periodic.iterate()?


(Michael McKenzie) #4

@andrew.bowman I somehow skipped over that in the APOC library. I have taken that and applied it as follows (I simplified the MERGE statements in my original post):

CALL apoc.periodic.iterate(
'
CALL apoc.load.csv({"https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318"}) yield map as row return row
','
WITH row.URL as url
CALL apoc.periodic.iterate(
'
CALL apoc.load.csv({url}) yield map as row1 RETURN row1
','
MERGE (state:State {id: row1.STATE_CODE_001})
ON CREATE SET state.name = row1.STATE_CODE_001
',
{batchSize:1000, iterateList:true})
',
{batchSize:10, iterateList:true})

I am getting an error on the line CALL apoc.load.csv({url}) yield map as row1 RETURN row1 that reads
"Neo.ClientError.Statement.SyntaxError: Invalid input 'A': expected 'o/O' (line 8, column 2 (offset: 302))
"CALL apoc.load.csv({url}) yield map as row1 RETURN row1"


Help with `apoc.periodic.iterate`
(Michael McKenzie) #5

SOLUTION: (explanation below)

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL
CALL apoc.periodic.iterate(
'
LOAD CSV WITH HEADERS FROM $url AS row RETURN row
//CALL apoc.load.csv($url) YIELD map AS row RETURN row
','
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,
			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
',
{batchSize:1000, parallel:false, params:{url:fileURL}}) YIELD batches, total
RETURN batches, total

You will notice that //CALL apoc.load.csv($url) YIELD map AS row RETURN row is commented out. While that is supposed to work. One of the files at the URL I was calling was throwing an error, so at this point I had to use LOAD CSV instead.

EXPLANATION

The first part of the code

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL

loads the file URLs from the shared Google Sheet. The WITH clause is included to "ignore" rows where the file URL has not been verified. All that satisfy are put into a collection and then using the UNWIND passed in to the next batch of code contained within the apoc.periodic.iterate function. The code within apoc.periodic.iterate load the data for each url and add it to the graph using MERGE.