Using Nested `LOAD CSV` with `apoc.periodic.iterate`

mckenzma · February 21, 2019, 1:19am

I am suing LOAD CSV and apoc.periodic.iterate to import data. I have one .csv that contains the folder and file names that contain the data to be imported. Below is the code I currently have:

CALL apoc.periodic.iterate("
LOAD CSV WITH HEADERS FROM 'file:///NBIFilesListFiles.csv' AS line1
WITH collect('file:///'+line1.Folder+'/'+line1.File) as Files
UNWIND Files AS file
LOAD CSV WITH HEADERS FROM file AS row RETURN row",
"
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,

			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
",
{batchSize:1000,iterateList:true})

My goal is to use LOAD CSV to load the .csv that contains all the other file names to import and pass that into apoc.periodic.iterate. Is that possible?

Right now what I have in the code above tends to bog down my machine and cause Neo4j Desktop to crash. I have adjusted these settings to try and help

dbms.memory.heap.initial_size=1G
dbms.memory.heap.max_size=2G

andrew_bowman · February 21, 2019, 1:48am

If I recall we don't recommend using LOAD CSV within apoc.periodic.iterate(), but you can use apoc.load.csv()

mckenzma · February 21, 2019, 3:22pm

@andrew_bowman so the recommendation is to use apoc.load.csv() inside of apoc.periodic.iterate()?

mckenzma · February 22, 2019, 2:56am

@andrew_bowman I somehow skipped over that in the APOC library. I have taken that and applied it as follows (I simplified the MERGE statements in my original post):

CALL apoc.periodic.iterate(
'
CALL apoc.load.csv({"https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318"}) yield map as row return row
','
WITH row.URL as url
CALL apoc.periodic.iterate(
'
CALL apoc.load.csv({url}) yield map as row1 RETURN row1
','
MERGE (state:State {id: row1.STATE_CODE_001})
ON CREATE SET state.name = row1.STATE_CODE_001
',
{batchSize:1000, iterateList:true})
',
{batchSize:10, iterateList:true})

I am getting an error on the line CALL apoc.load.csv({url}) yield map as row1 RETURN row1 that reads
"Neo.ClientError.Statement.SyntaxError: Invalid input 'A': expected 'o/O' (line 8, column 2 (offset: 302))
"CALL apoc.load.csv({url}) yield map as row1 RETURN row1"

mckenzma · March 14, 2019, 3:13pm

SOLUTION: (explanation below)

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL
CALL apoc.periodic.iterate(
'
LOAD CSV WITH HEADERS FROM $url AS row RETURN row
//CALL apoc.load.csv($url) YIELD map AS row RETURN row
','
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,
			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
',
{batchSize:1000, parallel:false, params:{url:fileURL}}) YIELD batches, total
RETURN batches, total

You will notice that //CALL apoc.load.csv($url) YIELD map AS row RETURN row is commented out. While that is supposed to work. One of the files at the URL I was calling was throwing an error, so at this point I had to use LOAD CSV instead.

EXPLANATION

The first part of the code

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL

loads the file URLs from the shared Google Sheet. The WITH clause is included to "ignore" rows where the file URL has not been verified. All that satisfy are put into a collection and then using the UNWIND passed in to the next batch of code contained within the apoc.periodic.iterate function. The code within apoc.periodic.iterate load the data for each url and add it to the graph using MERGE.

mckenzma · August 12, 2019, 2:15pm

Just following up with some additional information about the websocket error I was encountering.

In several of the files there are internal ' and " within column entries. Using CALL apoc.load.csv(url,{header:true, quoteChar: "\u0000"}), in lieu of LOAD CSV WITH HEADERS, where \u0000 is a null character I was able to go through all the file rows and create nodes.

@jsmccrumb and @mdfrenchman helped me with figuring out how to use \u0000 to get around this issue.

Topic		Replies	Views
Apoc.load.csv fail (there's no apoc called apoc.load.csv) Neo4j Graph Platform migrated	3	507	January 11, 2023
Using neo4j-import with existing database in Neo4j Desktop Import / Export	7	2670	September 24, 2018
APOC load csv file with call apoc.merge.node does not execute Procedures & APOC load-csv , apocperiodiciterate	4	1850	August 31, 2019
Import csv or api data in python using apoc.periodic.iterate Neo4j Graph Platform migrated	1	259	January 31, 2023
Help with `apoc.periodic.iterate` Procedures & APOC	7	5089	March 14, 2019

Using Nested `LOAD CSV` with `apoc.periodic.iterate`

Related topics