Showing results for 
Search instead for 
Did you mean: 

Aggregating csv columns before data load into Neo4j database

Hello, I am using Neo4j version 4.2.1. I am trying to load some data from a csv file into the target Neo4j database and I am wondering how to aggregate / massage the below data before I load it. Below is a sample csv file format and the requirements on how it needs to be done:

'M10344' 24 14 2 'Date' 12

  1. Truncate first letter 'M' from the A -column and load the rest for the target attribute- say 'X' for the Node 1
  2. (B+C)-F and load the result for the target attribute - say 'Y' for the Node 1
  3. (B+C)*D and load the result for the target attribute - say 'Z' for the Node 1
  4. Create a brand new attribute called - 'DateTimestamp' to report the system generated time into some target attribute - say 'ImportTimestamp' for the Node 1

Below is the cypher i was using for non-aggregation columns. Can you customize it for the above requirements ?
'CALL apoc.periodic.iterate( "CALL apoc.load.csv('csv.file.csv') YIELD map WITH map where map.A<>'' return map ", "MERGE (Node1:Node1{A:map.A}) ON CREATE SET Node1.B=map.B,....... ", {batchSize:100, parallel:false})

Thank you



LOAD CSV WITH HEADERS FROM 'file:///file.csv' AS row
CREATE (n:Node)
SET n.x = substring(row.A,1}
SET n.y = toInteger(row.B) + toInteger(row.C) - toInteger(row.F)
SET n.z = toInteger(row.B) + toInteger(row.C) * toInteger(row.D)
SET n.timeStamp = timeStamp()

Here Is an simplify option, APOC is indeed the way to go only if you have a crazy amount of data or more complex operations to do. A lot more options exists to import data but it's a good start.

NOTE: The file.csv must be in the import folder of your database or accessible from an http link if you did not set otherwise in your configuration.

Limit the commit to 100 might make the creation process slower for no benefit.
1000 is the default recommended value.

Thank you Gabriel. Below is what I am doing but it does not load for some reason. Can you please tell me what I may be doing not right ?

CALL apoc.periodic.iterate( "CALL apoc.load.csv('WORK_ORDER_MODATA.csv') YIELD map WITH map where map.ORDER_NUMBER<>'' return map ", "MERGE (WOLine:WOLine{WOLineOriginationID:map.ORDER_NO}) ON CREATE SET WOLine.DateTimeImport=datetime(),WOLine.WOLineCreatedate=map.RECORD_CREATED_DT,WOLine.WOLineUpdatedate=map.RECORD_CHANGED_DT,WOLine.WOLineID=map.SCHEDULED_ITEM_NO,WOLine.WOLineTypeID=NULL,WOLine.WOLineOriginationTyID=substring(map.ORDER_NO,1),WOLine.WOLineDemandTypeID=NULL,WOLine.WOLineDemandID=map.JOB_NO,WOLine.WOLineOriginationConfigID=map.APC_CONFIG,WOLine.WOLineStatusCdID=NULL,WOLine.WOLineStatusCode=NULL,WOLine.WOLineRevReasonCd=NULL,WOLine.WOLineRevReasonCdID=NULL,WOLine.WOLineRevReasonDate=NULL,WOLine.WOLineOperationNumber=map.OPERATION_WHERE_USED,WOLine.WOLineCreatedate=map.SRC_UPDATED_DT,WOLine.WOLineDueDate=map.REQUIRED_DT,WOLine.WOLineIssuedate=map.LAST_ISSUE_DT,WOLine.WOLineClosedate=map.CLOSEOUT_DT,WOLine.WOLineRevDate=map.SRC_UPDATED_DT,WOLine.WOLineStatusUptdate=NULL,WOLine.WOLineDesc=NULL,WOLine.WOLineCurrencyID=NULL,WOLine.WOLineCostMaterialIssued=toInteger(map.UNIT_COST) * (toInteger(map.ISSUE_QTY_TOTAL_TO_DATE) + toInteger(map.COMPONENT_SCRAP)),WOLine.WOLineReqQty=map.TOTAL_QTY_REQUIRED,WOLine.WOLineIssuedQty=map.ISSUE_QTY_TOTAL_TO_DATE,WOLine.WOLineOutstdngQty=toInteger(map.TOTAL_QTY_REQUIRED)-toInteger(map.ISSUE_QTY_TOTAL_TO_DATE)+toInteger(map.COMPONENT_SCRAP),WOLine.WOLineScrpQty=map.COMPONENT_SCRAP ", {batchSize:100, parallel:false})

No worries. I figured it. It was some corruption going on with the csv and I started reading it using list map [0] based on the index position of the field. Thank you

Graph Steward

Here's a CSV lint program that can check the validity of your data.

Thank you guys. Another quick question? If I were to map a non-existing column in my csv file to NULL , I do it something like Node.TargetAttribute=NULL. In the same way if I have to default all the values to '0' , how do you recommend doing it ? Thanks much in advance