I'm importing a CSV which includes vehicle details. Here in Australia some classes of vehicle do not need to have registration, and the CSV has a blank field for these. I'd like to generate an "artificial" rego for these rows with a pattern like "UNREG1", "UNREG2", and so on.
Whilst loading the CSV, can I maintain a variable which I increment as necessary to produce these values? I've started with:
WITH 1 AS unreg
LOAD CSV WITH HEADERS FROM 'file:///file.csv' AS row
WITH unreg, row, CASE trim(row.REGO) WHEN "" THEN "UNREG" + toString(unreg) ELSE row.REGO END AS NewRego
MERGE (v:Vehicle {Rego: NewRego})
WITH row, CASE trim(row.REGO) WHEN "" THEN unreg+1 ELSE unreg END AS unreg
RETURN unreg;
It's not working, unfortunately: unreg is always 1, and there are a dozen rows with no REGO value supplied. I'm not particularly experienced with Cypher, though. I'm trying to avoid pre-processing the CSV before loading it into neo4j, but if I have to then I will.
With LOAD CSV, you can use linenumber() as shown in the doc:
LOAD CSV WITH HEADERS FROM 'file:///file.csv' AS row
WITH row, CASE trim(row.REGO) WHEN "" THEN "UNREG" + toString(linenumber()) ELSE row.REGO END AS NewRego
MERGE (v:Vehicle {Rego: NewRego})
RETURN NewRego;
@cobra: Unfortunately not, but it's likely to be the way I'm loading the data. When a CSV file has an empty cell, will it be presented in Cypher as NULL or an empty string? At the moment the CASE statement looks for both but it feels clumsy. The script is:
LOAD CSV WITH HEADERS FROM 'file:///file.csv' AS row
WITH row,
(CASE trim(row.REGO)
WHEN "" THEN "UNREG" + toString(linenumber())
WHEN NULL THEN "UNREG" + toString(linenumber())
ELSE row.REGO END) AS NewRego
MERGE (v:Vehicle {Rego: toUpper(replace(NewRego, " ", ""))})
RETURN count(v);
Searching for the data returns nothing, so I'm obviously doing something wrong.
linenumber() may work, but if you run multiple LOAD CSVs then linenumber() will repeat. For example, if the first LOAD CSV reads a file with 100 rows of data, those are line numbers 1 through 100; if a second LOAD CSV reads a file with 250 rows of data, those are line numbers 1 through 250. So you could end up with multiple nodes sharing the same UNREG## value.
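One way around that (a sketch only - the second file name and the per-file prefix are hypothetical, not from the thread) is to bake a file-specific prefix into the generated value so each import produces its own range:

```
// Hypothetical second import: a file-specific prefix ('UNR2') keeps the
// generated regos distinct from those created by the first LOAD CSV.
LOAD CSV WITH HEADERS FROM 'file:///file2.csv' AS row
WITH row, CASE WHEN row.REGO IS NULL OR trim(row.REGO) = ''
               THEN 'UNR2' + toString(linenumber())
               ELSE row.REGO END AS NewRego
MERGE (v:Vehicle {Rego: NewRego});
```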
If the cell is empty, the value will be considered null, so this query should work by using the coalesce() function:
LOAD CSV WITH HEADERS FROM 'file:///file.csv' AS row
WITH coalesce(row.REGO, 'UNREG_'+apoc.create.uuid()) AS Rego
MERGE (v:Vehicle {Rego: Rego})
RETURN Rego;
Thanks. I've copied the suggestion from the Cypher guide and re-read the import file a number of times to get different clusters of nodes (i.e. once per label). The vehicle details are the only place where I need to artificially create missing data. In an RDBMS the vehicle rego would be the primary key; with neo4j it's the unique name used to create the relationships between a vehicle node and an incident, or policy.
@cobra: I thought about using something like a UUID although I hadn't looked for that functionality in detail.
Unfortunately that's going to create a rego that's too long to be believable. That sounds like a small thing, and if neo4j were the only data source it wouldn't matter. But the data is being taken from a SaaS system, and I'm creating a graph to illustrate the value of a different view of the data we have. Here a rego has to be 9 characters or less, and I pretty much have to stick to that limit.
Using a pattern like "UNREGnnnn", I can handle a 10,000-line import file if the digits are taken from the line number, and that's easily enough for a demo.
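Putting the earlier suggestions together, a sketch of that pattern might look like this (assumptions: the searched CASE catches both null and blank cells, and linenumber() starts at 2 for the first data row because the header occupies line 1):

```
LOAD CSV WITH HEADERS FROM 'file:///file.csv' AS row
// Generate 'UNREG' + line number for missing regos; normalise the rest.
WITH row, CASE WHEN row.REGO IS NULL OR trim(row.REGO) = ''
               THEN 'UNREG' + toString(linenumber())
               ELSE toUpper(replace(row.REGO, ' ', '')) END AS NewRego
MERGE (v:Vehicle {Rego: NewRego})
RETURN count(v);
```

This stays within 9 characters for files of up to 9,999 lines.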
I'll make a note of the suggestion though, it's sure to come up in future.
I've only shown a few rows and a few fields as an illustration. I haven't included the VIN (Vehicle Identification) for example because it's rarely provided and considerably less important.
Thanks, that's the approach I'll go with for the moment.
I'm trying to avoid pre-processing the CSV for a couple of reasons: the more capability I can show from Cypher, the more credibility neo4j has as a useful tool, and we also have a strong preference for reducing the amount of code that will have to be supported. Mind you, if this is approved there will be automated regular data extraction from the SaaS product as well as the Cypher code, so a bit of Python is unlikely to make much difference.
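For what it's worth, the pre-processing option really is only a few lines of Python. A sketch (the REGO field name comes from the thread; the helper name and sample data are mine):

```python
import csv
import io

def fill_unreg(rows, field="REGO", prefix="UNREG"):
    """Replace blank registration fields with UNREG1, UNREG2, ..."""
    counter = 0
    out = []
    for row in rows:
        value = (row.get(field) or "").strip()
        if not value:
            counter += 1
            row[field] = f"{prefix}{counter}"
        out.append(row)
    return out

# Hypothetical sample data: one valid rego, one empty cell, one whitespace cell.
src = io.StringIO("REGO,MAKE\nABC123,Toyota\n,Ford\n ,Holden\n")
rows = fill_unreg(list(csv.DictReader(src)))
print([r["REGO"] for r in rows])  # ['ABC123', 'UNREG1', 'UNREG2']
```

Unlike coalesce() in Cypher, this also catches whitespace-only cells, since it strips the value before testing it.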
Here is my solution:
I generated a unique number from the current date, the time, and linenumber(). I used the data you presented.
Here is the Cypher:
LOAD CSV WITH HEADERS FROM "file:///rego.csv" AS row
WITH row, linenumber() AS lnbr
WITH row, lnbr, apoc.date.format(apoc.date.parse(toString(date()), 'ms', 'yyyy-MM-dd'), 'ms', 'yy/MM/dd') AS dt
WITH row, lnbr, split(dt, '/') AS dt2
UNWIND dt2 AS dt3
WITH row, lnbr, sum(toInteger(dt3)) AS dt31
WITH row, lnbr, localtime({timezone: 'America/Los_Angeles'}) AS sfo, dt31
WITH row, lnbr, split(toString(sfo), ':') AS sfo1, dt31
UNWIND sfo1 AS sfo2
WITH row, lnbr, dt31, sum(toInteger(sfo2)) AS sfo3
WITH row, toString(dt31) + toString(sfo3) + toString(lnbr) AS final
WITH coalesce(row.REGO, 'UNREG' + final) AS Rego
MERGE (v:Vehicle {Rego: Rego})
Unreg nodes: Rego:UNREG41363, Rego:UNREG41365
Wow! Thank you for that - there's a lot I haven't come across before in that answer. It's helpful to see what looks to me to be a sophisticated solution. Cypher is clearly more able than I'd realised!