cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! Site migration is underway. Expect disruption to service on Thursday, Feb. 9!

How to provide boolean (true/false) values to LOAD CSV?

tms
Graph Buddy

Is there a way -- short of explicit conversions in the load -- to persuade LOAD CSV to recognize a boolean value in its input?

LOAD CSV does the expected thing when unquoted numbers (integers and floats) are provided as the value for a field.

When the .csv file contains a string ("true"), then the value is loaded as a string. When the same value contains an unquoted string (true), then the result value is null.

I know that I can manually patch the resulting nodes after the load, but it would be cleaner if there is a way to have LOAD CSV recognize a boolean value in a given field. Some examples might be:

 

"booleanField", "integerField", "floatField", "stringField"
True, 1234, 4.321, "Hello sailor"
TRUE, 1234, 4.321, "Hello sailor"
true, 1234, 4.321, "Hello sailor"

In the trivial example above, the values of integerField, floatField, and stringField are already coerced as expected.

How can I do the same with the distinguished (to Neo4J) true and false values?

1 ACCEPTED SOLUTION

tms
Graph Buddy

Heh. I have written code that says that: the value: true (without quotes) is arbitrarily coerced to the boolean value true. When quotes are present ("true"), then that value is coerced to a string.

I suggest that it isn't "incomprehensible", it is instead arbitrary. The resulting heuristic is "If you want to get a string, then delimit the value with double-quote."

It sounds to me as though this is either an oversight or a bug in the Cypher parser. There is no technical issue that blocks coercing boolean values -- it is no different from recognizing and coercing integer and float values.

FWIW, there is at least one attempt at an RFC:  (https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml)

In the "General" section of that, the last bullet reads:

"In some uses, there is an assumption of strong data typing, with unquoted fields considered to be numeric, and quoted fields considered to be text data."

I suggest merely that there is no technical barrier to treating an unquote field containing true or false as a Boolean value. For extra credit -- the "apoc.export.csv.query" could use the strong typing of the underlying Java representation to not quote numeric and boolean property values.

I suppose at the end of the day it would have been easier for me to export and import json -- that probably moots the question. 🙂

View solution in original post

6 REPLIES 6

koji
Ninja
Ninja

Hi @tms 

How about this one?
toBoolean() works.

LOAD CSV WITH HEADERS FROM 'file:///true.csv' AS line
CREATE (:SomeNode {
    booleanField: toBoolean(line.booleanField),
    integerField: toInteger(line.integerField),
    floatField: toFloat(line.floatField),
    stringField: line.stringField
})

 

tms
Graph Buddy

I wish it were that easy. 🙂

Each row that I'm loading is a descriptor used by the rest of the database -- its metadata about a given property elsewhere in the database. Each row has a  "minimumValue" and "maximumValue" property. Each of those can be either boolean, integer, or float.

In order to do explicit coercion like this, I have to create a map that associates a type with each property being described.

At the moment, there is just one property with boolean minimumValue (false) and maximumValue (true). I've therefore solved it by simply patching the DB after the load with the following:

 

 

 

MATCH(mdp:MetaDatapoint {
    propertyName: 'isHotSpot'
})
SET
    mdp.minimumValue = false,
    mdp.maximumValue = true

 

 

 

This works, but it's a hack. A much cleaner solution is to have LOAD CSV recognize boolean values (true/false) based on finding an unquoted true/false as the value of the field.

The current behavior is that it sets minimumValue and maximumValue to the distinguished null value.

For context, here is an example of the data I'm loading. All but the last row load as expected.

"minimumValue","propertyName","datasetID","centiles","maximumValue","metaDatapointID"
0,"dailyCaseCount","zeetix.20220817.nyTimesMetaDataset","[[0,0],[1,0.0],[2,0.0],[3,0.0],[4,0.0],[5,0.0],[6,0.0],[7,0.0],[8,0.0],[9,0.0],[10,0.0],[11,0.0],[12,0.0],[13,0.0],[14,0.0],[15,0.0],[16,0.0],[17,0.0],[18,0.0],[19,0.0],[20,0.0],[21,0.0],[22,0.0],[23,0.0],[24,0.0],[25,0.0],[26,0.0],[27,0.0],[28,0.0],[29,0.0],[30,0.0],[31,0.0],[32,0.0],[33,0.0],[34,0.0],[35,0.0],[36,0.0],[37,0.0],[38,0.0],[39,0.0],[40,0.0],[41,0.0],[42,1.0],[43,1.0],[44,1.0],[45,1.0],[46,1.0],[47,1.0],[48,1.0],[49,1.0],[50,2.0],[51,2.0],[52,2.0],[53,2.0],[54,2.0],[55,3.0],[56,3.0],[57,3.0],[58,3.0],[59,4.0],[60,4.0],[61,4.0],[62,5.0],[63,5.0],[64,5.0],[65,6.0],[66,6.0],[67,7.0],[68,7.0],[69,8.0],[70,9.0],[71,9.0],[72,10.0],[73,11.0],[74,12.0],[75,13.0],[76,14.0],[77,15.0],[78,16.0],[79,17.0],[80,19.0],[81,21.0],[82,22.0],[83,25.0],[84,27.0],[85,30.0],[86,33.0],[87,36.0],[88,41.0],[89,46.0],[90,52.0],[91,59.0],[92,68.0],[93,79.0],[94,95.0],[95,115.0],[96,145.0],[97,191.0],[98,278.0],[99,496.0],[100,193786]]",193786,"zeetix.20220817.nyTimesMetaDataset.dailyCaseCount"
0.0,"dailyCaseCountPerKSquareMile","zeetix.20220817.nyTimesMetaDataset","[[0,0.0],[1,0.0],[2,0.0],[3,0.0],[4,0.0],[5,0.0],[6,0.0],[7,0.0],[8,0.0],[9,0.0],[10,0.0],[11,0.0],[12,0.0],[13,0.0],[14,0.0],[15,0.0],[16,0.0],[17,0.0],[18,0.0],[19,0.0],[20,0.0],[21,0.0],[22,0.0],[23,0.0],[24,0.0],[25,0.0],[26,0.0],[27,0.0],[28,0.0],[29,0.0],[30,0.0],[31,0.0],[32,0.0],[33,0.0],[34,0.0],[35,0.0],[36,0.0],[37,0.0],[38,0.0],[39,0.0],[40,0.0],[41,0.0],[42,0.3864643937793918],[43,0.7593046136487283],[44,1.041008987551094],[45,1.2403516148757852],[46,1.4794190617228424],[47,1.7111429519713621],[48,1.9103127277450955],[49,2.1664599502797444],[50,2.403142830193327],[51,2.6970170990884084],[52,3.0456883712572296],[53,3.384277998131879],[54,3.743705894464931],[55,4.162429082614505],[56,4.592232697615254],[57,4.999745846252816],[58,5.451978068509557],[59,5.99369912379611],[60,6.596099396621808],[61,7.165655626779898],[62,7.830669600557544],[63,8.563721222328477],[64,9.293714846523592],[65,10.085409006558397],[66,11.0039852766677],[67,11.958136954150909],[68,12.937351798014673],[69,14.095376364168143],[70,15.295255794034086],[71,16.561062292779706],[72,17.992801314882627],[73,19.55769193158285],[74,21.286514354560953],[75,23.080044022147124],[76,25.122752560427198],[77,27.37722293272634],[78,29.840494419013712],[79,32.55601960802552],[80,35.654346392162154],[81,39.04399015381558],[82,43.0303799255035],[83,47.34017616590013],[84,52.36237620463216],[85,58.138689844624366],[86,64.85496958940826],[87,72.43117473529088],[88,81.66581938379056],[89,92.46201567937439],[90,105.93932002459859],[91,122.07087198299608],[92,142.63715024409288],[93,169.78978971073875],[94,205.67865598548184],[95,257.446173776362],[96,333.13900224868826],[97,458.54146436788665],[98,698.108974342593],[99,1301.7564445871326],[100,181159.75549314392]]",181159.75549314392,"zeetix.20220817.nyTimesMetaDataset.dailyCaseCountPerKSquareMile"
0.00687262602744986,"cumulativeCaseCountPerKSquareMile","zeetix.20220817.nyTimesMetaDataset","[[0,0.00687262602744986],[1,1.8056977530796015],[2,3.865892199596014],[3,7.053234559113159],[4,11.495335461844364],[5,17.401041206340253],[6,24.974625787710018],[7,34.083685673433976],[8,45.629822836477246],[9,58.26436106863103],[10,74.40679185196025],[11,92.45053672481164],[12,113.22377095749718],[13,135.77074140522717],[14,160.0226203379926],[15,186.64801717168882],[16,216.13443561895497],[17,248.83643338173374],[18,283.10769836945315],[19,322.1883246803568],[20,362.1691330064685],[21,404.7571902369573],[22,451.96765525762174],[23,504.82156832340013],[24,560.1003042012661],[25,621.0793016558364],[26,684.1395752079137],[27,749.747640808052],[28,820.8225630256252],[29,893.420457154832],[30,972.0420548387679],[31,1058.3594782352138],[32,1152.6106726458486],[33,1252.2906775241047],[34,1358.3475814528972],[35,1470.5809146360118],[36,1584.65027770509],[37,1702.7050521772323],[38,1830.6134537262894],[39,1958.8335714463703],[40,2085.3729395003065],[41,2219.2014915497416],[42,2354.921338245425],[43,2493.5806520388965],[44,2636.9527670247003],[45,2782.030349142753],[46,2943.3732770047855],[47,3111.1415007716805],[48,3279.495325162168],[49,3462.1757301481316],[50,3654.9316535748558],[51,3872.127483845892],[52,4099.158296218181],[53,4317.3052733408385],[54,4548.360570986163],[55,4782.267666630222],[56,5018.61395414905],[57,5272.326602410692],[58,5534.170745881019],[59,5818.620607387053],[60,6124.776068726533],[61,6428.007466927236],[62,6749.77607353188],[63,7085.319346550683],[64,7410.534425457674],[65,7778.873538004958],[66,8168.486038442066],[67,8598.466334388799],[68,9045.777059020646],[69,9555.504047243969],[70,10110.403435849526],[71,10679.610140717883],[72,11303.208081138906],[73,12010.98041453275],[74,12771.453832126872],[75,13562.114156628963],[76,14390.388631985088],[77,15299.337375637475],[78,16267.023169525326],[79,17345.8174950131],[80,18578.097609475626],[81,19890.564532344175],[82,21450.99058395474],[83,23333.89507525181],[84,25374.419072194127],[85,27587.150562802435],[86,30298.58986848416],[87,33540.896279713255],[88,37277.06123241652],[89,41804.20616688468],[90,47305.87660752076],[91,53750.86108549062],[92,62242.36186023764],[93,73440.02220643841],[94,88342.9899120614],[95,107069.74599852647],[96,133464.54545345053],[97,176183.20050675844],[98,251898.30245717388],[99,414403.44666902395],[100,9162682.967123741]]",9162682.967123741,"zeetix.20220817.nyTimesMetaDataset.cumulativeCaseCountPerKSquareMile"
-4610.113695084961,"cumulativeDeathCountPer100MPersonMileAnomalies","zeetix.20220817.nyTimesMetaDataset","[[0,-4610.113695084961],[1,-0.05976783129517571],[2,0.0],[3,0.0],[4,0.0],[5,0.0],[6,0.0],[7,0.0],[8,0.0],[9,0.0],[10,0.0],[11,0.0],[12,0.0],[13,0.0],[14,0.0],[15,0.0],[16,0.0],[17,0.0],[18,0.0],[19,0.0],[20,0.0],[21,0.0],[22,0.0],[23,0.0],[24,0.0],[25,0.0],[26,0.0],[27,0.0],[28,0.0],[29,0.0],[30,0.0],[31,0.0],[32,0.0],[33,0.0],[34,0.0],[35,0.0],[36,0.0],[37,0.0],[38,0.0],[39,0.0],[40,0.0],[41,0.0],[42,0.0],[43,0.0],[44,0.0],[45,0.0],[46,0.0],[47,0.0],[48,0.0],[49,0.0],[50,0.0],[51,0.0],[52,0.0],[53,0.0],[54,0.0],[55,0.0],[56,0.0],[57,0.0],[58,0.0],[59,0.0],[60,0.07480111072935358],[61,0.1590838432533772],[62,0.24074655228989394],[63,0.3253538308109221],[64,0.4128828655697703],[65,0.5078119350334305],[66,0.6069402379299618],[67,0.7118593487163432],[68,0.8151994537089979],[69,0.9370574305966528],[70,1.0561042256298847],[71,1.1850994849542076],[72,1.3242735449249494],[73,1.4728726726807329],[74,1.6257355860336522],[75,1.8053158565818705],[76,1.9854204653612726],[77,2.182149185680089],[78,2.39199824167909],[79,2.6237608056706465],[80,2.882982247778818],[81,3.142407264528742],[82,3.452374143033287],[83,3.760300468298965],[84,4.120287965621287],[85,4.516894867007885],[86,4.939085701653874],[87,5.420173527645204],[88,5.958248758647358],[89,6.5628289601527285],[90,7.279485214856209],[91,8.091423271960593],[92,9.079898099892034],[93,10.232357894908318],[94,11.663700622589438],[95,13.52777516389898],[96,16.035740381190067],[97,19.706414259452632],[98,25.958866473356455],[99,41.740637241422625],[100,28429.034453023916]]",28429.034453023916,"zeetix.20220817.nyTimesMetaDataset.cumulativeDeathCountPer100MPersonMileAnomalies"
false,"isHotSpot","zeetix.20220817.hotSpotMetaDataset","",true,"zeetix.20220817.hotSpotMetaDataset.isHotSpot"

Hi @tms 

CSV character representation may or may not have double quotes.
Without double quotes, it is safe to use item-by-item because the data type is indistinguishable.
So I have created an application (csv2cypher) that looks at the csv values and creates a LOAD CSV code.

tms
Graph Buddy


I apologize for the broken nesting. Perhaps a moderator can clean it up

I'm not sure what you mean by "the data type is indistinguishable". Without quotes, the existing LOAD CSV infers the type from the data, at least for numeric (integer and float) types -- in the example I provided above, the minimumValue and maximumValue fields are correctly coerced to integer and float values as expected.

The issue is that the regex (I assume that's the mechanism) isn't able to infer a boolean value. I don't try to pass string types without quotes, so I don't know if that works or not.

I appreciate your attention. This highlights why csv import and export quickly gets hairy. 🙂

Hi @tms 

If the string is not enclosed in double quotes, the boolean value is incomprehensible.
True,"String" (True is a boolean)
True,String (True is boolean or string)

tms
Graph Buddy

Heh. I have written code that says that: the value: true (without quotes) is arbitrarily coerced to the boolean value true. When quotes are present ("true"), then that value is coerced to a string.

I suggest that it isn't "incomprehensible", it is instead arbitrary. The resulting heuristic is "If you want to get a string, then delimit the value with double-quote."

It sounds to me as though this is either an oversight or a bug in the Cypher parser. There is no technical issue that blocks coercing boolean values -- it is no different from recognizing and coercing integer and float values.

FWIW, there is at least one attempt at an RFC:  (https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml)

In the "General" section of that, the last bullet reads:

"In some uses, there is an assumption of strong data typing, with unquoted fields considered to be numeric, and quoted fields considered to be text data."

I suggest merely that there is no technical barrier to treating an unquote field containing true or false as a Boolean value. For extra credit -- the "apoc.export.csv.query" could use the strong typing of the underlying Java representation to not quote numeric and boolean property values.

I suppose at the end of the day it would have been easier for me to export and import json -- that probably moots the question. 🙂