We all probably know that conditional tests in Cypher break the "lightning-fast" CSV loader, and you've probably read about the:
WITH CASE WHEN <condition> THEN [1] ELSE [] END AS foo
FOREACH (x IN foo | SET <something>)
...hack, as detailed on:
- Neo4j / Cypher - Conditional set/create/etc statement based on count (or any previous query statement) - Stack Overflow
- Creating Conditional Statements with Cypher
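For reference, a full instance of the hack looks something like this (using hypothetical column/label names matching the import below) — the FOREACH body only runs when the CASE yields a non-empty list:

```
LOAD CSV WITH HEADERS FROM "file:/tmp/foo.csv" AS foo
WITH foo, CASE WHEN foo.column_x <> "" THEN [1] ELSE [] END AS maybe
FOREACH (ignored IN maybe | MERGE (x:LabelX {prop:foo.column_x}))
```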
The above hack is such a nightmare for performance tuning (and syntax/sanity) that I have given up on it, and instead wrote a script to generate my Cypher code, making several (~10) passes over the same 1 GB CSV file with a template similar to this:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_y <> "")
MERGE (x:LabelX {prop:foo.column_x})
MERGE (y:LabelY {prop:foo.column_y})
MERGE (x)-[xy:RelationXY]->(y)
// ...etc...
;
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_z <> "")
MERGE (x:LabelX {prop:foo.column_x})
MERGE (z:LabelZ {prop:foo.column_z})
MERGE (x)-[xz:RelationXZ]->(z)
// ...etc...
;
// ...repeat ad nauseam; make sure to use appropriate INDEX-es...
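For example, for the labels above the supporting indexes might look like this (CREATE INDEX ON is the legacy 3.x syntax; newer releases use CREATE INDEX FOR (n:Label) ON (n.prop)):

```
CREATE INDEX ON :LabelX(prop);
CREATE INDEX ON :LabelY(prop);
CREATE INDEX ON :LabelZ(prop);
```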
I can't speak to the relative performance of repeated LOAD CSV ... AS foo WITH foo WHERE ...
compared to breaking out the CSV into separate X-Y and X-Z relationship files and importing those with no conditionals; but this is for an ETL process which I run weekly, and pre-prepping the data as you would for a one-off import would probably eat into any time saved anyway.
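If you did want to try the pre-splitting route, it's a small script; here's a minimal sketch in Python (column and file names are assumptions matching the example above), which writes one CSV per relationship and drops rows where either endpoint column is empty:

```python
import csv

def split_relationship_csvs(src_path, xy_path, xz_path):
    """Split one wide CSV into per-relationship CSVs, skipping rows
    where either endpoint column is empty (so no WHERE is needed
    in the LOAD CSV statements)."""
    with open(src_path, newline="") as src, \
         open(xy_path, "w", newline="") as xy, \
         open(xz_path, "w", newline="") as xz:
        reader = csv.DictReader(src)
        xy_writer = csv.DictWriter(xy, fieldnames=["column_x", "column_y"])
        xz_writer = csv.DictWriter(xz, fieldnames=["column_x", "column_z"])
        xy_writer.writeheader()
        xz_writer.writeheader()
        for row in reader:
            # Row goes to a relationship file only if both endpoints exist.
            if row["column_x"] and row["column_y"]:
                xy_writer.writerow({"column_x": row["column_x"],
                                    "column_y": row["column_y"]})
            if row["column_x"] and row["column_z"]:
                xz_writer.writerow({"column_x": row["column_x"],
                                    "column_z": row["column_z"]})
```

Each output file can then be loaded with a plain, condition-free LOAD CSV statement.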
Honestly, I feel that this is an area where Neo4j could do better; the fact that the CASE/WHEN/FOREACH hack exists points to a real user need, even if it degrades performance. I would love to see more documentation addressing this kind of need in a Neo4j-friendly manner. Hopefully the above is one such approach.
-a