Tip: Avoiding Slow & Messy Conditionals (or: splitting input) in Cypher for bulk import LOAD CSV?

etl
cypher
load-csv
import

(Alec Muffett) #1

We all probably know that conditional tests in Cypher break the "lightning-fast" CSV loader, and you've probably read about the:

WITH CASE WHEN <condition> THEN [1] ELSE [] END AS foo FOREACH (x IN foo | SET <something>)

...hack.
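Spelled out (column and property names here are illustrative, not from any real schema), the hack reads something like:

```
// CASE yields a one-element or empty list, and FOREACH then
// runs its SET clause once or zero times accordingly
LOAD CSV WITH HEADERS FROM "file:/tmp/foo.csv" AS foo
MERGE (x:LabelX {prop:foo.column_x})
WITH x, foo, CASE WHEN foo.column_y <> "" THEN [1] ELSE [] END AS flag
FOREACH (ignored IN flag | SET x.extra = foo.column_y)
```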

The above hack is such a nightmare for performance tuning (and for syntax/sanity) that I have given up on it; instead I wrote a script to generate my Cypher code, making several (~10) passes over the same 1 GB CSV file, using a template similar to this:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_y <> "")
MERGE (x:LabelX {prop:foo.column_x})
MERGE (y:LabelY {prop:foo.column_y})
MERGE (x)-[xy:RelationXY]->(y)
// ...etc...
;
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_z <> "")
MERGE (x:LabelX {prop:foo.column_x})
MERGE (z:LabelZ {prop:foo.column_z})
MERGE (x)-[xz:RelationXZ]->(z)
// ...etc...
;
// ...repeat ad nauseam; make sure to use appropriate INDEX-es...
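For the labels in the template above, the indexes that comment alludes to would be something like this (3.x-era syntax, matching the USING PERIODIC COMMIT style used here):

```
CREATE INDEX ON :LabelX(prop);
CREATE INDEX ON :LabelY(prop);
CREATE INDEX ON :LabelZ(prop);
```

Without them, each MERGE degenerates into a label scan per row, which dominates load time on files of this size.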

I can't speak to the relative performance of repeated LOAD CSV ... AS foo WITH foo WHERE ... versus breaking the CSV out into separate X-Y and X-Z relationship files and importing those with no conditionals; but this is a weekly ETL process, and pre-prepping the data as I would for a one-off import would probably eat into any time saved anyway.
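A generator script of the kind described above might be sketched as follows; this is a minimal Python sketch, and the spec tuples, template text, and function names are all illustrative rather than taken from the actual script:

```python
# Hypothetical generator: emits one LOAD CSV pass per relationship spec.
TEMPLATE = """USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.{col_a} <> "") AND (foo.{col_b} <> "")
MERGE (a:{label_a} {{prop:foo.{col_a}}})
MERGE (b:{label_b} {{prop:foo.{col_b}}})
MERGE (a)-[:{rel}]->(b)
;"""

# (source column, label) pairs plus the relationship type for each pass
RELATIONSHIPS = [
    ("column_x", "LabelX", "column_y", "LabelY", "RelationXY"),
    ("column_x", "LabelX", "column_z", "LabelZ", "RelationXZ"),
]

def generate_passes(specs):
    """Render one Cypher pass per spec and join them into a script."""
    return "\n".join(
        TEMPLATE.format(col_a=a, label_a=la, col_b=b, label_b=lb, rel=r)
        for (a, la, b, lb, r) in specs
    )

print(generate_passes(RELATIONSHIPS))
```

Adding a tenth pass is then one more tuple in the list rather than another hand-edited block of Cypher.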

Honestly, I feel that this is an area where Neo4j could do better; the fact that the CASE/WHEN/FOREACH hack exists points towards a user need, even if it degrades performance. I would love to see more documentation addressing this kind of user need in a Neo4j-friendly manner. Hopefully the above is one such.

-a


(Michael Hunger) #2

Thanks a lot for your feedback; you're totally right about denormalized import files.

I would even go one step further and split node-creation and relationship creation.
At least node-creation can then also be parallelized.
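Reusing the template from post #1, that split might look like the following sketch (same illustrative labels and columns as above, not a tested script):

```
// Pass 1: nodes only; separate passes for different labels
// can safely run in parallel
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE foo.column_x <> ""
MERGE (:LabelX {prop:foo.column_x});

// Pass 2: relationships only, MATCHing the pre-created nodes
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_y <> "")
MATCH (x:LabelX {prop:foo.column_x})
MATCH (y:LabelY {prop:foo.column_y})
MERGE (x)-[:RelationXY]->(y);
```

Because pass 2 only MATCHes nodes, it never takes the node-creation path of MERGE, which keeps it cheaper and avoids accidentally creating half-populated nodes.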

Did you create a generalized script that basically takes a CSV -> graph mapping (similar to the import tool)? Because you would also want to set some columns as properties on the nodes.

Some time ago, my colleague @William_Lyon built an online tool to do this, generating the appropriate Cypher. You could even feed his tool a sample of your file (head -10 file.csv), then grab the generated Cypher scripts and use them with the full file.

https://neo4j-csv-import.herokuapp.com/

We also have a procedure in APOC that does this for you: apoc.import.csv

https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_import_csv

Eventually, I'd love to have a proper graph model coming from a modeling tool, to which you'd map your input (e.g. CSV, JSON, RDBMS), ideally visually.


(Alec Muffett) #3

Hi Michael!

I'm still low on coffee this morning, but if I MERGE all the nodes in one set of passes over the input file, and then MATCH them in a second set of passes to MERGE the relationships, you reckon that will be faster - I/O bandwidth permitting?

I'm asking because I am on a quad-core machine and typically do not see more than one core being used, so I presume there's some big threading lock getting in the way?


(Michael Hunger) #4

You can do the merges of different labels in parallel.
It's also safer / simpler.

Neo4j locks both nodes for a relationship create/delete.