Tip: Avoiding Slow & Messy Conditionals (or: splitting input) in Cypher for bulk import LOAD CSV?

etl
cypher
load-csv
import

(Alec Muffett) #1

We all probably know that conditional tests in Cypher break the "lightning-fast" CSV loader, and you've probably read about the:

WITH CASE WHEN <condition> THEN [1] ELSE [] END AS foo FOREACH (x IN foo | SET <something>)

...hack.
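Spelled out (column and property names here are illustrative, not from any real schema), the hack reads something like:

```
// CASE yields a one-element or empty list, and FOREACH then
// runs its SET clause once or zero times accordingly
LOAD CSV WITH HEADERS FROM "file:/tmp/foo.csv" AS foo
MERGE (x:LabelX {prop:foo.column_x})
WITH x, foo, CASE WHEN foo.column_y <> "" THEN [1] ELSE [] END AS flag
FOREACH (ignored IN flag | SET x.extra = foo.column_y)
```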

The above hack is such a nightmare for performance tuning (and for syntax/sanity) that I have given up on it; instead I wrote a script to generate my Cypher code, making several (~10) passes over the same 1 GB CSV file, using a template similar to this:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_y <> "")
MERGE (x:LabelX {prop:foo.column_x})
MERGE (y:LabelY {prop:foo.column_y})
MERGE (x)-[xy:RelationXY]->(y)
// ...etc...
;
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_z <> "")
MERGE (x:LabelX {prop:foo.column_x})
MERGE (z:LabelZ {prop:foo.column_z})
MERGE (x)-[xz:RelationXZ]->(z)
// ...etc...
;
// ...repeat ad nauseam; make sure to use appropriate INDEX-es...
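For the labels in the template above, the indexes that comment alludes to would be something like this (3.x-era syntax, matching the USING PERIODIC COMMIT style used here):

```
CREATE INDEX ON :LabelX(prop);
CREATE INDEX ON :LabelY(prop);
CREATE INDEX ON :LabelZ(prop);
```

Without them, each MERGE degenerates into a label scan per row, which dominates load time on files of this size.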

I can't speak to the relative performance of repeated LOAD CSV ... AS foo WITH foo WHERE ... versus breaking the CSV out into separate X-Y and X-Z relationship files and importing those with no conditionals; but this is a weekly ETL process, and pre-prepping the data as I would for a one-off import would probably eat into any time saved anyway.
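A generator script of the kind described above might be sketched as follows; this is a minimal Python sketch, and the spec tuples, template text, and function names are all illustrative rather than taken from the actual script:

```python
# Hypothetical generator: emits one LOAD CSV pass per relationship spec.
TEMPLATE = """USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.{col_a} <> "") AND (foo.{col_b} <> "")
MERGE (a:{label_a} {{prop:foo.{col_a}}})
MERGE (b:{label_b} {{prop:foo.{col_b}}})
MERGE (a)-[:{rel}]->(b)
;"""

# (source column, label) pairs plus the relationship type for each pass
RELATIONSHIPS = [
    ("column_x", "LabelX", "column_y", "LabelY", "RelationXY"),
    ("column_x", "LabelX", "column_z", "LabelZ", "RelationXZ"),
]

def generate_passes(specs):
    """Render one Cypher pass per spec and join them into a script."""
    return "\n".join(
        TEMPLATE.format(col_a=a, label_a=la, col_b=b, label_b=lb, rel=r)
        for (a, la, b, lb, r) in specs
    )

print(generate_passes(RELATIONSHIPS))
```

Adding a tenth pass is then one more tuple in the list rather than another hand-edited block of Cypher.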

Honestly, I feel that this is an area where Neo4j could do better; the fact that the CASE/WHEN/FOREACH hack exists points towards a user need, even if it degrades performance. I would love to see more documentation addressing this kind of user need in a Neo4j-friendly manner. Hopefully the above is one such.

-a


(Michael Hunger) #2

Thanks a lot for your feedback; you're totally right about denormalized import files.

I would even go one step further and split node-creation and relationship creation.
At least node-creation can then also be parallelized.
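Reusing the template from post #1, that split might look like the following sketch (same illustrative labels and columns as above, not a tested script):

```
// Pass 1: nodes only; separate passes for different labels
// can safely run in parallel
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE foo.column_x <> ""
MERGE (:LabelX {prop:foo.column_x});

// Pass 2: relationships only, MATCHing the pre-created nodes
USING PERIODIC COMMIT LOAD CSV WITH HEADERS
FROM "file:/tmp/foo.csv" AS foo
WITH foo WHERE (foo.column_x <> "") AND (foo.column_y <> "")
MATCH (x:LabelX {prop:foo.column_x})
MATCH (y:LabelY {prop:foo.column_y})
MERGE (x)-[:RelationXY]->(y);
```

Because pass 2 only MATCHes nodes, it never takes the node-creation path of MERGE, which keeps it cheaper and avoids accidentally creating half-populated nodes.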

Did you create a generalized script that basically takes a CSV -> graph mapping (similar to the import tool)? Because you would also want to set some columns as properties on the nodes.

Some time ago, my colleague @William_Lyon built an online tool to do this, generating the appropriate Cypher. You could even feed his tool a sample of your file (head -10 file.csv), then grab the generated Cypher scripts and use them with the full file.

https://neo4j-csv-import.herokuapp.com/

We also have a procedure in APOC that does this for you: apoc.import.csv

https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_import_csv

Eventually, I'd love to have a proper graph model coming from a modeling tool, to which you'd map your input (e.g. CSV, JSON, RDBMS), ideally visually.


(Alec Muffett) #3

Hi Michael!

I'm still low on coffee this morning, but if I MERGE all the nodes in one set of passes over the input file, and then MATCH them in a second set of passes to MERGE the relationships, you reckon that will be faster - I/O bandwidth permitting?

I'm asking because I am on a quad-core machine and typically do not see more than one core being used, so I presume there's some big threading lock getting in the way?


(Michael Hunger) #4

You can do the merges of different labels in parallel.
It's also safer / simpler.

Neo4j locks both nodes for a relationship create/delete.