Hey all,
I am using Neo4j 3.5.6 Community Edition and APOC 3.5.0.2 on a 16-core, 64 GB machine with an additional 64 GB of swap (which is never used, but is there for safety).
Memory on the machine is split 20/20/20/4 as described in:
...this seems stable now; it's small, but this is a proof-of-concept machine.
I am importing about 500M objects and about 3B relationships, with about 100 relationship types and about 50 labels/object types; the imports come from $big gigabytes of compressed, database-generated CSV files.
The Cypher code for importing the data is statically generated by a Python script, using templates, because the data warehouse produces (literally) 10,000 CSV files, and each one has to have its own LOAD CSV statement, since LOAD CSV cannot be iterated over input filenames (e.g. via UNWIND).
These templated Cypher scripts (some over 5,000 lines long) contain code like:
USING PERIODIC COMMIT 256
LOAD CSV WITH HEADERS
FROM 'file:/path/foo001.csv.gz' AS batch
MERGE (a:Account { uid:toInteger(batch.INTEGER__ID) })
SET
a.status = toInteger(batch.INTEGER__STATUS);
or
USING PERIODIC COMMIT 256
LOAD CSV WITH HEADERS
FROM 'file:/path/bar001.csv' AS batch
MATCH (x:Account { uid:toInteger(batch.INTEGER__X) })
MATCH (y:Account { uid:toInteger(batch.INTEGER__Y) })
MERGE (x)-[edge:Linky]->(y)
ON CREATE SET edge.reason = batch.REASON
ON MATCH SET edge.reason = batch.REASON;
...simply repeated for files foo001 through foo300, and so forth.
I am aware of, and am considering eventual migration to, neo4j-admin import, as described here:
...which sounds amazing, but I want to check that it fits my use case, which includes updating pre-existing counters (ON MATCH SET edge.weight = batch.more_weight + edge.weight) as well as other stuff like that.
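For concreteness, here is roughly what such a counter-updating statement looks like in my current LOAD CSV world; the file and column names below are illustrative, but the shape follows the examples above:
USING PERIODIC COMMIT 256
LOAD CSV WITH HEADERS
FROM 'file:/path/baz001.csv' AS batch
MATCH (x:Account { uid:toInteger(batch.INTEGER__X) })
MATCH (y:Account { uid:toInteger(batch.INTEGER__Y) })
MERGE (x)-[edge:Linky]->(y)
// CSV fields arrive as strings, hence toInteger(); ON MATCH accumulates the counter
ON CREATE SET edge.weight = toInteger(batch.more_weight)
ON MATCH SET edge.weight = edge.weight + toInteger(batch.more_weight);
It is that accumulate-on-match behaviour that I want to be sure still fits after any migration.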
Also, I am using small transactions of 256 rows (hence the USING PERIODIC COMMIT 256 above), having read in a previous posting that small transactions tend to reduce collisions.
Further, I am following the advice of (I think it was) @michael.hunger, elsewhere in this community, to ORDER/GROUP my CSV data in order to minimise deadlock issues across parallel transactions.
I am also following Michael's advice to load distinct labels, and distinct relationship types, in parallel:
WITH ALL OF THIS CONTEXT: here's the question:
- I am collecting CSVs and importing them in phases, with the loads within each phase executed in parallel
- in any given phase, the sets of objects (or relationships) being imported are completely disjoint
- that is to say, I am importing (:A), (:B), (:C)-[:foo]->(:D) ... and in any given batch, none of A/B/C/foo/D are ever of the same type, and for each of A/B/C/foo/D the LOAD CSV statements are run serially, one per shard/CSV input file.
In theory, there should be no lock contention? But no, I am wrong; it appears that LOAD CSV can tread upon itself:
LockClient[13625877] can't wait on resource RWLock[NODE(93708867), hash=515089118] since => LockClient[13625877] <-[:HELD_BY]- RWLock[NODE(469290148), hash=686715308] <-[:WAITING_FOR]- LockClient[13626010] <-[:HELD_BY]- RWLock[NODE(93708867), hash=515089118]
LockClient[13626299] can't wait on resource RWLock[NODE(469054858), hash=2000485859] since => LockClient[13626299] <-[:HELD_BY]- RWLock[NODE(469317257), hash=1199527701] <-[:WAITING_FOR]- LockClient[13626312] <-[:HELD_BY]- RWLock[NODE(469054858), hash=2000485859]
LockClient[13630461] can't wait on resource RWLock[RELATIONSHIP(2319588353), hash=477452471] since => LockClient[13630461] <-[:HELD_BY]- RWLock[NODE(411438449), hash=1187551531] <-[:WAITING_FOR]- LockClient[13630397] <-[:HELD_BY]- RWLock[RELATIONSHIP(2319588353), hash=477452471]
NB: 2x NODE deadlocks + 1x RELATIONSHIP deadlock in the examples above.
...I have perhaps a dozen of these from the latest [edit: small, delta] import, and it seems that the problem occurs when a node is "hot" and has several (tens? hundreds?) of thousands of relationships pointing at it; this is the sort of thing/requirement that I have discussed previously:
I am not sure how I can mitigate this, if at all; possibly it is related to the issue described here:
...or perhaps it's because of the sorting/grouping: the number of relationships linking
(foo)-[:Bar]-(where:City {name:"London"})
...greatly exceeds one or more transactions' worth of space.
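To make that concrete, here is a hypothetical statement of the shape I have in mind (invented file and column names). As I understand it, the relationship MERGE has to take write locks on both end nodes, so if most rows resolve to the same hot :City node, every 256-row batch in every loader that touches that node queues on the same lock:
USING PERIODIC COMMIT 256
LOAD CSV WITH HEADERS
FROM 'file:/path/qux001.csv' AS batch
MATCH (foo:Account { uid:toInteger(batch.INTEGER__ID) })
MATCH (where:City { name:batch.CITY_NAME })
// most rows carry the same value here, e.g. name = "London"
MERGE (foo)-[:Bar]->(where);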
The annoying thing here is that I apparently cannot fix this; all of the commentary I see online about deadlocks "solves" them by saying something like "simply trap the exception in Java and re-run the transaction", but, as above, I don't have any of that.
Hence this posting, with its vast amounts of context. Have I missed anything, please, or is anything in the above somehow a stupid choice, given what I am trying to achieve?
Thanks!