We import approx 250,000 records into our Neo4J database on a daily basis. The records are split into approx 30 datasets, each of which focuses on a single type of data, e.g. Systems, People or Servers.
The processing was not as fast as we would have liked!
In an earlier attempt to improve performance we ran a read-before-write process to avoid writing unchanged records. Unfortunately, that led us to use individual writes. We now just use a 'batched' process where each CYPHER transaction contains approx 1,000 records.
We have gained a 100x performance increase, but we are wondering whether we have written the optimum CYPHER for those 'batch' transactions.
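For context, the batching could also be pushed into the database itself with apoc.periodic.iterate rather than chunked client-side. A minimal sketch, assuming the whole day's records are passed in as a single `$payloads` parameter (node merging only, for brevity):

```
CALL apoc.periodic.iterate(
  // Outer statement: stream one row per payload record.
  'UNWIND $payloads AS payload RETURN payload',
  // Inner statement: runs per row, committed in batches of 1,000.
  'CALL apoc.merge.node([payload._type], {code: payload.code}) YIELD node AS p
   SET p += payload.properties',
  {batchSize: 1000, parallel: false, params: {payloads: $payloads}}
)
```

Our client-side chunking achieves much the same effect, so the rest of this question assumes the chunked approach.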
A typical batch of data looks like this. We chose to split properties from relationships so we can use UNWIND:
```
[
  {
    _type: 'Testing',
    code: `01-parent-batch-1000`,
    properties: {
      name: `01-parent-name-batch-1000`,
      description: `this is batch test 01 of 1000 snippets`,
      lifecycleStage: `01-preproduction`,
      serviceTier: `01-unsupported`,
    },
    relationships: [
      {
        name: 'CONNECTED_TO',
        _type: 'Testing',
        code: `01-connected-child-batch-1000`,
        rich: {
          propOne: `first rich prop for 01`,
          propTwo: `second rich prop for 01`
        }
      },
      {
        name: 'ALSO_TO',
        _type: 'Testing',
        code: `01-also-child-batch-1000`,
      },
    ],
  },
  .....
]
```
Our CYPHER looks like this:
```
UNWIND $payloads AS payload
// Merge the parent node by dynamic label and code, then refresh its properties.
CALL apoc.merge.node([payload._type], {code: payload.code}) YIELD node AS p
SET p += payload.properties
WITH p, payload
UNWIND payload.relationships AS relationship
// Merge each child node the same way.
CALL apoc.merge.node([relationship._type], {code: relationship.code}) YIELD node AS c
WITH p, c, relationship
// Merge the relationship, using the rich properties as its identity.
CALL apoc.merge.relationship(p, relationship.name, relationship.rich, null, c, null) YIELD rel
RETURN p, rel, c
```
The key reason we have questioned the syntax/performance of the above is that it doesn't handle deletions of relationships: it keeps merging new relationships in alongside the old ones instead of replacing the existing relationships with the newer set.
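One direction we have sketched (not production code) is to delete stale relationships before re-merging. This assumes each payload carries the complete set of outgoing relationships for its node; it also merges on relationship type alone and applies the rich properties as updates, rather than using them as the relationship's identity:

```
UNWIND $payloads AS payload
CALL apoc.merge.node([payload._type], {code: payload.code}) YIELD node AS p
SET p += payload.properties
WITH p, payload
// Delete outgoing relationships the new payload no longer mentions,
// comparing on (relationship type, child code) pairs.
OPTIONAL MATCH (p)-[stale]->(old)
WHERE NOT [type(stale), old.code] IN [r IN payload.relationships | [r.name, r.code]]
DELETE stale
WITH DISTINCT p, payload
UNWIND payload.relationships AS relationship
CALL apoc.merge.node([relationship._type], {code: relationship.code}) YIELD node AS c
// Merge on type alone; rich props are applied on create and on match,
// so changed values update in place instead of creating a parallel relationship.
CALL apoc.merge.relationship(p, relationship.name, {},
  coalesce(relationship.rich, {}), c, coalesce(relationship.rich, {})) YIELD rel
RETURN p, rel, c
```

The question marks for us are whether the OPTIONAL MATCH scan per parent node is affordable at 250,000 records, and whether deleting by (type, code) is safe when other datasets also write relationships onto the same nodes.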
Has anyone developed a generic importer, or can anyone see any improvements to the above?
Thanks
Geoff