I am trying to get better performance when importing/merging relationships from CSV. I have a database created with neo4j-import from CSV files, and the CSV files will receive some updates on a daily basis.
I can easily import the node changes using MERGE:
LOAD CSV WITH HEADERS FROM "file:///some.csv" AS row
MERGE (c:nType { uuid: row.uuid, name: row.name, revision: toInt(row.revision) }) ;
This runs pretty fast, even for CSV files with 160000 entries of which only some have changed.
But when I try to do the same for relationships, matching the nodes on their uuid:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///tlpdb/48871/out/edge-contains.csv" AS row
MATCH (p { uuid: row.`:START_ID`}), (q { uuid: row.`:END_ID` } )
MERGE (p)-[r:contains]->(q) ;
it takes almost 12 minutes for a simple CSV file with 3491 new relationships:
0 rows available after 708392 ms, consumed after another 0 ms
Created 3491 relationships
I haven't even tried the full CSV file containing 164690 lines yet (most of which are already present in the database).
I have created an index (and it is online) on uuid as well as (name, revision).
This is with Neo4j 3.4.8 running on Debian/sid.
Do I need to set up another index? The uuids are unique across all nodes.
Indexes are only used when both the label and property that are indexed are present in your match pattern.
This: MATCH (p { uuid: row.`:START_ID` }), (q { uuid: row.`:END_ID` })
doesn't have labels present on either of these, so an index won't be used. It's instead doing an all nodes scan for both, and accessing the properties of all the nodes in your db twice to fulfill this single match.
Add in the label for the index, and double-check by running an EXPLAIN of your query plan.
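As a rough sketch of that advice, the query could look like the following. It assumes the nodes carry the label nType from the question's first query and that the existing uuid index was created on that label; substitute whatever label your index is actually defined on.

```cypher
// Assumption: the uuid index is defined on :nType(uuid).
// EXPLAIN only shows the plan; it does not execute the import.
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:///tlpdb/48871/out/edge-contains.csv" AS row
MATCH (p:nType { uuid: row.`:START_ID` })
MATCH (q:nType { uuid: row.`:END_ID` })
MERGE (p)-[r:contains]->(q);
```

If the index is being picked up, the plan should show an index seek operator (such as NodeIndexSeek) for p and q rather than AllNodesScan. Once the plan looks right, drop the EXPLAIN and put USING PERIODIC COMMIT 1000 back in front of LOAD CSV.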
Thanks, indeed. But what if the labels (node types) of p and q can come from a strict subset of all node types? I have 5 node types (Package, Collection, Scheme, ...) and I want to restrict the search to, say, only Package and Collection.
I found a faster way, but it collapses all the node types into one and distinguishes them via attributes. That way a single index covers all possible nodes and the merge is very fast.
Is there another way to speed this up without collapsing the node types, while still searching across node types? That is, emulating something like an index over multiple node types (labels)?
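One possible approach, sketched here under assumptions rather than taken from the thread: since a node can carry several labels, the nodes of interest could get an additional shared label, with the index defined on that label. The label name HasUuid below is hypothetical; Package and Collection are the labels mentioned above. Run the statements separately, because schema changes and data changes cannot share a transaction.

```cypher
// Hypothetical shared label HasUuid, added on top of the existing labels.
// On a large database this update may need to be done in batches.
MATCH (n) WHERE n:Package OR n:Collection
SET n:HasUuid;

// Index on the shared label, so one lookup covers both node types.
CREATE INDEX ON :HasUuid(uuid);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///tlpdb/48871/out/edge-contains.csv" AS row
MATCH (p:HasUuid { uuid: row.`:START_ID` })
MATCH (q:HasUuid { uuid: row.`:END_ID` })
MERGE (p)-[r:contains]->(q);
```

The original labels stay in place, so queries that need a specific node type can still match on :Package or :Collection. New nodes would need the shared label added when they are created or merged.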