Merge of a few thousand relations is very slow


(Norbert Preining) #1

Dear all,

I am trying to get better performance with CSV import/merge of relations. I have a database created with neo4j-import from CSV files, and on a daily basis there will be some updates to the CSV files.

I can easily import the changes using merge with

LOAD CSV WITH HEADERS FROM "file:///some.csv" AS row
  MERGE (c:nType { uuid: row.uuid, name: row.name, revision: toInt(row.revision) }) ;

which works pretty fast, even for CSV files with 160000 entries of which only some have changed.

But when I try to do the same with relations matching onto the uuid part:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///tlpdb/48871/out/edge-contains.csv" AS row
  MATCH (p { uuid: row.`:START_ID`}), (q { uuid: row.`:END_ID` } )
  MERGE (p)-[r:contains]->(q) ;

then it takes almost 12 minutes for a simple CSV file with 3491 new relations:

0 rows available after 708392 ms, consumed after another 0 ms
Created 3491 relationships

I haven't even tried the CSV file containing 164690 lines (most of which are already present).

I have created an index (and it is online) on uuid as well as (name, revision).

This is with Neo4j 3.4.8 running on Debian/sid.

Do I need to set up another index? The uuids are unique across all nodes.

Thanks for any suggestion

Norbert


(Andrew Bowman) #2

Indexes are only used when both the label and property that are indexed are present in your match pattern.

This:
MATCH (p { uuid: row.`:START_ID`}), (q { uuid: row.`:END_ID` } )
doesn't have labels present on either of these, so an index won't be used. It's instead doing an all nodes scan for both, and accessing the properties of all the nodes in your db twice to fulfill this single match.

Add in the label for the index, and double-check by running an EXPLAIN of your query plan.
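As a rough sketch of what that could look like, here is the query from the first post with a label added to both nodes and EXPLAIN in front. The label nType is only an assumption taken from the node MERGE in the first post; substitute whichever label the uuid index was actually created on. USING PERIODIC COMMIT is left out here only to keep the EXPLAIN variant short.

```cypher
// Assumption: the uuid index exists on the label :nType (e.g. created with
// CREATE INDEX ON :nType(uuid)). Replace nType with your real label.
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:///tlpdb/48871/out/edge-contains.csv" AS row
  // With a label on both patterns the planner can use the index on uuid
  MATCH (p:nType { uuid: row.`:START_ID` }), (q:nType { uuid: row.`:END_ID` })
  MERGE (p)-[r:contains]->(q) ;
```

If the index is picked up, the plan should show NodeIndexSeek operators for p and q rather than AllNodesScan.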


(Norbert Preining) #3

Thanks, indeed. But what if the labels (node types) of p and q can be any of a strict subset of all node types? I have 5 node types: Package, Collection, Scheme, ... and I want to restrict the search to, say, only Package and Collection.

I found a faster way but it collapses all the node types into one, and distinguishes them via attributes. That way the index runs over all possible nodes and the merge is very fast.

Is there another way to speed this up without collapsing the node types, while still searching across several of them? That is, emulating something like an index over multiple node types (labels)?

Thanks


(Michael Hunger) #4

You can use a higher level label (you can use multiple labels for nodes), e.g. :Component
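A minimal sketch of that idea, assuming the labels Package and Collection from the previous post and Neo4j 3.4 index syntax. Run each statement separately, since schema changes and data changes cannot share a transaction.

```cypher
// Add the shared label to the node types that should be searched together;
// the existing labels are kept, so nothing is collapsed
MATCH (n) WHERE n:Package OR n:Collection
SET n:Component ;

// One index over the shared label covers both node types
CREATE INDEX ON :Component(uuid) ;

// The relationship merge can then look up both ends via that index
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///tlpdb/48871/out/edge-contains.csv" AS row
  MATCH (p:Component { uuid: row.`:START_ID` }), (q:Component { uuid: row.`:END_ID` })
  MERGE (p)-[r:contains]->(q) ;
```

Nodes created by later imports would need the :Component label as well, otherwise the MATCH will not find them.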


(Norbert Preining) #5

Thanks, yes that is what I am going for (thus my other question about adding multiple labels).