Import via http increasingly slow on a large dataset

I am importing several datasets, a couple of which are large (500K-1.5M documents of the type shown below), and I have an index created with the command

"CREATE CONSTRAINT ON (n:_) ASSERT n.id IS UNIQUE;"

Indexing time goes from a couple of ms per document at the beginning to, currently, 1 s per document (the import is still running).

Is this normal? How can this be prevented?

  • neo4j version, desktop version, browser version
    3.3.0 Community
  • what kind of API / driver do you use
    http /db/data/transaction/commit
  • screenshot of PROFILE or EXPLAIN with boxes expanded (lower right corner)
  • a sample of the data you want to import
"parameters": {
                "operation": "add",
                "dataset": "bag",
                "type": "hg:Building",
                "id": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
                "data": {
                    "uri": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
                    "type": "hg:Building",
                    "validSince": [
                        "1977-01-01",
                        "1977-12-31"
                    ],
                    "geometry": "{\\"type\\":\\"Polygon\\",\\"coordinates\\":[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]}",
                    "dataset": "bag",
                    "validSinceTimestamp": 220924800
                },
                "structure": "node"
            },
            "statement": "MERGE (n:_ {id: {id}})\\nON CREATE\\n   SET n = {data},\\n       n.created = timestamp(),\\n       n.id = {id},\\n       n.dataset = {dataset},\\n       n:`hg:Building`\\nON MATCH\\n   SET n = {data},\\n       n.accessTime = timestamp(),\\n       n.counter = coalesce(n.counter, 0) + 1,\\n       n.id = {id},\\n       n.dataset = {dataset},\\n       n:`hg:Building`\\n   REMOVE n:_VACANT\\nRETURN n\\n"
        },
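For readability, here is the "statement" field from the payload above with its \n escapes expanded (the same Cypher, just reformatted):

MERGE (n:_ {id: {id}})
ON CREATE
   SET n = {data},
       n.created = timestamp(),
       n.id = {id},
       n.dataset = {dataset},
       n:`hg:Building`
ON MATCH
   SET n = {data},
       n.accessTime = timestamp(),
       n.counter = coalesce(n.counter, 0) + 1,
       n.id = {id},
       n.dataset = {dataset},
       n:`hg:Building`
   REMOVE n:_VACANT
RETURN n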
  • which plugins / extensions / procedures do you use
  • neo4j.log and debug.log

I tried to run this from the Browser by setting the parameters (`:params` has to be run on one line). I couldn't get n = {data} to work; it fails with "TypeError: Property values can only be of primitive types or arrays thereof". IMO that is a red herring: if your query works over HTTP, there must be some difference in parameter handling between the Browser and the HTTP call you are doing.

I think that when you change the label from _ (or _VACANT) to hg:Building, you don't have an index or constraint on hg:Building. Try adding a uniqueness constraint on hg:Building.id.

When the label with the constraint is used, the index is used.

When I use hg:Building, it is not: there is a NodeByLabelScan instead.

This will cause things to get slower as the number of nodes with the non-indexed label increases.
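A sketch of the suggested fix (assuming the label is written exactly as hg:Building; the backticks are needed because of the colon in the label name):

CREATE CONSTRAINT ON (n:`hg:Building`) ASSERT n.id IS UNIQUE;

With this in place, the planner can use an index lookup instead of a NodeByLabelScan when matching on hg:Building by id.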

The MERGE I used (note that the hg:Building label must be backtick-quoted, otherwise n:hg:Building sets two separate labels, hg and Building):

MERGE (n:Problem1 {id: {id}})
ON MATCH SET n.accessTime = timestamp(),
             n.counter = coalesce(n.counter, 0) + 1,
             n.id = {id},
             n.dataset = {dataset},
             n.uri = {data}.uri,
             n:`hg:Building`
RETURN n

Params:

:params {operation: "add",dataset: "bag",type: "hg:Building",id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",data: {uri: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",type: "hg:Building",validSince: ["1977-01-01","1977-12-31"],geometry: {type:"Polygon",coordinates:[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]},dataset: "bag",validSinceTimestamp: 220924800},structure: "node"}
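As an aside, and just my assumption: the TypeError above likely comes from geometry being a nested map in these params. Node properties can only be primitive types or arrays thereof, so the HTTP payload's escaped-JSON-string form of geometry is storable, while the Browser's map form is not. Passing geometry as a string avoids the error, e.g. (run the :params line and the query separately; the id and the minimal geometry here are placeholders):

:params {id: "test-id", dataset: "bag", data: {uri: "test-id", dataset: "bag", geometry: "{\"type\":\"Polygon\",\"coordinates\":[]}"}}

MERGE (n:_ {id: {id}})
ON CREATE SET n = {data}, n.id = {id}, n.dataset = {dataset}
RETURN n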

Hi David,
Thank you very much for your reply. Inspired by what you did, I have also run an EXPLAIN query on my DB:

EXPLAIN MERGE (n:_ {id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084"})
ON CREATE
   SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
       n.dataset = "bag",
       n:`hg:Building`

ON MATCH
   SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
       n.dataset = "bag",
       n:`hg:Building`
   REMOVE n:_VACANT

RETURN n

and here is the result:

What I do not understand is where the NodeByLabelScan comes from, unless something is seriously wrong, since I just want to remove the label from the matched node, not from arbitrary nodes in the DB.

Do I have an error in the query?

Furthermore, this problem did not show up before we changed the id from a short string to a full URL. Can that be related to the increased indexing time?

Thanks,
Stefano