Import via http increasingly slow on a large dataset

I am importing several datasets, a couple of which are large (500K-1.5M documents of the type shown below), and I have an index created with the command

"CREATE CONSTRAINT ON (n:_) ASSERT n.id IS UNIQUE;"

Indexing time goes from a couple of ms per document at the beginning to, currently, 1 s per document (the import is still running).

Is this normal? How can this be prevented?

  • neo4j version, desktop version, browser version
    3.3.0 Community
  • what kind of API / driver do you use
    http /db/data/transaction/commit
  • screenshot of PROFILE or EXPLAIN with boxes expanded (lower right corner)
  • a sample of the data you want to import
"parameters": {
                "operation": "add",
                "dataset": "bag",
                "type": "hg:Building",
                "id": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
                "data": {
                    "uri": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
                    "type": "hg:Building",
                    "validSince": [
                        "1977-01-01",
                        "1977-12-31"
                    ],
                    "geometry": "{\\"type\\":\\"Polygon\\",\\"coordinates\\":[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]}",
                    "dataset": "bag",
                    "validSinceTimestamp": 220924800
                },
                "structure": "node"
            },
            "statement": "MERGE (n:_ {id: {id}})\\nON CREATE\\n   SET n = {data},\\n       n.created = timestamp(),\\n       n.id = {id},\\n       n.dataset = {dataset},\\n       n:`hg:Building`\\nON MATCH\\n   SET n = {data},\\n       n.accessTime = timestamp(),\\n       n.counter = coalesce(n.counter, 0) + 1,\\n       n.id = {id},\\n       n.dataset = {dataset},\\n       n:`hg:Building`\\n   REMOVE n:_VACANT\\nRETURN n\\n"
        },
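For readability, here is the "statement" field from the payload above with its \n escapes expanded (the same Cypher, just reformatted):

MERGE (n:_ {id: {id}})
ON CREATE
   SET n = {data},
       n.created = timestamp(),
       n.id = {id},
       n.dataset = {dataset},
       n:`hg:Building`
ON MATCH
   SET n = {data},
       n.accessTime = timestamp(),
       n.counter = coalesce(n.counter, 0) + 1,
       n.id = {id},
       n.dataset = {dataset},
       n:`hg:Building`
   REMOVE n:_VACANT
RETURN n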
  • which plugins / extensions / procedures do you use
  • neo4j.log and debug.log

I tried to run this from the Browser by setting the parameters (`:params` has to be run on one line). I couldn't get n = {data} to work; it fails with "TypeError: Property values can only be of primitive types or arrays thereof". IMO that is a red herring: if your query works over HTTP, there must be some difference in parameter handling between the Browser and the HTTP call you are doing.

I think that when you change the label from _ (or _VACANT) to hg:Building, you don't have an index or constraint on hg:Building. Try adding a uniqueness constraint on hg:Building.id.

When the label with the constraint is used, the index is used.

When I use hg:Building, it is not: there is a NodeByLabelScan instead.

This will cause things to get slower as the number of nodes with the non-indexed label increases.
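A sketch of the suggested fix (assuming the label is written exactly as hg:Building; the backticks are needed because of the colon in the label name):

CREATE CONSTRAINT ON (n:`hg:Building`) ASSERT n.id IS UNIQUE;

With this in place, the planner can use an index lookup instead of a NodeByLabelScan when matching on hg:Building by id.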

The MERGE I used (note that the hg:Building label must be backtick-quoted, otherwise n:hg:Building sets two separate labels, hg and Building):

MERGE (n:Problem1 {id: {id}})
ON MATCH SET n.accessTime = timestamp(),
             n.counter = coalesce(n.counter, 0) + 1,
             n.id = {id},
             n.dataset = {dataset},
             n.uri = {data}.uri,
             n:`hg:Building`
RETURN n

Params:

:params {operation: "add",dataset: "bag",type: "hg:Building",id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",data: {uri: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",type: "hg:Building",validSince: ["1977-01-01","1977-12-31"],geometry: {type:"Polygon",coordinates:[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]},dataset: "bag",validSinceTimestamp: 220924800},structure: "node"}
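As an aside, and just my assumption: the TypeError above likely comes from geometry being a nested map in these params. Node properties can only be primitive types or arrays thereof, so the HTTP payload's escaped-JSON-string form of geometry is storable, while the Browser's map form is not. Passing geometry as a string avoids the error, e.g. (run the :params line and the query separately; the id and the minimal geometry here are placeholders):

:params {id: "test-id", dataset: "bag", data: {uri: "test-id", dataset: "bag", geometry: "{\"type\":\"Polygon\",\"coordinates\":[]}"}}

MERGE (n:_ {id: {id}})
ON CREATE SET n = {data}, n.id = {id}, n.dataset = {dataset}
RETURN n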

Hi David,
Thank you very much for your reply. Inspired by what you did, I have also run an EXPLAIN query on my DB:

EXPLAIN MERGE (n:_ {id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084"})
ON CREATE
   SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
       n.dataset = "bag",
       n:`hg:Building`

ON MATCH
   SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
       n.dataset = "bag",
       n:`hg:Building`
   REMOVE n:_VACANT

RETURN n

and here is the result:

What I do not understand is where the NodeByLabelScan comes from, unless something is seriously wrong, since I just want to remove the label from the matched node, not from arbitrary nodes in the DB.

Do I have an error in the query?

Furthermore, this problem did not show up before we changed the id from a short string to a full URL. Can that be related to the increased indexing time?

Thanks,
Stefano