JSON loading is slow after some time

Hello Everyone,

I am trying to load a few thousand JSON files into Neo4j using a Python script and apoc.load.json.
Initially, for around 1,000 JSON files, my script works fine and loads the data very fast, but after some time the loading becomes extremely slow.
The initial 1,000 files took only 40 seconds, but after that it takes around 10 seconds per JSON file.
Is there anything I can do to speed up this operation?

The overall size of all the JSON files is 750 MB, and I am using an 8 GB machine to load this data.

There are a number of things you can do to make sure your data imports go smoothly. Using periodic commit, MERGEing on just a single key property of your created nodes, and ensuring that you have enough memory allocated to the process are all contributing factors. Here's a link to the official docs: https://neo4j.com/docs/labs/apoc/current/import/load-json/
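To illustrate the MERGE-on-a-single-key point: keep only the unique key inside the MERGE pattern and move every other property into ON CREATE SET / SET, so an index on that key can serve the lookup. A minimal sketch (the :Person label, property names, and file name are placeholders, not from the thread):

```cypher
// Assumed: an index exists on :Person(id), e.g.
// CREATE INDEX ON :Person(id)
CALL apoc.load.json("file:///people.json") YIELD value
// MERGE only on the indexed key property...
MERGE (p:Person {id: value.id})
// ...and set the remaining properties separately,
// so the MERGE lookup stays an index seek
ON CREATE SET p.name = value.name,
              p.created = value.created
SET p.updated = value.updated
```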

I have taken care of most of the mentioned suggestions.
Periodic commit would not help much in my case, because each individual file is not that large: each JSON contains around 5k nodes and 7k relationships, and I have 14,000 JSON files in total. I am using a for loop to iterate over the files and passing each JSON to apoc.load.json.
Is there any way to iterate differently, and what about memory configuration?
For 750 MB of data, what properties do I need to define in neo4j.conf?

You can set the periodic commit batch size to a lower number, like 500. You can also convert the JSON into CSV, which I've seen recommended here: https://neo4j.com/blog/bulk-data-import-neo4j-3-0/. As for memory, in your neo4j.conf there are a number of settings, like the heap size, that you can increase to help with the load.
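As a rough starting point, these are the memory-related neo4j.conf settings usually tuned for imports. The sizes below are illustrative guesses for an 8 GB machine, not recommendations:

```properties
# Fixed-size heap avoids resize pauses during the load
dbms.memory.heap.initial_size=3g
dbms.memory.heap.max_size=3g
# Page cache holds the store files; size it to what is left
# after the heap and the OS are accounted for
dbms.memory.pagecache.size=2g
```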

Here's another load json resource https://neo4j.com/blog/cypher-load-json-from-url/

Do you have proper indexes created? That could explain the behavior you are seeing.

Without indexes, ingestion is fast while there is little data, but as the data grows, ingestion starts slowing down.

Does indexing cause the problem?
I created the indexes before loading. I am loading 3 types of labels and 2 types of relationships, and indexes were already created for these labels.

Yes.
As mentioned by @anthapu, indexes always add latency to a load. In fact, in DW projects, for all bulk loads the indexes are dropped before the historical load and re-created afterwards.

Updating indexes does not cause an issue. If your query cannot use an index, then that can cause the queries to run slower.

Would it be possible to list the queries and the indexes you created, to see why they could be slower?

@intouch.vivek That's incorrect.

For large data loads using Cypher, if you do not have indexes it can take a lot of time, unless you are only using CREATE statements.

Index updates do not have much of a side effect, so creating indexes before you do the load is very important.

Yeah, sure.
But after your answer I tried without indexes as well, and the same issue still occurs.

Please find below the queries I am using.

call apoc.load.json("file:///Neo4j/blocks_set_001/'''+f_name+'''") YIELD value
MERGE (m:BLK {blk_id: value.id})
SET m.confirmations = value.confirmations,
    m.strippedsize = value.strippedsize,
    m.time = value.time,
    m.weight = value.weight,
    m.mediantime = value.mediantime,
    m.difficulty = value.difficulty
FOREACH (node in value.tx + [value] |
    MERGE (e:TX {txid: node.id})
    SET e.tx_locktime = node.locktime,
        e.time = apoc.date.format(value.time, 's', 'yyyy/MM/dd HH:mm:ss'),
        e.tx_version = node.version,
        e.tx_size = node.size,
        e.tx_vsize = node.vsize,
        e.tx_weight = node.weight
    FOREACH (out in node.vout |
        FOREACH (adrc in out.key.addresses |
            MERGE (e:ADDRESS {address: adrc}))))
WITH value
UNWIND value.tx as transaction
UNWIND transaction.vout as out
MERGE (n)-[r:OUT]->(m)
ON CREATE SET r.id = value.id,
              r.output_value = out.value,
              r.index_out = out.n

Please suggest what I should create indexes on and what is wrong in the query.

@anthapu, indexes might not be the problem here.

However, as far as I know, an index helps with data fetching, not with data ingestion.
During ingestion, if an index is enabled, the DB has to maintain the index as well, on top of storing the data.

This depends on how you're ingesting the data. Anthapu noted correctly that if you're only using CREATE statements, then the indexes won't be useful here, and that's more in line with your thinking on this.

However, in the query given we can see MERGE operations on nodes, and since MERGE is like a MATCH (plus a CREATE if no match is found), it does incorporate data fetching as well. Without indexes on the merged properties, the import will get slower as more nodes are added:
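Going by the MERGE patterns in the query above, the indexes in question would presumably be these (Neo4j 3.x CREATE INDEX syntax):

```cypher
// One index per label/property pair used in a MERGE lookup
CREATE INDEX ON :BLK(blk_id);
CREATE INDEX ON :TX(txid);
CREATE INDEX ON :ADDRESS(address);
```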


I tried it both ways but get the same result:
with indexes it loads slowly, and without indexes as well.
What am I doing wrong here?
Is memory causing the problem? I am running this script on 8 GB of RAM and using the default neo4j.conf file.

Hi @andrew.bowman @anthapu

I noticed that the nodes are not the problem here.
The slowness comes from the relationships. I separated the logic for nodes and relationships and found that node ingestion is very fast (4 minutes for 200,000 nodes) but the relationships take much longer. Is there any way to speed this up?
I am using the query below for relationship creation:

call apoc.load.json("file:///Neo4j/blocks_set_001/'''+f_name+'''") YIELD value
UNWIND value.tx as transaction
UNWIND transaction.vout as out
MERGE (n)-[r:OUT]->(m)
ON CREATE SET r.id = value.id,
              r.output_value = out.value,
              r.index_out = out.n
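Note that `n` and `m` are unbound in this MERGE, so the pattern is not anchored to any label or indexed property; Cypher has to consider arbitrary node pairs, or create fresh anonymous nodes, and that cost grows with the graph. A sketch of one possible rework, assuming the intended endpoints are the :TX and :ADDRESS nodes created earlier (the property mapping here is a guess based on the first query, not confirmed by the poster):

```cypher
call apoc.load.json("file:///Neo4j/blocks_set_001/'''+f_name+'''") YIELD value
UNWIND value.tx as transaction
UNWIND transaction.vout as out
// Bind both endpoints via indexed lookups before
// MERGEing the relationship between them
MATCH (n:TX {txid: transaction.id})
UNWIND out.key.addresses as adrc
MATCH (m:ADDRESS {address: adrc})
MERGE (n)-[r:OUT]->(m)
ON CREATE SET r.id = value.id,
              r.output_value = out.value,
              r.index_out = out.n
```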

Hi Kunal,
When you create a relationship, the nodes on both sides are locked to stay ACID compliant. So it is possible your requests are getting serialized if you are running them in parallel.

I am getting the same issue. I tried to split a 500 MB file into 18 smaller files.
The first few ingests took relatively little time, but by the 11th file the load times had slowed down significantly.

I have indexes on all the nodes in question for the MERGE operations.
Is there any OTHER way we can speed up data ingestion, for both LOAD CSV and load JSON operations?

Maybe this could help; I found somewhere that you can ingest ranges of records:
CALL apoc.load.json(url, '[0:1000]' ) YIELD value AS article

I switched to using CALL apoc.periodic.iterate and it did help somewhat, but eventually, as the number of nodes grew, I found that the time to load the JSON files also grew exponentially.
Execution time (in seconds) for each batch of JSON files ingested:

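For reference, the apoc.periodic.iterate wrapping looked roughly like this (a sketch: the batch size is illustrative, and the inner MERGE is a simplified placeholder rather than the full query):

```cypher
CALL apoc.periodic.iterate(
  // outer statement streams one row per item
  "CALL apoc.load.json('file:///list_data_0.json') YIELD value
   UNWIND value.items AS insti RETURN insti",
  // inner statement is applied and committed in batches
  "MERGE (p:Publication {DOI: coalesce(insti.DOI, 'NONE')})",
  {batchSize: 500, parallel: false}
)
```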
I have also created indexes to help speed up the ingestion:
// create indexes to speed up load json operations
CREATE INDEX FOR (m:Author) ON (m.First, m.Last, m.Orcid);
CREATE INDEX FOR (m:Institution) ON (m.Name);
CREATE INDEX FOR (m:Publication) ON (m.DOI);

Here is the main code inside the apoc.periodic.iterate:

//load json table into Neo4j
CALL apoc.load.json('file:///list_data_0.json') YIELD value
UNWIND value.items AS insti WITH insti, insti.reference AS references
UNWIND references AS refs WITH insti, insti.author AS authorname, count(refs) AS cntref
UNWIND authorname AS authors WITH insti, cntref, count(authors) AS cntauthors

// Create publication
MERGE (p:Publication{
Count: insti.`is-referenced-by-count`,
DateTime: insti.indexed.`date-time`,
Score: coalesce(insti.score,'NONE'),
Issntype: coalesce(insti.`issn-type`[0].type,'NONE'),
IssnValue: coalesce(insti.`issn-type`[0].value,'NONE'),
LinkURL: coalesce(insti.link[0].URL,'NONE')})

// create the references per publication
FOREACH (g in range(0,cntref-1) |
MERGE (k:Publication{key:coalesce(insti.reference[g].key,'NONE'), DOI:coalesce(insti.reference[g].DOI,'NONE')})
MERGE (p)-[:References]->(k))

// create the author nodes
// The foreach loop cycles through all the authors listed in the paper
FOREACH (l in range(0,cntauthors-1) |
MERGE (a:Author{First:coalesce(insti.author[l].given,'NONE'),Last:coalesce(insti.author[l].family,'NONE'),Orcid:coalesce(insti.author[l].ORCID,'NONE')})

//create the affiliation institution
MERGE (i:Institution{Name:coalesce(insti.author[l].affiliation[0].name,'NONE')})

// Create the link where Author (a) belongs to Institution (i)
MERGE (a)-[r:BelongsTo]->(i)

// create the author-to-paper link
MERGE (a)-[s:Authored{Seq:coalesce(insti.author[l].sequence,'NONE')}]->(p))

Any help with further optimization is greatly appreciated. Thanks

I changed the code and execution is now much faster:

CALL apoc.load.json('file:///list_data_XX.json') YIELD value
UNWIND value.items AS insti
WITH insti, insti.author AS authorname
UNWIND authorname AS authors
WITH insti, count(authors) AS cntauthors
MERGE (p:Publication{DOI:coalesce(insti.DOI,'NONE')})
ON CREATE SET p.Count = insti.`is-referenced-by-count`,
p.DateTime = insti.indexed.`date-time`, p.IndexYear = insti.indexed.`date-parts`[0][0],
p.LinkURL = coalesce(insti.link[0].URL,'NONE')

FOREACH (g in range(0,insti.`reference-count`-1) |
MERGE (k:Publication{DOI:coalesce(insti.reference[g].DOI,'NONE')})
ON CREATE SET k.key = coalesce(insti.reference[g].key,'NONE')
MERGE (p)-[:References]->(k))

FOREACH (l in range(0,cntauthors-1) |
MERGE (a:Author{First:coalesce(insti.author[l].given,'NONE'),
                Last:coalesce(insti.author[l].family,'NONE'),
                Orcid:coalesce(insti.author[l].ORCID,'NONE')})
MERGE (i:Institution{Name:coalesce(insti.author[l].affiliation[0].name,'NONE')})
MERGE (a)-[r:BelongsTo]->(i)
MERGE (a)-[s:Authored{Seq:coalesce(insti.author[l].sequence,'NONE')}]->(p))