Json loading is slow after some time

12kunal34 · February 25, 2020, 9:35am

Hello Everyone,

I am trying to load some thousands JSON in neo4j with the help of python script and apoc.json.load.
Initially, for around 1000 JSON, my scripts work fine and loading data very fast but after some time this loading becomes extremely slow.
for the initial 1000, it took only 40 sec but after that, it is taking around 10 sec for each JSON.
is there anything so I can do to speed up my operation.

overall size of all JSON files are 750 MB and i am using 8GB macine to load this data

MuddyBootsCode · February 25, 2020, 10:20am

There are a number of things you can do to make sure you data imports go smoothly. Using Periodic Commit, Merge creating just the first property of your created Nodes, and ensuring that you have enough memory allocated to the process are all contributing factors. Here's a link to the official docs https://neo4j.com/docs/labs/apoc/current/import/load-json/

12kunal34 · February 25, 2020, 10:34am

I have taken care of most of the mentioned suggestions.
Periodic commit would not work in my case because in my scenario i don't have much nodes .each JSON contains around 5k nodes and 7k relationship. i have 14000 json files and each JSON contain this much of data. i am using for loop to iterate each json and each JSON passing to apoc.load.json.
is there any way to iterate in diff manner and what about memory configuration .
for 700 MB of data what properties i need to defile in neo4j,conf ?

MuddyBootsCode · February 25, 2020, 10:51am

You can set the number of periodic commits to a lower number, like 500. You can also convert the JSON into CSV which I've seen recommended here Effective Bulk Data Import into Neo4j. As for memory in your neo4j.conf settings there are a number of settings like heap, etc. you can increase to help with load.

MuddyBootsCode · February 25, 2020, 10:57am

Here's another load json resource Cypher: LOAD JSON from URL AS Data - Neo4j Graph Data Platform

anthapu · February 25, 2020, 1:39pm

Do you have proper indexes created? That could explain the behavior you are seeing.

Without indexes, when there is no data, the ingestion would be faster. As data starts growing the ingestion will start slowing down.

12kunal34 · February 25, 2020, 6:04pm

Is indexing causes problem ?
i have created the indexes before loading. i am loading 3 type of lables and 2 type of relationships. index already created for these labels.

intouch_vivek · February 25, 2020, 6:10pm

Yes ..
As mentioned by @anthapu indexes always cause latency in load. Infact in DW project for all bulk load indexes are removed before historical load and applied post load.

anthapu · February 25, 2020, 6:52pm

Updating indexes does not cause an issue. If your query cannot use index then it can cause the queries to run slower.

Would it be possible to list down your queries and indexes you created to see why they could be slower?

anthapu · February 25, 2020, 6:58pm

@intouch_vivek That's incorrect.

For large data loads using Cypher, if you do not have indexes it can take lot of time, unless you are only using CREATE statements.

Index updates do not have too much of side-effect. So creating indexes before you do the load is very important.

12kunal34 · February 25, 2020, 6:59pm

yeah sure.
but after your answer, I tried without indexes as well but still, again the same issue occurs.

Please find below queries which I am using.

call apoc.load.json("file:///Neo4j/blocks_set_001/'''+f_name+'''") YIELD value
                            MERGE (m:BLK{blk_id: value.id}) 
                                SET m.confirmations = value.confirmations,
                                    m.strippedsize = value.strippedsize ,
                                    m.time = value.time,
                                    m.weight = value.weight,
                                    m.mediantime = value.mediantime,
                                    m.difficulty = value.difficulty
                                                    
                            FOREACH (node in value.tx + [value] | 
                                MERGE (e:TX{txid: node.id}) 
                                    SET  e.tx_locktime = node.locktime ,
                               e.time = apoc.date.format(value.time, 's', 'yyyy/MM/dd HH:mm:ss'),
                                        e.tx_version = node.version,
                                        e.tx_size = node.size,
                                        e.tx_vsize = node.vsize,
                                        e.tx_weight = node.weight
                                                        
                                FOREACH(out in node.vout |
                                    FOREACH(adrc in out.key.addresses|
                                        MERGE (e:ADDRESS{address:adrc})
                                            ))) with value
                                            
        UNWIND value.tx as transaction
            UNWIND transaction.vout as out
                MATCH(n:TX{txid:transaction.txid})
                MATCH(m:ADDRESS{address:out.key.addresses[0]})
                MERGE (n)-[r:OUT]->(m) 
                ON CREATE SET r.id= value.id, 
                              r.output_value = out.value,
                              r.index_out = out.n

please suggest me on what i have to create index and what wrong in query

intouch_vivek · February 25, 2020, 7:10pm

@anthapu, Indexes might not be the problem here..

However as per i know index is useful in data fetch but not in data ingestion..
At the time of data ingestion if index is enabled then DB has to manage that also along with storing data.

andrew_bowman · February 25, 2020, 9:09pm

This depends on how you're ingesting the data. Anthapu noted correctly that if you're only using CREATE statements, then the indexes won't be useful here, and that's more inline with your thinking on this.

However, in the query given we can see MERGE operations on the node, and since MERGE is like a MATCH (and a CREATE if no match is found), this does incorporate data fetching as well, and without indexes on the following, the import will get slower as more nodes are added:

:BLK(blk_id)
:TX(txid)
:ADDRESS(address)

12kunal34 · February 26, 2020, 4:31am

I tried in both ways but getting same result.
with indexes ,it's loading slow and without indexes also.
what wrong i am doing here .
is memory causes the problem ? I am running this script on 8GB of ram and using default neo4j.cong file

12kunal34 · February 27, 2020, 5:23am

Hi @andrew_bowman @anthapu

I noticed that nodes are not the problem in this.
this slowness coming through relationship. I have separated the logic of nodes and rels and got to know that nodes ingestion is very fast (4 min for 200000 nodes) but rels taking so much time. is there any way to fasten this process ??
using below query for rels creation

call apoc.load.json("file:///Neo4j/blocks_set_001/'''+f_name+'''") YIELD value
UNWIND value.tx as transaction
            UNWIND transaction.vout as out
                MATCH(n:TX{txid:transaction.txid})
                MATCH(m:ADDRESS{address:out.key.addresses[0]})
                MERGE (n)-[r:OUT]->(m) 
                ON CREATE SET r.id= value.id, 
                              r.output_value = out.value,
                              r.index_out = out.n

anthapu · February 28, 2020, 3:35pm

Hi Kunal,
When you create a relationship nodes on both sides are locked to make sure to stay ACID compliant. So, it is possible your requests getting serialized if you are running them in parallel.

wilson.chua · October 23, 2020, 3:27am

I am getting the same issue. I tried to split a file of 500mb in size to 18 smaller files.
the first few ingest took relatively short times. but as we hit the 11th file, the load times significantly slowed down.

I have index on all the nodes in question for the merge operations.
Is there any OTHER way we can speed up data ingestion for both load csv and load json operations?

wilson.chua · October 23, 2020, 4:01am

Maybe this could help, i found from somewhere that you can insert ranges of records to ingest:
CALL apoc.load.json(url, '[0:1000]' ) YIELD value AS article

wilson.chua · October 24, 2020, 7:41pm

I switched to using CALL apoc.periodic.iterate and it did help somewhat, but eventually, as the size of the nodes grew, i find that loading json files also grow exponentially:
execution (in seconds) for each batch of json files to ingest:
6415
7496
8179
12389
14385

I have also created indexes to help speed up the ingestion:
// create index to speed up load json operations
create index for (m:Author) on (m.First,m.Last,m.Orcid)
create index for (m:Institution) on (m.Name)
create index for (m:Publication) on (m.DOI)

Here is the main code inside the apoc.periodic.iterate:

//load json table into NEO4j
CALL apoc.load.json(‘file:///list_data_0.json’) yield value
UNWIND value.items as insti WITH insti, insti.reference as references
UNWIND references as refs WITH insti, insti.author as authorname, count(refs) as cntref
UNWIND authorname as authors WITH insti, cntref, count(authors) as cntauthors

// Create publication
MERGE (p:Publication{
Count: insti.is-referenced-by-count,
DateTime: insti.indexed.date-time,
IndexYear:insti.indexed.date-parts[0][0],
IndexMonth:insti.indexed.date-parts[0][1],
IndexDay:insti.indexed.date-parts[0][2],
Prefix:coalesce(insti.prefix,’NONE’),
DateDeposited:insti.deposited.date-time,
Type:coalesce(insti.type,’NONE’),
Title:coalesce(insti.title,’NONE’),
URL:coalesce(insti.URL,’NONE’),
Score: coalesce(insti.score,’NONE’),
ContainerTitle:coalesce(insti.container-title,’NONE’),
Restrictions:coalesce(insti.content-domain.crossmark-restriction,’NONE’),
Member:coalesce(insti.member,’NONE’),
DOI:coalesce(insti.DOI,’NONE’),Language:coalesce(insti.language,’NONE’),
Issntype: coalesce(insti.issn-type[0].type,’NONE’),
IssnValue: coalesce(insti.issn-type[0].value,’NONE’),
LinkURL:coalesce( insti.link[0].URL,’NONE’),
Subject:coalesce(insti.subject,’NONE’),
IntendedApplication:coalesce(insti.link[0].intended-application,’NONE’),
Publisher:coalesce(insti.publisher,’NONE’),
ContainerTitle:coalesce(insti.container-title,’NONE’)})

// create the references per publication
FOREACH (g in range(0,cntref) |
MERGE (k:Publication{key:coalesce(insti.reference[g].key,’NONE’), DOI:coalesce(insti.reference[g].DOI,’NONE’)})
MERGE (p)-[:References]->(k)
)

// create the authors node
// The foreach loop will cycle through all the authors listed in the paper

FOREACH ( l in range(0,cntauthors)|
MERGE (a:Author{First:coalesce(insti.author[l].given,’NONE’),Last:coalesce(insti.author[l].family,’NONE’),Orcid:coalesce(insti.author[l].ORCID,’NONE’)})

//create the affiliation institution
MERGE (i:Institution{Name:coalesce(insti.author[l].affiliation[0].name,’NONE’)})

// Create the link where Author (a) belongs to Institution (i)
MERGE (a)-[r:BelongsTo]->(i)

// create the author to paper link
MERGE (a)-[s:Authored{Seq:coalesce(insti.author[l].sequence,’NONE’)}]->(p)
)

Any help with further optimization is greatly appreciated. Thanks

wilson.chua · October 25, 2020, 3:59am

I changed the code and execution is now much faster:

CALL apoc.load.json('file:///list_data_XX.json') yield value
UNWIND value.items as insti
WITH insti, insti.author as authorname
UNWIND authorname as authors
WITH insti,count(authors) as cntauthors
MERGE (p:Publication{DOI:coalesce(insti.DOI,'NONE')})
ON CREATE SET p.Count= insti.is-referenced-by-count,
p.DateTime= insti.indexed.date-time,p.IndexYear=insti.indexed.date-parts[0][0],
p.IndexMonth=insti.indexed.date-parts[0][1],p.IndexDay=insti.indexed.date-parts[0][2],
p.Prefix=coalesce(insti.prefix,'NONE'),p.DateDeposited=insti.deposited.date-time,
p.Type=coalesce(insti.type,'NONE'),p.Title=coalesce(insti.title,'NONE'),
p.URL=coalesce(insti.URL,'NONE'),p.Score=coalesce(insti.score,'NONE'),
p.ContainerTitle=coalesce(insti.container-title,'NONE'),
p.Restrictions=coalesce(insti.content-domain.crossmark-restriction,'NONE'),
p.Member=coalesce(insti.member,'NONE'),
p.Language=coalesce(insti.language,'NONE'),
p.Issntype=coalesce(insti.issn-type[0].type,'NONE'),
p.IssnValue=coalesce(insti.issn-type[0].value,'NONE'),
p.LinkURL=coalesce( insti.link[0].URL,'NONE'),
p.Subject=coalesce(insti.subject,'NONE'),
p.IntendedApplication=coalesce(insti.link[0].intended-application,'NONE'),
p.Publisher=coalesce(insti.publisher,'NONE'),
p.Subtype=coalesce(insti.Subtype,'NONE'),
p.ContainerTitle=coalesce(insti.container-title,'NONE')

FOREACH (g in range(0,insti.reference-count) |
MERGE (k:Publication{DOI:coalesce(insti.reference[g].DOI,'NONE')})
ON CREATE SET k.key=coalesce(insti.reference[g].key,'NONE')
MERGE (p)-[:References]->(k))

FOREACH ( l in range(0,cntauthors)|
MERGE (a:Author{First:coalesce(insti.author[l].given,'NONE'),
Last:coalesce(insti.author[l].family,'NONE'),
Orcid:coalesce(insti.author[l].ORCID,'NONE')})
MERGE (i:Institution{Name:coalesce(insti.author[l].affiliation[0].name,'NONE')})
MERGE (a)-[r:BelongsTo]->(i)
MERGE (a)-[s:Authored{Seq:coalesce(insti.author[l].sequence,'NONE')}]->(p)
)

Topic		Replies	Views
How to speed up apoc json load Procedures & APOC apoc , performance , import	9	813	October 18, 2021
Need help to optimize json load performance Neo4j Graph Platform	11	316	February 8, 2024
How to speed up JSON data load? Neo4j Graph Platform memory , apocperiodiciterate	7	267	February 17, 2023
Help me merge 170M relationships with LOAD CSV Cypher load-csv	10	3639	October 23, 2019
Another "speed up the load" question from a relatively inexperienced Neo4j user Procedures & APOC apoc , cypher , import	37	1759	July 14, 2020

July Summer Fun!

Json loading is slow after some time

Related topics