I switched to using CALL apoc.periodic.iterate and it did help somewhat, but as the graph grew, I found that the time to load each batch of JSON files keeps climbing as well, more than doubling between the first and the fifth batch.
Execution time (in seconds) for each batch of JSON files ingested:
6415
7496
8179
12389
14385
I have also created indexes to help speed up the ingestion:
// Create indexes to speed up the load-JSON operations
CREATE INDEX FOR (m:Author) ON (m.First, m.Last, m.Orcid);
CREATE INDEX FOR (m:Institution) ON (m.Name);
CREATE INDEX FOR (m:Publication) ON (m.DOI);
Here is the main code that runs inside apoc.periodic.iterate:
// Load the JSON file into Neo4j
CALL apoc.load.json('file:///list_data_0.json') YIELD value
UNWIND value.items AS insti
WITH insti, insti.reference AS references
UNWIND references AS refs
WITH insti, insti.author AS authorname, count(refs) AS cntref
UNWIND authorname AS authors
WITH insti, cntref, count(authors) AS cntauthors
// Create the publication (hyphenated JSON keys must be backtick-quoted in Cypher)
MERGE (p:Publication {
  Count: insti.`is-referenced-by-count`,
  DateTime: insti.indexed.`date-time`,
  IndexYear: insti.indexed.`date-parts`[0][0],
  IndexMonth: insti.indexed.`date-parts`[0][1],
  IndexDay: insti.indexed.`date-parts`[0][2],
  Prefix: coalesce(insti.prefix, 'NONE'),
  DateDeposited: insti.deposited.`date-time`,
  Type: coalesce(insti.type, 'NONE'),
  Title: coalesce(insti.title, 'NONE'),
  URL: coalesce(insti.URL, 'NONE'),
  Score: coalesce(insti.score, 'NONE'),
  ContainerTitle: coalesce(insti.`container-title`, 'NONE'),
  Restrictions: coalesce(insti.`content-domain`.`crossmark-restriction`, 'NONE'),
  Member: coalesce(insti.member, 'NONE'),
  DOI: coalesce(insti.DOI, 'NONE'),
  Language: coalesce(insti.language, 'NONE'),
  Issntype: coalesce(insti.`issn-type`[0].type, 'NONE'),
  IssnValue: coalesce(insti.`issn-type`[0].value, 'NONE'),
  LinkURL: coalesce(insti.link[0].URL, 'NONE'),
  Subject: coalesce(insti.subject, 'NONE'),
  IntendedApplication: coalesce(insti.link[0].`intended-application`, 'NONE'),
  Publisher: coalesce(insti.publisher, 'NONE')
})
// Create the references for each publication.
// range() is inclusive at both ends, so the last valid index is cntref - 1.
FOREACH (g IN range(0, cntref - 1) |
  MERGE (k:Publication {key: coalesce(insti.reference[g].key, 'NONE'), DOI: coalesce(insti.reference[g].DOI, 'NONE')})
  MERGE (p)-[:References]->(k)
)
// Create the author nodes.
// The FOREACH loop cycles through all the authors listed in the paper
// (range() is inclusive, so the last valid index is cntauthors - 1).
FOREACH (l IN range(0, cntauthors - 1) |
  MERGE (a:Author {First: coalesce(insti.author[l].given, 'NONE'), Last: coalesce(insti.author[l].family, 'NONE'), Orcid: coalesce(insti.author[l].ORCID, 'NONE')})
  // Create the affiliated institution
  MERGE (i:Institution {Name: coalesce(insti.author[l].affiliation[0].name, 'NONE')})
  // Create the link where Author (a) belongs to Institution (i)
  MERGE (a)-[r:BelongsTo]->(i)
  // Create the author-to-paper link
  MERGE (a)-[s:Authored {Seq: coalesce(insti.author[l].sequence, 'NONE')}]->(p)
)
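For context, the statements above sit inside a call of roughly the following shape. This is only a simplified sketch: the batchSize and parallel values are illustrative assumptions rather than the settings from my actual run, and the action statement is cut down to a single MERGE on the DOI so the overall structure is easier to see.

```cypher
// Sketch only: batchSize and parallel are assumed values, and the
// action statement is shortened to a single MERGE for readability.
CALL apoc.periodic.iterate(
  // First statement: stream one row per item from the JSON file
  "CALL apoc.load.json('file:///list_data_0.json') YIELD value
   UNWIND value.items AS insti
   RETURN insti",
  // Second statement: runs once per streamed row, committed in batches
  "MERGE (p:Publication {DOI: coalesce(insti.DOI, 'NONE')})",
  {batchSize: 1000, parallel: false}
)
```

The first statement produces the rows and the second is executed for each row, with a commit after every batchSize rows.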
Any help with further optimization is greatly appreciated. Thanks