Hi,
Newbie here. For a story I'm trying to import 330.000 json documents on academic papers, with relationships between papers, authors, institutions, funding and patents. I'm using apoc load json for this, but it's terribly slow, it's starts out pretty slow already, but crawls almost to a halt. A 125M json file with 12.000 records takes about 16 hours on my 16 GB RAM machine.
What I have tried:
- off course looked extensively on this forum
- changed the config and allocated 10GB+ memory to Neo4j Desktop (version 1.4.5).
- Using periodic iterate and importing smaller batches (even 500 records)
- Using constraints and indexes on merging properties, except for authors, because they aren't unique.
- Tried Profile to find the bottleneck, but with 400.000 nodes in, I am at a point that even that takes a very long time to execute.
Coming from Python I assume that my code is slow, using multiple FOREACH loops. Is there any way to make this code more efficient? Or am I missing something else entirely?
This is my query:
UNWIND ['file:///CH.json'] AS filename
CALL apoc.load.json(filename) YIELD value as v
MERGE (p:Paper {title: v.title}) ON CREATE SET p.abstract = COALESCE(v.abstract, 'NULL'),
p.lensId = v.lens_id,
p.datePublished = COALESCE(v.date_published.date, 'NULL'),
p.publicationType = COALESCE(v.publication_type, 'NULL'),
p.scholarlyCitationsCount = v.scholarly_citations_count,
p.patentCitationsCount = v.patent_citations_count
FOREACH (fund in v.funding |
MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET f.country = COALESCE(fund.country, 'NULL'),
f.fundingId = COALESCE(fund.funding_id, 'NULL')
MERGE (f)-[:FUNDED]->(p))
FOREACH (author in v.authors |
MERGE (a:Author {name: COALESCE(author.first_name + ' ' + author.initials + ' ' + author.last_name, 'NULL')})
ON CREATE SET a.firstName = author.first_name,
a.initials = author.initials,
a.lastName = author.last_name,
a.nameInstitution = COALESCE(author.first_name + '_' + author.initials + '_' + author.last_name + '_' + author.affiliations[0].name, 'NULL')
MERGE (a)-[:AUTHORED]->(p)
FOREACH (affiliation in author.affiliations |
MERGE (i:Institution {name: COALESCE(affiliation.name, 'NULL')})
ON CREATE SET i.gridID = COALESCE(affiliation.grid.id, 'NULL')
MERGE (a)-[:WORKS_AT]->(i)))```
The JSON records are like this:
{
"abstract": "Exogenous application of methyl jasmonate (MeJA) has been extensively used to study jasmonate-dependent signaling events triggered by biotic stresses. MeJA application leads to complex jasmonate-dependent physiological responses, including changes in stomatal openness and induction of emissions of a multitude of volatile compounds. Whether the alterations in stomatal conductance and emissions of MeJA-induced volatiles are quantitatively associated with MeJA dose, and whether the induced volatile emissions are regulated by modifications in stomatal conductance had been poorly known until recently. Our latest studies highlighted a biphasic kinetics of jasmonate-dependent volatile emissions induced by MeJA treatment in the model species cucumber (Cucumis sativus), indicating induction of an immediate stress response and subsequent gene-expression level response. Both the immediate and delayed responses were MeJA dose-dependent. The studies further demonstrated that stomata modulated the kinetics of emissions of water-soluble volatiles in a MeJA dose-dependent manner. These studies contribute to understanding of plant short- and long-term responses to different biotic stress severities as simulated by treatments with a range of MeJA doses corresponding to mild to acute stress.",
"authors":
[
{
"affiliations":
[
{
"grid":
{
"addresses":
[
{
"country_code": "CN"
}
],
"id": "grid.27871.3b"
},
"name": "Nanjing Agricultural University"
},
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.16697.3f"
},
"name": "Estonian University of Life Sciences"
}
],
"collective_name": null,
"first_name": "Yifan",
"initials": "Y",
"last_name": "Jiang"
},
{
"affiliations":
[
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.16697.3f"
},
"name": "Estonian University of Life Sciences"
}
],
"collective_name": null,
"first_name": "Jiayan",
"initials": "J",
"last_name": "Ye"
},
{
"affiliations":
[
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.418882.f"
},
"name": "Estonian Academy of Sciences"
},
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.16697.3f"
},
"name": "Estonian University of Life Sciences"
}
],
"collective_name": null,
"first_name": "Ülo",
"initials": "Ü",
"last_name": "Niinemets"
}
],
"chemicals":
[],
"date_published":
{
"date": "2021-04-21T00:00:00+00:00"
},
"end_page": null,
"fields_of_study":
[
{
"name": "Biophysics"
},
{
"name": "Biotic stress"
}
],
"funding":
[
{
"country": null,
"funding_id": null,
"org": "European Regional Development Fund"
},
{
"country": null,
"funding_id": null,
"org": "by European Commission through the European Research Council"
},
{
"country": null,
"funding_id": null,
"org": "by European Commission through the European Research Council"
}
],
"issue": "7",
"keywords":
[
{
"value": "Dose–response"
},
{
"value": "MeJA"
},
{
"value": "green leaf volatiles"
},
{
"value": "lipoxygenase pathway volatiles"
}
],
"lens_id": "000-420-903-932-767",
"mesh_terms":
[],
"patent_citations_count": 0,
"publication_type": "journal article",
So I'm trying to create relationships between all entities in the records. Each records is a paper with, most of the time, several authors with (sometimes multiple) affiliations, funding, etc.
Some pointers would be greatly appreciated.