How to speed up apoc json load

Hi,

Newbie here. For a story I'm trying to import 330.000 json documents on academic papers, with relationships between papers, authors, institutions, funding and patents. I'm using apoc load json for this, but it's terribly slow, it's starts out pretty slow already, but crawls almost to a halt. A 125M json file with 12.000 records takes about 16 hours on my 16 GB RAM machine.

What I have tried:

  • off course looked extensively on this forum
  • changed the config and allocated 10GB+ memory to Neo4j Desktop (version 1.4.5).
  • Using periodic iterate and importing smaller batches (even 500 records)
  • Using constraints and indexes on merging properties, except for authors, because they aren't unique.
  • Tried Profile to find the bottleneck, but with 400.000 nodes in, I am at a point that even that takes a very long time to execute.

Coming from Python I assume that my code is slow, using multiple FOREACH loops. Is there any way to make this code more efficient? Or am I missing something else entirely?


This is my query:

UNWIND ['file:///CH.json'] AS filename
CALL apoc.load.json(filename) YIELD value as v

MERGE (p:Paper {title: v.title}) ON CREATE SET p.abstract = COALESCE(v.abstract, 'NULL'),
        p.lensId =  v.lens_id,
        p.datePublished = COALESCE(v.date_published.date, 'NULL'),
        p.publicationType = COALESCE(v.publication_type, 'NULL'),
        p.scholarlyCitationsCount = v.scholarly_citations_count,
        p.patentCitationsCount = v.patent_citations_count

FOREACH (fund in v.funding | 
MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET f.country = COALESCE(fund.country, 'NULL'),
        f.fundingId = COALESCE(fund.funding_id, 'NULL')
MERGE (f)-[:FUNDED]->(p)) 
FOREACH (author in v.authors | 
MERGE (a:Author {name: COALESCE(author.first_name + ' ' + author.initials + ' ' + author.last_name, 'NULL')}) 
    ON CREATE SET a.firstName = author.first_name,
    a.initials = author.initials,
    a.lastName = author.last_name,
    a.nameInstitution = COALESCE(author.first_name + '_' + author.initials + '_' + author.last_name + '_' + author.affiliations[0].name, 'NULL')
MERGE (a)-[:AUTHORED]->(p)

FOREACH (affiliation in author.affiliations |
MERGE (i:Institution {name: COALESCE(affiliation.name, 'NULL')}) 
    ON CREATE SET i.gridID = COALESCE(affiliation.grid.id, 'NULL')
MERGE (a)-[:WORKS_AT]->(i)))``` 

The JSON records are like this:

{
        "abstract": "Exogenous application of methyl jasmonate (MeJA) has been extensively used to study jasmonate-dependent signaling events triggered by biotic stresses. MeJA application leads to complex jasmonate-dependent physiological responses, including changes in stomatal openness and induction of emissions of a multitude of volatile compounds. Whether the alterations in stomatal conductance and emissions of MeJA-induced volatiles are quantitatively associated with MeJA dose, and whether the induced volatile emissions are regulated by modifications in stomatal conductance had been poorly known until recently. Our latest studies highlighted a biphasic kinetics of jasmonate-dependent volatile emissions induced by MeJA treatment in the model species cucumber (Cucumis sativus), indicating induction of an immediate stress response and subsequent gene-expression level response. Both the immediate and delayed responses were MeJA dose-dependent. The studies further demonstrated that stomata modulated the kinetics of emissions of water-soluble volatiles in a MeJA dose-dependent manner. These studies contribute to understanding of plant short- and long-term responses to different biotic stress severities as simulated by treatments with a range of MeJA doses corresponding to mild to acute stress.",
        "authors":
        [
            {
                "affiliations":
                [
                    {
                        "grid":
                        {
                            "addresses":
                            [
                                {
                                    "country_code": "CN"
                                }
                            ],
                            "id": "grid.27871.3b"
                        },
                        "name": "Nanjing Agricultural University"
                    },
                    {
                        "grid":
                        {
                            "addresses":
                            [
                                {
                                    "country_code": "EE"
                                }
                            ],
                            "id": "grid.16697.3f"
                        },
                        "name": "Estonian University of Life Sciences"
                    }
                ],
                "collective_name": null,
                "first_name": "Yifan",
                "initials": "Y",
                "last_name": "Jiang"
            },
            {
                "affiliations":
                [
                    {
                        "grid":
                        {
                            "addresses":
                            [
                                {
                                    "country_code": "EE"
                                }
                            ],
                            "id": "grid.16697.3f"
                        },
                        "name": "Estonian University of Life Sciences"
                    }
                ],
                "collective_name": null,
                "first_name": "Jiayan",
                "initials": "J",
                "last_name": "Ye"
            },
            {
                "affiliations":
                [
                    {
                        "grid":
                        {
                            "addresses":
                            [
                                {
                                    "country_code": "EE"
                                }
                            ],
                            "id": "grid.418882.f"
                        },
                        "name": "Estonian Academy of Sciences"
                    },
                    {
                        "grid":
                        {
                            "addresses":
                            [
                                {
                                    "country_code": "EE"
                                }
                            ],
                            "id": "grid.16697.3f"
                        },
                        "name": "Estonian University of Life Sciences"
                    }
                ],
                "collective_name": null,
                "first_name": "Ülo",
                "initials": "Ü",
                "last_name": "Niinemets"
            }
        ],
        "chemicals":
        [],
        "date_published":
        {
            "date": "2021-04-21T00:00:00+00:00"
        },
        "end_page": null,
        "fields_of_study":
        [
            {
                "name": "Biophysics"
            },
            {
                "name": "Biotic stress"
            }
        ],
        "funding":
        [
            {
                "country": null,
                "funding_id": null,
                "org": "European Regional Development Fund"
            },
            {
                "country": null,
                "funding_id": null,
                "org": "by European Commission through the European Research Council"
            },
            {
                "country": null,
                "funding_id": null,
                "org": "by European Commission through the European Research Council"
            }
        ],
        "issue": "7",
        "keywords":
        [
            {
                "value": "Dose–response"
            },
            {
                "value": "MeJA"
            },
            {
                "value": "green leaf volatiles"
            },
            {
                "value": "lipoxygenase pathway volatiles"
            }
        ],
        "lens_id": "000-420-903-932-767",
        "mesh_terms":
        [],
        "patent_citations_count": 0,
        "publication_type": "journal article",

So I'm trying to create relationships between all entities in the records. Each records is a paper with, most of the time, several authors with (sometimes multiple) affiliations, funding, etc.

Some pointers would be greatly appreciated.

Hi Dimitry
Please check the reference from feb 2020 similar to the case you describe.

Regards

Thanks Refeaeli, I did read that and tried some of the suggestions. One of the cases is pretty similar to mine and I rewrote part of the query, but to no avail.

what indexes are on
:Paper
:Funding
:Author
:Institution

Hi Dana,

I see that I have one duplicate index on author, but here's the rest.

authorInstitutionNameIndex	BTREE	NONUNIQUE	NODE	[ "Author" ]	[ "nameInstitution" ]	ONLINE
authors	BTREE	NONUNIQUE	NODE	[ "Author" ]	[ "name", "nameInstitution" ]	ONLINE
constraint_1439150b	BTREE	UNIQUE	NODE	[ "Paper" ]	[ "title" ]	ONLINE
constraint_350d222c	BTREE	UNIQUE	NODE	[ "Institution" ]	[ "name" ]	ONLINE
constraint_41e068e1	BTREE	UNIQUE	NODE	[ "Patent" ]	[ "patentLensId" ]	ONLINE
constraint_8fed31ab	BTREE	UNIQUE	NODE	[ "Funding" ]	[ "fundingOrg" ]	ONLINE
constraint_c18c113c	BTREE	UNIQUE	NODE	[ "Paper" ]	[ "lensId" ]	ONLINE
constraint_e20ec68d	BTREE	UNIQUE	NODE	[ "Institution" ]	[ "gridId" ]	ONLINE
institutions	BTREE	NONUNIQUE	NODE	[ "Institution" ]	[ "name", "gridId" ]	ONLINE
papers	BTREE	NONUNIQUE	NODE	[ "Paper" ]	[ "title", "lensId" ]	ONLINE
patents	BTREE	NONUNIQUE	NODE	[ "Patent" ]	[ "title", "lensId" ]	ONLINE

Constraints

ON ( paper:Paper ) ASSERT (paper.title) IS UNIQUE
ON ( institution:Institution ) ASSERT (institution.name) IS UNIQUE
ON ( patent:Patent ) ASSERT (patent.patentLensId) IS UNIQUE
ON ( funding:Funding ) ASSERT (funding.fundingOrg) IS UNIQUE
ON ( paper:Paper ) ASSERT (paper.lensId) IS UNIQUE
ON ( institution:Institution ) ASSERT (institution.gridId) IS UNIQUE

@dimitri.tokmetzis

unless mistaken I'm not seeing an index on :Funding(name) which would be of benefit for MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET ... ...

Yes, that's true, thank you. I'm trying that now, but it still is very slow....

If you have all your indexes correct and that is not what is slowing down the merges, then I would think the problem is you are bringing the entire JSON into memory and committing it in one go. To that end I would recommend you use apoc.periodic.iterate to create batches.

so something like:

CALL apoc.periodic.iterate(
"UNWIND ['file:///CH.json'] AS filename
CALL apoc.load.json(filename) YIELD value as v",
"MERGE (p:Paper {title: v.title}) ON CREATE SET p.abstract = COALESCE(v.abstract, 'NULL'),
        p.lensId =  v.lens_id,
        p.datePublished = COALESCE(v.date_published.date, 'NULL'),
        p.publicationType = COALESCE(v.publication_type, 'NULL'),
        p.scholarlyCitationsCount = v.scholarly_citations_count,
        p.patentCitationsCount = v.patent_citations_count

   FOREACH (fund in v.funding | 
MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET f.country = COALESCE(fund.country, 'NULL'),
        f.fundingId = COALESCE(fund.funding_id, 'NULL')
MERGE (f)-[:FUNDED]->(p)) 
FOREACH (author in v.authors | 
MERGE (a:Author {name: COALESCE(author.first_name + ' ' + author.initials + ' ' + author.last_name, 'NULL')}) 
    ON CREATE SET a.firstName = author.first_name,
    a.initials = author.initials,
    a.lastName = author.last_name,
    a.nameInstitution = COALESCE(author.first_name + '_' + author.initials + '_' + author.last_name + '_' + author.affiliations[0].name, 'NULL')
MERGE (a)-[:AUTHORED]->(p)

FOREACH (affiliation in author.affiliations |
MERGE (i:Institution {name: COALESCE(affiliation.name, 'NULL')}) 
    ON CREATE SET i.gridID = COALESCE(affiliation.grid.id, 'NULL')
MERGE (a)-[:WORKS_AT]->(i))) "
,{batchSize:100, parallel:false})
1 Like

Dimitri,

I would try creating a temporary node for each document with all relevant JSON fields imported as properties of the node. Once the data is imported into neo4j, it's much faster to perform the parsing functions you have listed to create the specific nodes you need rather than performing these functions as you import the data.

thanks for the suggestion.
I discovered a small error in my index and decided to load the json through python and that worked a lot quicker. In the end, it still took 7 hours, but that is manageble for now. I will try your suggestion to shave off some extra hours though.