Need help to optimize json load performance

lkevin1210 · January 18, 2024, 7:40am

I have a json file with 3 mil records, I am using apoc.load.json() to read the json records and apply multiple complex match/merge/create logic on it.

My cypher is like belows:

CALL apoc.load.json('test.json.gz') YIELD value

CALL {
  WITH value

  ... very complex logic goes here ...

RETURN *

} IN TRANSACTIONS OF 10000 ROWS

return *;

Since the json file is very large and each line of record contain multiple layers of nested data structure, I have to read it in batches of transaction.

Right now it is taking around 3 hours to complete, may I know what are some best practices in loading large amount of json records?

glilienfield · January 18, 2024, 12:56pm

Maybe post your complex logic so we can review it for performance tips.

lkevin1210 · January 19, 2024, 3:06am

Sure.
Here is one line of record in my json:

{
	"node": {
		"label": "MEMBER",
		"id": [
			{
				"key": "globalcustid",
				"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
			}
		],
		"properties": [
			{
				"key": "globalcustid",
				"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
			},
			{
				"key": "hashmobileno",
				"value": "3b7d3e14562eb421e07e20b15fcc0f954f63362d5ed607a9e9c5d804f4fd2416"
			}
		]
	},
	"relationships": [
		{
			"label": "JOIN",
			"from_node": {
				"label": "MEMBER",
				"id": [
					{
						"key": "globalcustid",
						"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
					}
				]
			},
			"to_node": {
				"label": "MEMBERSHIP",
				"id": [
					{
						"key": "bu",
						"value": "YATA"
					},
					{
						"key": "member_id",
						"value": "M9122805"
					}
				]
			},
			"properties": [
				{
					"key": "registration_datetime",
					"value": "2023-12-06 00:00:00"
				}
			]
		},
		{
			"label": "AGE_IN",
			"from_node": {
				"label": "MEMBER",
				"id": [
					{
						"key": "globalcustid",
						"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
					}
				]
			},
			"to_node": {
				"label": "AGE_GROUP",
				"id": [
					{
						"key": "tag_id",
						"value": "G1100001A03"
					}
				]
			},
			"properties": [
				{
					"key": "createdate",
					"value": "2023-11-29"
				},
				{
					"key": "enddate",
					"value": "9999-12-31"
				},
				{
					"key": "value",
					"value": "1.0"
				}
			]
		},
		{
			"label": "OWN",
			"from_node": {
				"label": "MEMBER",
				"id": [
					{
						"key": "globalcustid",
						"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
					}
				]
			},
			"to_node": {
				"label": "DEVICE",
				"id": [
					{
						"key": "tag_id",
						"value": "G1130002D03"
					}
				]
			},
			"properties": [
				{
					"key": "createdate",
					"value": "2023-12-08"
				},
				{
					"key": "enddate",
					"value": "2025-01-06"
				},
				{
					"key": "value",
					"value": "1.0"
				}
			]
		},
		{
			"label": "LIVE_IN",
			"from_node": {
				"label": "MEMBER",
				"id": [
					{
						"key": "globalcustid",
						"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
					}
				]
			},
			"to_node": {
				"label": "DISTRICT",
				"id": [
					{
						"key": "tag_id",
						"value": "G1100006H06"
					}
				]
			},
			"properties": [
				{
					"key": "createdate",
					"value": "2023-11-29"
				},
				{
					"key": "enddate",
					"value": "9999-12-31"
				},
				{
					"key": "value",
					"value": "1.0"
				}
			]
		},
		{
			"label": "IS_GENDER",
			"from_node": {
				"label": "MEMBER",
				"id": [
					{
						"key": "globalcustid",
						"value": "0000578817c244ad6d25f623d660a1d4c72c0dcd0f35c1f223d43db8ddeefeea"
					}
				]
			},
			"to_node": {
				"label": "GENDER",
				"id": [
					{
						"key": "tag_id",
						"value": "G1100004G01"
					}
				]
			},
			"properties": [
				{
					"key": "createdate",
					"value": "2024-01-01"
				},
				{
					"key": "enddate",
					"value": "9999-12-31"
				},
				{
					"key": "value",
					"value": "1.0"
				}
			]
		}
	]
}

Here is the cypher i used to process and load the record to merge/create different nodes/relationships:

CALL apoc.load.json('file:///test.json.gz') YIELD value

CALL {
  WITH value

  WITH value.node as node, value.relationships as relationships
  WITH toUpper(node.label) as node_label,
  apoc.map.fromLists(
    [entry IN node.id | toLower(entry.key)],
    [entry IN node.id | toLower(entry.value)]
  ) AS node_id, 
  apoc.map.setKey(
    apoc.map.fromLists(
      [entry IN node.properties | toLower(entry.key)],
      [entry IN node.properties | toLower(entry.value)]
    ),
    'last_sys_upd_dt',
    datetime({epochMillis: apoc.date.currentTimestamp()}) + duration({hours: 8})
  ) AS node_properties, relationships

  // Create member node
  CALL apoc.merge.node(
    [node_label],
    node_id,
    node_properties,
    node_properties
  ) YIELD node as member_node

  // Unnest relationships and process each of them
  WITH member_node, relationships
  UNWIND relationships as relationship
  
  WITH member_node, relationship 
  WHERE relationship IS NOT NULL

  WITH member_node, 
  toUpper(relationship.label) as rel_label,
  relationship.from_node as from_node,
  relationship.to_node as to_node,
  relationship.properties as rel_properties

  WITH member_node, 
  rel_label, 
  apoc.map.fromLists(
    [entry IN from_node.id | toLower(entry.key)],
    [entry IN from_node.id | toLower(entry.value)]
  ) AS from_node_id, 
  toUpper(from_node.label) as from_node_label,
  apoc.map.fromLists(
    [entry IN to_node.id | toLower(entry.key)],
    [entry IN to_node.id | toLower(entry.value)]
  ) AS to_node_id, 
  toUpper(to_node.label) as to_node_label,
  CASE 
  WHEN rel_properties IS NOT NULL THEN
  apoc.map.setKey(
    apoc.map.fromLists(
      [entry IN rel_properties | toLower(entry.key)],
      [entry IN rel_properties | toLower(entry.value)]
    ),
    'last_sys_upd_dt',
    datetime({epochMillis: apoc.date.currentTimestamp()}) + duration({hours: 8})
  ) 
  ELSE {
    last_sys_upd_dt: datetime({epochMillis: apoc.date.currentTimestamp()}) + duration({hours: 8})
  }
  END AS rel_properties
  
  // Special handling of relationship properties data types
  WITH member_node, rel_label, 
  from_node_id, from_node_label, 
  to_node_id, to_node_label,
  apoc.map.setValues(
	rel_properties,
	[
		"createdate", date(rel_properties.createdate),
		"enddate", date(rel_properties.enddate),
		"registration_datetime", CASE
			WHEN rel_properties.registration_datetime IS NOT NULL
			THEN datetime({epochmillis: apoc.date.parse(rel_properties.registration_datetime, "ms", "yyyy-MM-dd hh:mm:ss")})
			ELSE null
		END,
		"value", toFloat(rel_properties.value)
	]
  ) as rel_properties

  // Skip those relationship if no from_node_id or to_node_id
  WHERE from_node_id IS NOT NULL and to_node_id IS NOT NULL

  // Match from node
  CALL apoc.merge.node(
    [from_node_label],
    from_node_id,
    {last_sys_upd_dt: datetime({epochMillis: apoc.date.currentTimestamp()}) + duration({hours: 8})},
    {}
  ) YIELD node as from_node

  // Match to node
  CALL apoc.merge.node(
    [to_node_label],
    to_node_id,
    {last_sys_upd_dt: datetime({epochMillis: apoc.date.currentTimestamp()}) + duration({hours: 8})},
    {}
  ) YIELD node as to_node

  // Create relationship
  CALL apoc.merge.relationship(
    from_node,
    rel_label,
    {},
    rel_properties,
    to_node,
    {}
  ) YIELD rel as member_node_rel

  RETURN *

} IN TRANSACTIONS OF 10000 ROWS

return member_node, member_node_rel;

Appreciate it if you would give me some direction to optimization it.

glilienfield · January 19, 2024, 3:19am

Sure, I can review tomorrow. Is this one element in a list of json objects or the file contains only the json you provided.

Ok, rereading your original post, do you have millions of objects like this to parse?

lkevin1210 · January 19, 2024, 4:32am

Thanks for the reply

This is one element in a list of json objects, I have ~3 mil record like this.

As shown, each object represents a MEMBER node (uniquely identified by the globalcustid), and the relationships that this node related to.

dana_canzano · January 19, 2024, 12:39pm

@lkevin1210

given a number of apoc.merge.node() do you have indexes on said label/property to which the merge is against.


For performance reasons, creating a schema index on the label 
or property is highly recommended when using MERGE. See 
Indexes for search performance for more information.

The link above is from the Neo4j v5 documentation. Are you using Neo4j v5? Do you know the version?

glilienfield · January 19, 2024, 5:09pm

I reviewed your complex code. It seems like you approached it rather reasonably.

I fully agree with @dana_canzano about ensuring you have indexes on the "id" properties used to match records during the merge. As a note, the "id" property you are extracting as a list and then passing to the apoc.merge.node procedure are used as the identifying properties for the match. These are represented in your code as: node_id, from_node_id, and to_node_id. Some of your id properties lists have two elements. Both of these will be used for the match. Also, you don't have a consistent set of keys across all the nodes, so you will need multiple indexes to ensure the merges use an index to match on.

I don't have experience parsing a json object of this size. This alone could be adding time to your execution performance, as parsing json is not free. Have you baselined the time to just parse using a json library, just to get an idea? I see the Jackson json library as a dependency in the Neo4j Java API I use for custom procedures, so they use Jackson. Apoc probably leverages the same since it is already included.

Do you need all the toLower() method calls, as the strings in your json all look lower case already. This unnecessary string processing could be adding execution time.

This can be moved up to before you extract and set the relationship properties if either the to/from nodes are missing. Why would they be missing if the relationship is in the data?

  // Skip those relationship if no from_node_id or to_node_id
  WHERE from_node_id IS NOT NULL and to_node_id IS NOT NULL

lkevin1210 · January 23, 2024, 6:25am

Thanks for the reply

I do have indexes created on the from_node_id/to_node_id in each of the node.

I tested with the following and the loading time is almost the same:

use a less nested (i.e. flatten) structure json as input to reduce parsing time
cast lower/upper case from upstream while generating the json file to reduce string processing time
perform id NULL checking from upstream to reduce the checking time in neo4j

So in my case, I may need to consider to perform daily incremental load instead of a full load of 3mil records...

glilienfield · January 23, 2024, 12:54pm

The node_id, to_node_, and from_node_id are lists you have extracted from the “id” property of each node. You are passing each to the apoc.merge.node method as the “identification” parameter. These are use to match on an existing node, otherwise the node is created and the key/value pairs are added to the other properties set by the other parameters. It is the keys in these lists that need indexes for you to have fast matches when merging. Looking at your data, here are few of the labels and properties that are in the “id” attribute:

Label, property
MEMBER, globalcustid
MEMBERSHIP, bu
MEMBERSHIP, member_id
AGE_GROUP, tag_id
DEVICE, tag_id
DISTRICT, tag_id
GENDER, tag_id

All of these properties are used as search keys, so you need indexes for each. Note, the MEMBERSHIP label uses both concurrently, so a composite index may be needed. Or, can you remove the “bu” property from the node_id value?

lkevin1210 · January 25, 2024, 2:16am

Yes, it is correct, those are the keys of my nodes, and I have already created index on them.

Right now I am exploring other ways to do a daily incremental load with ~1000 records and only do the 3 hours full reload in weekend.

Thank so much for your suggestion though.

glilienfield · January 25, 2024, 3:44am

That sounds very logical.

henry_bowers · February 8, 2024, 4:33am

At EA we use JSON file formats as preferred import method - new and existing databases - all import data sizes. For larger import jobs I generally split the graph updates into multiple procedure calls. For example, 1 create/merge procedure per node type or rel type. I still create complex JSON structures, with intent to minimize number of import files that have to be generated. In most of our use cases, I can use the same nested JSON file across procedures, plucking the relevant components needed for each import procedure with nested UNWINDs (I think you are already doing something similar).

Glancing at your code, it seems that splitting your MERGE/CREATE/SET clauses to reduce operations per APOC procedure should be relatively easy (I hope).

Topic		Replies	Views
Json loading is slow after some time Import / Export	19	2270	October 25, 2020
How to speed up apoc json load Procedures & APOC apoc , performance , import	9	801	October 18, 2021
Load large json file with new lines as seperator Procedures & APOC json	5	2397	February 15, 2019
How to speed up JSON data load? Neo4j Graph Platform memory , apocperiodiciterate	7	256	February 17, 2023
Importing JSON to Neo4j from a file containing list of JSONs Import / Export apoc , import , json	7	7131	November 12, 2019

Need help to optimize json load performance

Related topics