How to speed up JSON data load?

Anita_Lavania · February 16, 2023, 5:04am

Hi,

I am trying to load 200K+ nodes (the Yelp user dataset broken down into multiple JSON files) from Neo4j Desktop into a local database. I have tried to increase the memory for this process by modifying the neo4j.conf file as follows:

dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=1G

With initial heap size anything more than 0.5G, the DBMS won't start.

I am also using the apoc.periodic.iterate procedure:

CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///yelp/yelp_user_1.json') YIELD value AS user",
"WITH user.user_id AS user_id, user.friends_list AS friends_list MERGE (u:User {ID: user_id}) WITH u, friends_list UNWIND friends_list AS friend MERGE (u2:User {ID: friend}) MERGE (u)-[:friends_with]->(u2)",
{batchSize:5000})

I estimate that this load will take 16 hours for 5000 rows alone. How do I improve this? Is there anything I can change in the conf file, or another function I can use, that would speed up this load?

Thanks,
Anita

glilienfield · February 16, 2023, 12:33pm

Do you have an index for User(ID)?

Anita_Lavania · February 16, 2023, 2:08pm

No, I have not used indexing. But from what I understand, indexing is only going to speed up queries later on, not this data load. Isn't that right?

glilienfield · February 16, 2023, 2:13pm

That is not accurate. The merge operation will first try to match the node, and create it if it is not found. As such, the index will be utilized. Your statement would be accurate if you used create instead of merge.
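
For reference, a minimal sketch of such an index, assuming the label and property from your query (User and ID) and a reasonably recent Neo4j version that accepts this syntax. The index name user_id_index is just a placeholder I made up:

```cypher
// Create a single-property index on User.ID so that
// MERGE (u:User {ID: ...}) can look the node up through the index
// instead of scanning every User node.
// "user_id_index" is a hypothetical name; pick any name you like.
CREATE INDEX user_id_index IF NOT EXISTS
FOR (u:User) ON (u.ID);
```

Run this once before starting the load, so the index is in place when the MERGE statements execute.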

Anita_Lavania · February 16, 2023, 4:34pm

Thanks Gary.

That makes sense. However, I have a question. Like in SQL databases, if a modification is made in the data (e.g. adding a new 'friend ID' in the "friends" column), the index table will have to be updated as well.

Is there an equivalent to that in graph DB in a similar modification scenario?
For example, suppose an indexed User_ID "id_4" has the relationship [:friends_with] with User_IDs "id_5" and "id_23". Later, "id_57" is also "friends_with" "id_4". Will the indexes on the User_ID nodes need to be updated?

Thanks in advance.
Anita

glilienfield · February 16, 2023, 7:03pm

The indexes will be updated whenever you alter that property for a node with the label.

Relationships are manually managed by you; they are actual relationships in the data file. It is not the same as in a relational database that uses primary and foreign keys to derive the relationship between tables in realtime.

If a new User node is created with id=57 and it should be related to another User node with id=4, then the relationship needs to be created. The index on User(id) will be updated when the new User node is added.
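
As a rough sketch of that scenario, assuming the ID property and the friends_with relationship type from your load query, and using the example values "id_57" and "id_4" from your question:

```cypher
// Find or create both User nodes. Only a newly created node
// adds an entry to the index on User(ID).
MERGE (a:User {ID: 'id_57'})
MERGE (b:User {ID: 'id_4'})
// Creating the relationship changes no indexed property,
// so the User(ID) index is not touched by this step.
MERGE (a)-[:friends_with]->(b);
```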

Anita_Lavania · February 17, 2023, 4:03am

So indexing should not be done on labels for which new data is frequently appended?

glilienfield · February 17, 2023, 4:26am

I would create indexes on the labels/properties that improve your query performance. To me, that is the most important consideration. Updating the index as data changes is a necessary cost of having an index. As such, you add indexes for the things you need, not for every label/property combination.
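
One way to check whether an index would help a given query is to look at its plan. A small sketch, assuming the User label and ID property from this thread, with "id_4" as an example value:

```cypher
// EXPLAIN shows the query plan without running the query.
// If an index on User(ID) exists and is used, the plan should contain
// a NodeIndexSeek operator; otherwise expect a label scan plus a filter.
EXPLAIN
MATCH (u:User {ID: 'id_4'})
RETURN u;
```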
