I am trying to load 200K+ nodes (the Yelp user dataset, broken down into multiple JSON files) from Neo4j Desktop into a local database. I have tried to increase the memory for this process by modifying the neo4j.conf file as follows:
With an initial heap size of anything more than 0.5G, the DBMS won't start.
I am also using the apoc.periodic.iterate procedure:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///yelp/yelp_user_1.json') YIELD value AS user",
"WITH user.user_id AS user_id, user.friends_list AS friends_list MERGE (u:User {ID: user_id}) WITH u, friends_list UNWIND friends_list AS friend MERGE (u2:User {ID: friend}) MERGE (u)-[:friends_with]->(u2)",
{batchSize:5000})
By my extrapolation, loading 5,000 rows alone would take 16 hours. How do I improve this? Is there anything I can change in the conf file, or another function I can use, that would speed up this load?
That is not accurate. The MERGE operation first tries to match the node and creates it only if it is not found, so the index will be used for the match. Your statement would be accurate if you used CREATE instead of MERGE.
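As a rough sketch of what that looks like in practice: the statements below assume the node key is the ID property on the User label, as in the query from the question, and that you are on a recent Neo4j version (the REQUIRE syntax is from 4.4 onward; older versions use ON ... ASSERT instead). The constraint name user_id_unique is just a placeholder I made up.

```cypher
// Assumption: nodes are keyed by :User(ID), as in the question's query.
// A uniqueness constraint also creates a backing index on :User(ID).
CREATE CONSTRAINT user_id_unique IF NOT EXISTS
FOR (u:User) REQUIRE u.ID IS UNIQUE;

// Inspect the plan for a single MERGE. With the index in place the plan
// should show an index seek on :User(ID) rather than a label scan.
// 'id_4' is a made-up sample value.
PROFILE
MERGE (u:User {ID: 'id_4'})
RETURN u;
```

If the plan still shows a NodeByLabelScan, the MERGE is not finding an index for that label and property, and every row in the batch scans all User nodes.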
That makes sense. However, I have a question. In SQL databases, if the data is modified (e.g. a new 'friend ID' is added to the "friends" column), the index table has to be updated as well.
Is there an equivalent in a graph database for a similar modification?
For example, suppose an indexed User_ID "id_4" has a [:friends_with] relationship with User_IDs "id_5" and "id_23". Later, "id_57" also becomes "friends_with" "id_4". Will the indexes on the User_ID nodes need to be updated?
The indexes will be updated whenever you alter that property for a node with the label.
Relationships are managed manually by you; they are actual relationships stored in the data file. This is not the same as a relational database, which uses primary and foreign keys to derive the relationships between tables in real time.
If a new User node is created with id=57 and it should be related to another User node with id=4, then the relationship needs to be created explicitly. The index on User(id) is updated when the new User node is added; creating the relationship itself does not change that index.
I would create indexes on the labels/properties that improve your query performance. To me that is the most important thing. Updating an index as the data changes is a necessary cost of having it, so you add indexes for the things you need, not for every label/property combination.
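To make the "id_57 becomes friends with id_4" case concrete, here is a minimal sketch. It assumes the property is named ID and the relationship type is friends_with, as in the query from the question; the values themselves are the made-up ones from your example.

```cypher
// Find or create both users. Creating the 'id_57' node adds one entry
// to the index on :User(ID); 'id_4' already exists, so it is only matched.
MERGE (a:User {ID: 'id_57'})
MERGE (b:User {ID: 'id_4'})
// Create the relationship if it is not there yet. This writes a
// relationship record but does not touch the :User(ID) index.
MERGE (a)-[:friends_with]->(b);
```

So the only index write here comes from the new node, not from the new friendship.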
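If you want to check what you are currently paying for, something along these lines should work on Neo4j 4.2 or later (an assumption about your version). The index name some_unused_index is hypothetical; substitute a name from the listing.

```cypher
// List the existing indexes, including those backing constraints.
SHOW INDEXES;

// Drop an index that no query benefits from.
// 'some_unused_index' is a placeholder name, not one from this thread.
DROP INDEX some_unused_index IF EXISTS;
```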