Hi all, I am new to Neo4j, and I think I need help speeding up my graph creation. I did a lot of 'quirky' things to get Neo4j working the way I want it to, and I think my decisions have finally caught up to me, because graph creation runs dreadfully slow. I am using version 5.2.
I have three different CSV files, described below.
labels_and_ids.csv: has 61,392 entries and is used to create 61,392 nodes. In my application, nodes are uniquely identified by an id and a label. It has the following column headers:
node_label, node_id
nodes_and_properties.csv: has 1,017,460 entries. It is used to create 1,017,460 key-value pairs in node properties across the various nodes, meaning that on average each node has roughly 16 properties. It has the following column headers:
node_id, node_label, property_key, property_value
edges.csv: has 118,047 entries. It is used to create 118,047 relationships between nodes. This file includes edge properties. It has the following column headers (note that "properties" holds a JSON map):
head_id, head_label, relationship, tail_id, tail_label, properties
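For context, a made-up example row of edges.csv would look like this (the ids, labels, and property values are invented here, just to show the shape of the data):

```
head_id,head_label,relationship,tail_id,tail_label,properties
p42,Person,WORKS_AT,c7,Company,"{""since"": 2019}"
```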
To load in the three CSV files, I use the following Cypher commands:
LOAD CSV WITH HEADERS FROM "file:///labels_and_ids.csv" AS row
CALL apoc.create.node([row.node_label], {id: row.node_id, label: row.node_label})
YIELD node
RETURN count(*);
LOAD CSV WITH HEADERS FROM "file:///nodes_and_properties.csv" AS row
MATCH (n) WHERE (n.id = row.node_id) AND (n.label = row.node_label)
CALL apoc.create.setProperty(n, row.property_key, row.property_value)
YIELD node
RETURN node;
LOAD CSV WITH HEADERS FROM "file:///edges.csv" AS row
MATCH (head) WHERE (head.id = row.head_id) AND (head.label = row.head_label)
MATCH (tail) WHERE (tail.id = row.tail_id) AND (tail.label = row.tail_label)
CALL apoc.create.relationship(head, row.relationship, apoc.convert.fromJsonMap(row.properties), tail)
YIELD rel
RETURN rel;
Although I am new to Cypher, I can tell that it prefers hardcoded relationship types and labels. Unfortunately, I have a few hundred of each, with more likely on the way, making that approach tedious. Thus, I decided to use APOC to create things dynamically. When I initially started this project, I (roughly) used the three Cypher commands shown above with about one fifth of the data. That took about 90 minutes in total to run, but once I finally had my graph, everything worked super well! I was able to query it and discovered several things with important implications for my company. This made me excited to try the full dataset, but after 14 hours it still had not completed the second command (the one that loads in 1,017,460 properties, each belonging to one of the nodes I initially created). This tells me that if I wish to scale, I need a better approach.
In attempts to speed things up, I used the following commands:
MATCH (n) SET n:_is_node;
CREATE INDEX logical_index FOR (n:_is_node) ON (n.id, n.label);
The rationale here was to give all nodes a common label, and then index each node by its id and label. I then re-ran the commands above, and it is still taking several hours and has not completed yet. My first hint that my approach was suboptimal was when I felt I had to give all nodes a common label and repeat each node's label in its properties to assist with indexing. Now that things are taking several hours to load, I know my approach must be flawed. Does anyone have any ideas on how I might speed this up?
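For completeness, this is the batched variant of the property load I was about to try next (untested, so I am not sure it addresses the root cause). It puts the :_is_node label in the MATCH so the planner can actually use the new index, and commits in chunks via CALL { } IN TRANSACTIONS:

```cypher
// Prefix with :auto in Browser so IN TRANSACTIONS is permitted
:auto LOAD CSV WITH HEADERS FROM "file:///nodes_and_properties.csv" AS row
CALL {
  WITH row
  // Label included so the composite index on (id, label) can be used
  MATCH (n:_is_node {id: row.node_id, label: row.node_label})
  CALL apoc.create.setProperty(n, row.property_key, row.property_value)
  YIELD node
  RETURN node
} IN TRANSACTIONS OF 10000 ROWS
RETURN count(*);
```

Would something like this help, or is the per-row APOC call itself the problem?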
Thank you for your time!