Do I need to load the CSV files twice for creating relationships?

For the Northwind CSV loading example, it seems that it first creates the nodes by reading each CSV file once (Tutorial: Import Relational Data Into Neo4j - Developer Guides):

// Create orders
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MERGE (order:Order {orderID: row.OrderID})
  ON CREATE SET order.shipName = row.ShipName;

// Create products
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row
MERGE (product:Product {productID: row.ProductID})
  ON CREATE SET product.productName = row.ProductName, product.unitPrice = toFloat(row.UnitPrice);

After loading all the CSV files, it creates the relationships by loading the CSVs again:

// Create relationships between orders and products
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (product:Product {productID: row.ProductID})
MERGE (order)-[op:CONTAINS]->(product)
  ON CREATE SET op.unitPrice = toFloat(row.UnitPrice), op.quantity = toFloat(row.Quantity);

// Create relationships between orders and employees
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (employee:Employee {employeeID: row.EmployeeID})
MERGE (employee)-[:SOLD]->(order);

Is this the normal way to build a graph from CSV files? It would be great if each file only needed to be loaded once. Just curious.

Yes, you can perform all the operations in a single load, though there may be some performance implications.
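For example, a single pass over orders.csv could look roughly like this, assuming each row carries both the order key and the product key as the relationship query above implies (product-specific properties such as productName would still have to come from products.csv):

// One pass: upsert both endpoint nodes, then the relationship
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MERGE (order:Order {orderID: row.OrderID})
  ON CREATE SET order.shipName = row.ShipName
MERGE (product:Product {productID: row.ProductID})
MERGE (order)-[op:CONTAINS]->(product)
  ON CREATE SET op.unitPrice = toFloat(row.UnitPrice), op.quantity = toFloat(row.Quantity);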

Hi,

When you are loading transactional data (OLTP), the load is driven by the application: typically a node, then a relationship, then another node, and so on.

However, when you are loading an OLAP system for reporting, it is better to load all the dimensions (nodes) first and then the facts (relationships).

Can you be more specific on this:

the load is driven by the application: typically a node, then a relationship, then another node, and so on

One thing is very clear: a relationship can only exist if the relevant nodes are present. However, it is not mandatory that every node be loaded before any relationship is created; only the nodes that the relationship connects need to exist.
And that is what is done in the example you mentioned.

Let's say I have a table like this:

product.csv
`productID, productName, productPrice, OrderID`

To create the relationships between Product and Order, both the Order and the Product nodes need to exist first. But the Order information is stored in Order.csv, not in product.csv, so the nodes have to be created from their own files first and the relationships created afterwards in a second LOAD CSV pass.

On the other hand, if the CSV is this:
product_order.csv
productID, productName, productPrice, orderID, orderAmount, orderAddress

Since both the Order and the Product are on the same line, the nodes and the relationship can be created at the same time, which is different from the first case.
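A sketch of that single pass, using the hypothetical product_order.csv columns above (the property names on the Order node are just illustrative):

LOAD CSV WITH HEADERS FROM 'file:///product_order.csv' AS row
MERGE (product:Product {productID: row.productID})
  ON CREATE SET product.productName = row.productName, product.productPrice = toFloat(row.productPrice)
MERGE (order:Order {orderID: row.orderID})
  ON CREATE SET order.orderAmount = toFloat(row.orderAmount), order.orderAddress = row.orderAddress
MERGE (order)-[:CONTAINS]->(product);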

Can you explain 'When you are loading transactional data (OLTP), the load is driven by the application: typically a node, then a relationship, then another node, and so on' with regard to the two cases above? Is OLTP loading very different from loading with a command-line tool?

Case 1: Product.csv, Order.csv
Case 2: Product_Order.csv

In both of the above cases it is good practice to load the nodes first and then load the relationships. However, in some cases (case 1) that will be a bit costly, since the same file needs to be traversed a few extra times.

OLTP insertion comes from the application; you do not load it via CSV unless you are loading offline. That insertion is application specific. But ultimately we cannot create a relationship unless the related nodes exist.
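As a rough sketch of what such an application-driven insert can look like, a single parameterized statement can merge both endpoint nodes and the relationship in one go, so no separate bulk node load is needed (the labels and parameter names here are only placeholders):

// Application-side upsert: missing endpoint nodes are created on the fly
MERGE (product:Product {productID: $productId})
MERGE (order:Order {orderID: $orderId})
MERGE (order)-[:CONTAINS]->(product);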

We are importing data from an Oracle database and loading it into Neo4j with nodes, relationships, and some other data all in one go. We also do incremental updates.

@ameyasoft Can you share how you organize the nodes and relationships, and how you manage loading in one go?

I am looping over the CSV files, and for each file I call this update_nodes function to merge nodes and properties. The property_dict stores all the property key-value pairs of each node. Since the nodes, property keys, and values are all variables, I have to concatenate strings to produce the Cypher queries. This works, but it seems quite slow, since it has to set each property individually in a loop. I looked at Cypher parameters and the APOC procedures for creating nodes and properties dynamically, but I don't think they work in my case.

@classmethod
def update_nodes(cls, tx, property_dict, label):
    for key, value in property_dict.items():
        cypher_command = ("MERGE (n: " + label + "{ID: '" + property_dict['ID'] + "'})"
                          " ON MATCH SET n." + key + " = \"" + value + "\"")
        tx.run(cypher_command)

Is there a way to optimize this code to make it much faster?
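For reference, a sketch of the parameterized form mentioned above: the whole property_dict is passed as a single map parameter and applied with SET n +=, so there is one query per node instead of one per property (Product stands in for the label here, since labels cannot be parameterized and would still need to be concatenated into the query string):

// One statement per node: $props is the whole property_dict passed as a parameter
MERGE (n:Product {ID: $id})
  ON MATCH SET n += $props

From the Python driver this would be a single tx.run(query, id=property_dict['ID'], props=property_dict) call per node.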

We create a view in Oracle that generates, say, customer id, customer name, order id, ....
and use the Neo4j driver and Java to import the data into the Neo4j db.

Is there any way to optimize this code? @neo4j Team
