Preferred method to load small to medium amounts of data

I have a very simple MySQL database consisting of only three tables:

  • companies
  • products
  • companies_products

Table companies has ~470,000 rows of companies with some address data, website, and categories.
Table products has ~46,000 rows and only contains a product name.
Table companies_products is an m:n table that links products to the companies that produce or sell said products. It has ~4,900,000 rows (product_id, company_id, indicator of seller, bulk seller or producer).

The upload only has to run once; I don't need updates at the moment. The graph is very simple: (:company)-[:has]->(:product).
What is the preferred or recommended method to do this?

I've been trying to import this data into a local graph using my freshly downloaded Neo4j Desktop. Using the ETL Tool with Online Direct Import and standard batch sizes, it has now been running for 24 hours (!) and has completed 390,000 rows.
I tried loading the 4.9 million relationship rows with LOAD CSV and CALL { ... } IN TRANSACTIONS, and it is also running way too slow. I tried different IN TRANSACTIONS OF xy ROWS settings to no effect.

Here's the LOAD CSV code:

:auto LOAD CSV WITH HEADERS FROM 'file:///comp_prod.CSV' AS line
CALL {
  WITH line
  MATCH (p:Products {id: toInteger(line.productid)})
  MATCH (c:Companies {id: toInteger(line.companyid)})
  MERGE (c)-[:has {
    seller: toInteger(line.seller),
    bulk_seller: toInteger(line.bulk_seller),
    producer: toInteger(line.producer)
  }]->(p)
} IN TRANSACTIONS

Is there a recommendation for the batch size? My machine is a moderately powerful i7-7700 with 32 GB of RAM.
Should larger imports only be done while the database is stopped? What am I doing wrong here?

Edit: It's Neo4j 5.13 on a Windows 10 machine.

Do you have indexes defined on the “id” fields? This should greatly help with this query, as each relationship you create needs to find the product and company nodes. Without them, it will need to do a full node scan twice.
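For example, the indexes could be created like this (a minimal sketch, assuming the Products/Companies labels and the id property from your query; the index names are just illustrative):

CREATE INDEX companies_id IF NOT EXISTS FOR (c:Companies) ON (c.id);
CREATE INDEX products_id IF NOT EXISTS FOR (p:Products) ON (p.id);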

A batch size of 10,000 should be ok.
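If you want to set the batch size explicitly, the closing line of your LOAD CSV query would become something like:

} IN TRANSACTIONS OF 10000 ROWS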

There is another option:


Yes, it would totally make sense to put an index on companies.id and products.id to speed up the relationship import. Totally forgot about that, my bad.

But what about the initial import of the companies from the database into nodes? Would that be faster if the database is not running at the time of import? Is CSV import via neo4j-admin the preferred method for larger amounts of data?
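For reference, the neo4j-admin bulk importer builds a brand-new database offline, so it only fits an initial one-off load like this, and it expects CSV files with :ID, :START_ID and :END_ID header columns. A rough sketch of the invocation for Neo4j 5 (file names are illustrative, shown with Unix-style line continuations):

bin/neo4j-admin database import full \
  --nodes=Companies=import/companies.csv \
  --nodes=Products=import/products.csv \
  --relationships=has=import/comp_prod.csv \
  neo4j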