Preferred method to load small to medium amounts of data

I have a very simple MySQL database consisting of only three tables:

  • companies
  • products
  • companies_products

Table companies has ~470,000 rows of companies with some address data, website, and categories.
Table products has ~46,000 rows and only contains a product name.
Table companies_products is an m:n table that links products to the companies that produce or sell said products. It has ~4,900,000 rows (product_id, company_id, indicator of seller, bulk seller or producer).

The upload only has to run once; I don't need updates at the moment. The graph is very simple: (:company)-[:has]->(:product).
What is the preferred or recommended method to do this?

I've been trying to import this data into a local graph using my freshly downloaded Neo4j Desktop. Using the ETL Tool with Online Direct Import and standard batch sizes, it has now been running for 24 hours (!) and has completed 390,000 rows.
I tried loading the 4.9 million relationship rows with LOAD CSV and CALL { ... } IN TRANSACTIONS, and it is also running way too slow. I tried different IN TRANSACTIONS OF xy ROWS settings to no effect.

Here's the LOAD CSV code:

:auto LOAD CSV WITH HEADERS FROM 'file:///comp_prod.CSV' AS line
CALL {
  WITH line
  MATCH (p:Products {id: toInteger(line.productid)})
  MATCH (c:Companies {id: toInteger(line.companyid)})
  MERGE (c)-[:has {
    seller: toInteger(line.seller),
    bulk_seller: toInteger(line.bulk_seller),
    producer: toInteger(line.producer)
  }]->(p)
} IN TRANSACTIONS

Is there a recommendation for the batch size? My machine is a moderately powerful i7-7700 with 32 GB of RAM.
Should larger imports only be done while the database is stopped? What am I doing wrong here?

Edit: It's Neo4j 5.13 on a Windows 10 machine.

Do you have indexes defined on the “id” fields? This should greatly help with this query, as each relationship you create needs to find the product and company nodes. Without them, it will need to do a full node scan twice.
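For example, the indexes could be created like this (a minimal sketch, assuming the Products/Companies labels and the id property from your query; the index names are just illustrative):

CREATE INDEX companies_id IF NOT EXISTS FOR (c:Companies) ON (c.id);
CREATE INDEX products_id IF NOT EXISTS FOR (p:Products) ON (p.id);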

A batch size of 10,000 should be ok.
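If you want to set the batch size explicitly, the closing line of your LOAD CSV query would become something like:

} IN TRANSACTIONS OF 10000 ROWS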

There is another option:


Yes, it would totally make sense to put an index on companies.id and products.id to speed up the relationship import. Totally forgot about that, my bad.

But what about the initial import of the companies from the database into nodes? Would that be faster if the database is not running at the time of import? Is CSV import via neo4j-admin the preferred method for larger amounts of data?
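For reference, the neo4j-admin bulk importer builds a brand-new database offline, so it only fits an initial one-off load like this, and it expects CSV files with :ID, :START_ID and :END_ID header columns. A rough sketch of the invocation for Neo4j 5 (file names are illustrative, shown with Unix-style line continuations):

bin/neo4j-admin database import full \
  --nodes=Companies=import/companies.csv \
  --nodes=Products=import/products.csv \
  --relationships=has=import/comp_prod.csv \
  neo4j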