Hello there!
I am new to Neo4j, so I would be happy to receive any help or hints. I am using Neo4j 5.20.0 on a Linux server with 32 GB RAM and 8 vCPUs, together with Neo4j Desktop 1.5.8.
I want to analyse order data: 370,000 items, 900,000,000 sales, and 130,000,000 orders. Synchronizing the data took 3 days, and it takes up roughly 300 GB in Neo4j. I loaded it in small batches of 32,000 rows, but now it seems that I did it wrongly. I created the relationships as (Order)-[:SALE]->(Item), and I created an index: CREATE INDEX item_index FOR (i:Item) ON (i.id).
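Just to rule out index problems, this is roughly how I check that the index is ONLINE (assuming the SHOW INDEXES columns in Neo4j 5 are named like this):
// Check only, not a fix: confirm the item index exists and is ONLINE
SHOW INDEXES YIELD name, state, labelsOrTypes, properties
WHERE name = 'item_index'
RETURN name, state, labelsOrTypes, properties;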
So my questions:
- Should I have modelled it the other way around, i.e. (Item)-[:SALE]->(Order)?
- Is it normal for a load of this size to take 3 days? The items loaded in about 60 seconds using multiple threads, with one batch of items taking around 3–4 seconds. One batch of sales and orders took around 6–8 seconds, but I ran into problems with multithreaded loading, so I loaded them in a single thread (see the sketch after the load query below).
This is how I load my sales and orders data:
UNWIND $sales AS sale
CREATE (o:Order {number: sale[3], date: sale[0]})
WITH o, sale
MATCH (i:Item {id: sale[2]})
CREATE (o)-[r:SALE {
    order_number: sale[3], number_in_order: sale[4],
    price: sale[5], valuerub: sale[6], valuesht: sale[7],
    val: sale[8],
    sebes: sale[9], price_type_id: sale[10]
}]->(i)
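Because the multithreaded client-side batching was fragile, I was also considering loading from a CSV export and letting Neo4j batch the commits itself with CALL { ... } IN TRANSACTIONS (run as an implicit transaction, e.g. prefixed with :auto in Browser). This is only a sketch of what I have in mind; the file name and column names below are placeholders, not my real export:
// Sketch only: server-side batched load from a hypothetical sales.csv export
LOAD CSV WITH HEADERS FROM 'file:///sales.csv' AS row
CALL {
    WITH row
    // same pattern as above: one Order node per row, linked to the matched Item
    CREATE (o:Order {number: row.order_number, date: row.date})
    WITH o, row
    MATCH (i:Item {id: toInteger(row.item_id)})
    CREATE (o)-[:SALE {order_number: row.order_number, price: toFloat(row.price)}]->(i)
} IN TRANSACTIONS OF 10000 ROWS;
With the index on Item.id the MATCH inside the subquery should stay cheap, but I do not know whether this would actually be faster than my current approach.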
- I want, for example, to analyse which items are often bought together with a given item, so the query looks like this:
MATCH (target:Item {id: 155868})<-[:SALE]-(o:Order)-[:SALE]->(i:Item)
WHERE i.id <> 155868
RETURN i.id AS item_id, COUNT(*) AS co_occurrence_count
ORDER BY co_occurrence_count DESC;
However, when I run it, the query takes 97,477 ms to execute. My 8 cores sit at 0–2% utilization and RAM usage does not increase. Do I need to adjust some Neo4j settings to make it run faster?
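I have not tuned any memory settings yet; I only know how to look up the current values, roughly like this (assuming the Neo4j 5 names for the heap and page cache settings are the relevant ones):
// Inspect current memory-related settings (names assumed for Neo4j 5)
SHOW SETTINGS
YIELD name, value
WHERE name IN ['server.memory.heap.initial_size',
               'server.memory.heap.max_size',
               'server.memory.pagecache.size']
RETURN name, value;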
Plan of the query: