How can I improve the performance of this query?

gabrielsantos.ifce · February 21, 2019, 7:22pm

Hello guys,

I'm using the apoc.periodic.iterate to create a graph from a csv file containing 10k rows (to start off), but I think it's taking longer than it should to finish the command (around 50 seconds). Here is how the entire command looks like and the image of the results:

CALL apoc.periodic.iterate("LOAD CSV WITH HEADERS FROM 'file:///vendas_cupom_subset.csv' AS file RETURN file", 
"MERGE (n:Cliente {cpf: file.CPF}) MERGE (m:Cupom {codigoCupom: file.NUMEROUNICOCUPOM}) MERGE (o:Produto {codigoProduto: file.CODIGOPRODUTO}) MERGE (p:Filial {codigoFilial:file.CODIGOFILIAL})  MERGE (n)-[:FEZ_PEDIDO {dataMovimento: file.DATAMOVIMENTO}]->(m) MERGE (m)-[:CONTEM {qtd:file.QTD_PRODUTO}]->(o) MERGE (o)-[:VENDIDO]->(p)", 
{batchSize: 1000, iterateList:true, parallel: false});

I'm running it on neo4j desktop. My computer is aN i5 vpro 2.50 ghz, 8gb ram and 200 gb ssd. I didn't alter the configurations to increase the heap/ram memory, so it's default. I also tested running the same command on cypher-shell, but the difference was very little. Also tried changing the batchsize to 100 and 10k, but it got worse.

I'm really concerned because I'll have to create the same graph, reading from the same csv containing millions of lines, and it doesn't seem very performatic.

How can I improve my query, having in mind that in the future i'll have to insert millions of registers.?

here's how my csv file looks like:

If I set the parallel parameter 'true', it inserts faster, but with failed operations:

stefan.armbruster · February 22, 2019, 5:33pm

Since you do MERGE on nodes during import, be sure to have indexes in place before running the import. In your case:

create index on :Cliente(cpf);
create index on :Cupom(codigoCupom);
create index on :Produto(codigoProduto);
create index on :Filial(codigoFilial);

Run them either one-by-one or in one block with the multi-statement editor config enabled. Then try the import again.

gabrielsantos.ifce · February 22, 2019, 7:46pm

Thank you very much! The time decreased from 50 seconds to 1 second.

Do you know if it's possible to run the command with "parallel:true" without the failed operations? If not, what is causing the failures?

Thanks again, in advance.

stefan.armbruster · February 23, 2019, 8:11pm

parallel: true makes sense if you don't have overlapping locks in concurrently running batches. Otherwise they have to wait for each other, which effectively serializes the parallel operations.

Please take a look a log/debug.log regarding suspicious messages there. The NullPointerException indicates some unexpected behaviour.

gabrielsantos.ifce · February 23, 2019, 8:13pm

Alright, thank you very much.

michael.hunger · April 4, 2019, 7:54am

Also you need to check your heap + page-cache config for your large import.
heap should be 2-8G and PC should be according to neo4j-admin memrec but best as large as your expected store. You can build a small store and multiply the size.

You can increase page-size to 100k. with parallel:false in your case.

Topic		Replies	Views
Understanding `apoc.periodic.iterate` parallel performance Cypher apoc , performance , browser , cypher	4	621	November 1, 2023
How to make load csv go faster? Neo4j Graph Platform migrated , cypher-tagged	2	185	December 23, 2022
How to make load csv go faster? Cypher	17	3940	November 20, 2023
Creating Large Number of Edges with `apoc.periodic.iterate` Cypher apoc , performance , cypher	3	399	October 24, 2023
Statement using Apoc Periodic Iterate gets stuck, but works without the iterate Cypher apoc , cypher	3	176	March 10, 2023

How can I improve the performance of this query?

Related topics