How to improve this query?

Hi! Good morning friends.

I'm trying to find out what is the last nodes that have a relationship created for another node in my project.
For that I wrote this Cypher statment:
MATCH (CPF1:PF)-[:PARENTE_DE]->(CPF2:PF) return CPF1.CPF ORDER BY CPF1.CPF desc limit 10

This statmant is running too slow. I'd like to know how to improve this query.


Can you do a PROFILE of the query instead, and expand all elements of the resulting plan (double-down arrow in the lower right of the result pane)?

After that's done, you can use the export option (first button in the upper right of the result pane) to export the plan image and add it here.

It would also help to know how much memory is configured for pagecache and heap (from neo4j.conf) as well as the total memory available on your machine.

Thank you Andrew!

The machine has 24gb RAM and 12 processors. Pagecache.size=16gb. heap.initial_size=2g. heap.max_size=4g.

The PROFILE of the query takes almost 20 minutos to run. Too slow. The cypher statment is well written, right? I create a index for the property CPF of PF node, but i doesn't help to speed up the query.

If [:PARENTE_DE] relationships can only connect to other :PF nodes, then you can substitute a degree check instead of doing the match:


The first MATCH matches to 153 million nodes. If only a fraction of those nodes meets the criteria, there may be no way around having to filter through a whole bunch until you get the 10 you're looking for.

Perhaps if the data doesn't change much, and if this query is frequent, you can try adding a new label to just these nodes, and then subsequent lookup by the label should be quick.

also what does "last" mean?
with a recent Neo4j version and an index on CPF1.CPF (you should really get better names for your labels and properties) you get index backed sorting and pagination.


wait till index is online

explain MATCH (CPF1:PF)
  AND (CPF1)-[:PARENTE_DE]->() 

Hi Micheal! Thank you for you reply!

I've already created this index. It's already online.

I'm gonna try to explain my cenario. I'm running a huge ETL process (about 1 billion rows) migrating data from SQL Server to Neo4j. The (:PF) nodes are already loaded, so right now i'm trying to load the relationships beetwen them. This load process is running too slow for my needs. So, in order to improve the load statment, i stopped the process to change some configurations in the server and in the cypher statment as well.

I need to find out what is the last value in the property CPF (label PF) in order to continue the process from that point. I would like to find out this value in a better way, because the query is running too slow. Every time that i stop the ETL process i need to find out the last value and it's taking me a lot of time to do that.

By the way, the ETL statment is:

"SELECT CPF1, UPPER(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(relacao,'ã', 'a'),'ô','o'),'ó','o'),'í','i'),'é','e')) AS REL, CPF2
FROM [BD_BG_SRF_P].[INFOCONTAS].[RELACAO_PF_PF] as RPF WHERE CPF1 >= '08593160689'" as query,
"jdbc:sqlserver://TCEBIDES12:1433;databaseName=BD_BG_SRF_P;user=;password=*" AS url
CALL apoc.periodic.iterate(
'CALL apoc.load.jdbc($url, $query) YIELD row',
where p1.CPF=row.CPF1 and p2.CPF=row.CPF2
create (p1)-[:PARENTE_DE {grau: row.REL}]->(p2)
{batchSize:10000,iterateList:true,parallel:true,params:{url: url, query: query}}) YIELD batches
RETURN batches

The value '08593160689' would be the last value loaded in Neo4j than i intend to continue from that point.

Would be possible to improve the ETL statment as well?

Thank you!