neo4j.Result in the neo4j Python driver is very slow

Hi there,

First of all, I am using Neo4j Community Edition 4.4.8 and the Python driver neo4j == 1.7.2. I have also tried neo4j == 4.4.0, but it didn't solve my problem. I am running it on a remote server.

What I am trying to do:

I am using a dump of Wikidata that takes up 200 GB in Neo4j and requires around 30 GB of RAM to run. For certain nodes I want to get the number of outgoing edges.

    def get_number_of_neighbors(self, nodeid):
        with self.driver.session() as session:
            result = session.read_transaction(self.__fetch_number_of_neighbors, nodeid)
            return result

    # Run a Cypher query that counts the outgoing edges of "nodeid"
    def __fetch_number_of_neighbors(self, tx, nodeid):
        try:
            result = tx.run("MATCH (n:e {nodeid:'" + nodeid + "'})-[r]->() RETURN COUNT(r) as r")
            unpack_result = [i['r'] for i in result]
            return unpack_result
        except:
            return 0

So it will return a list with the number of outgoing edges.
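For example, called from a hypothetical wrapper class that holds self.driver (the class name, URI, and credentials below are placeholders; the node id Q886 is borrowed from later in this thread):

    from neo4j import GraphDatabase

    # WikidataGraph is a placeholder for whatever class defines the methods above
    db = WikidataGraph(GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")))
    print(db.get_number_of_neighbors("Q886"))  # e.g. [123]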

The Problem:

It is way too slow. With neo4j == 1.7.2 the query itself is really fast, but unpacking the neo4j.BoltStatementResult (the list comprehension above) takes roughly a minute. When using neo4j == 4.4.0, the query is the part that takes more time. I have tried other ways of unpacking the result (.single().value() or data()), but nothing helps...

So does anyone know why this could be slow?

Thank you 🙂

Hello @p_xcx

I wanted to see if you are still trying to improve your query or if you were able to come up with a solution.
If you are still working towards a solution, you can try indexing :e(nodeid), which should help improve your performance (see the sketch below).
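A minimal sketch of creating that index through the Python driver, assuming Neo4j 4.x index syntax and a driver object as in your snippet (the index name e_nodeid is arbitrary):

    with driver.session() as session:
        # A btree index on :e(nodeid) lets MATCH look the node up directly
        # instead of scanning every :e node in the store.
        session.run("CREATE INDEX e_nodeid IF NOT EXISTS FOR (n:e) ON (n.nodeid)")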
Also, what if you replaced

MATCH (n:e {nodeid:'" + nodeid + "'})-[r]->() RETURN COUNT(r) as r

with

MATCH (n:e {nodeid:'" + nodeid + "'}) return size ( (n)-[r]->() );

This should also help to improve the speed!
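Putting both ideas together, here is a minimal sketch of the fetch method, also swapping the string concatenation for a query parameter so Neo4j can cache and reuse the execution plan across calls:

    def __fetch_number_of_neighbors(self, tx, nodeid):
        # size((n)-->()) reads the node's outgoing degree instead of
        # expanding and counting every relationship row by row
        result = tx.run(
            "MATCH (n:e {nodeid: $nodeid}) RETURN size((n)-->()) AS r",
            nodeid=nodeid,
        )
        record = result.single()  # one nodeid -> at most one row
        return record["r"] if record else 0

Note that this returns a single integer rather than a list.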

Thanks,

Hey Trevor,

thanks for your help!

That is a better option and makes it faster, but my problem is somewhere else. I am still working on it. It probably has to do with my dataset.

Thanks,

Ellis

Hi @p_xcx, given that you've said your dataset is quite large and you're running on Community Edition, this is one of those classic performance setups where the issue may be your page cache size.

You've said your data is 200 GB and your memory is 30 GB. What's likely happening is that your page cache is too small, so most of your time goes into the server copying data back and forth between disk (slow to access) and page cache (fast to access). The nature of your query is that it looks through a lot of different relationships; I'm sure you have many GB of those. The query you're running doesn't require a lot of heap, but you would probably benefit from as large a page cache as you can afford to give it.

Think about page cache size as a percentage of your overall database size. If it's something like 10%, the next piece of data the database needs is very unlikely to be in memory, and the database will probably have to go to disk for it.
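For illustration, these knobs live in neo4j.conf; the numbers below are placeholders for a 30 GB machine, not a tuned recommendation:

    # neo4j.conf -- leave headroom for the OS and other processes
    dbms.memory.heap.initial_size=4g
    dbms.memory.heap.max_size=4g
    dbms.memory.pagecache.size=20g

Running neo4j-admin memrec will also print suggested values for your particular store.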

If you want to confirm or rule this out before spending more on RAM or reconfiguring, you can check Neo4j's internal metrics. Look for "page cache faults" (the number of times the database didn't find something in the page cache). Check the value, run the query, then check the value again. If the number of page cache faults is much higher afterwards, this is almost certainly the issue. https://neo4j.com/docs/operations-manual/current/monitoring/metrics/reference/

Relevant link where you can read more: https://neo4j.com/developer/guide-performance-tuning/

Hi @david_allen ,

thank you for your answer. I think you are right about the page cache. I looked into your suggestion of checking the internal metrics, but it wasn't quite clear how I can do it, or whether I can do it at all with my Community Edition.

I have realised that some queries take longer than others. I am putting the nodes into RAM with:

CALL apoc.warmup.run(true, true, true);

and then the following query is fast (it collects up to 1000 one-hop paths starting at node Q886):

MATCH (n:e {nodeid: 'Q886'}) CALL apoc.path.expandConfig(n, {minLevel: 1, maxLevel: 1}) YIELD path RETURN nodes(path) AS nodes, relationships(path) AS relations LIMIT 1000;

But when I want to count the number of outgoing relationships, it takes longer and has to go through more hits.

Do you know why this is the case, and whether it is the cause of the long execution time?

Thank you so much!

Ellis