In Python neo4j.Result.to_df() slow conversion to pandas df (for very small result)

dvargas · September 23, 2024, 4:06pm

Hi all,

I am interfacing with my Neo4j DBMS from Python using the GraphDatabase driver.

I am running a Cypher query that computes personalized PageRank on a graph projection and returns a single value. This runs fast enough (~0.8s for a graph with about 12k nodes and 200k edges).

Somehow, converting the result to a pandas DataFrame takes ~70s (Neo4j v5.23.0).

This is my code snippet:

import neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
n4j = driver.session(database=NEO4J_DATABASE)

cypher_1 = """
MATCH (g:Gene{name:$excluded_source})
RETURN id(g) AS excluded_source_id
"""

cypher_2 = """
MATCH (crc_genes:Gene{source:'TCGA'}) WHERE crc_genes.name <> $excluded_source
WITH crc_genes
CALL gds.pageRank.stream($graph_proj_name, {
dampingFactor: 0.85,
maxIterations: 20,
sourceNodes: [id(crc_genes)]
}) YIELD nodeId, score
WHERE nodeId = $excluded_source_id
WITH gds.util.asNode(nodeId) AS node, SUM(score) AS score
RETURN COALESCE(node.name, node.ENSP) AS gene_identifier, score
"""

# Cypher returning node ID of gene of interest
result_1 = n4j.run(cypher_1, excluded_source=EXCLUDED_SOURCE)

excl_source_id = result_1.to_df()

# Cast to a plain int so it can be passed back as a query parameter
EXCLUDED_SOURCE_ID = int(excl_source_id.loc[0, "excluded_source_id"])

# Cypher calculating personalized page rank
# ($excluded_source_id is used in cypher_2, so it has to be passed here)
result_2 = n4j.run(
    cypher_2,
    excluded_source=EXCLUDED_SOURCE,
    excluded_source_id=EXCLUDED_SOURCE_ID,
    graph_proj_name=PROJECTION_NAME_1,
)

# Line that takes over a minute to run!
df = result_2.to_df()

The output looks something like this:

print(df)

   gene_identifier     score
0              APC  0.093877

In the past, with a different and smaller graph (1.3k nodes and 19k edges, Neo4j v5.21.0), this took fractions of a second.
Edit: With the smaller graph, the line also runs quickly in v5.23.0.

Why is it taking so long? I have tried returning just the score as a float; I also tried filtering before RETURN within the cypher, as you see in the code, but this changes nothing.

Any help is appreciated.

dvargas · September 26, 2024, 9:16am

The difference between the operations performed on the two graphs (slow and fast at converting to a DataFrame) was not only the graph size but, more importantly, the number of source nodes I used for personalized PageRank.

In the end, the larger number of sources is what made the algorithm slower. What is still unclear is why the delay only shows up when neo4j.Result.to_df() is executed. My guess is that n4j.run(cypher) does not run the query to completion but only queues it, so that the work happens when the result is consumed?
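
One way to check this guess is to time the two phases separately. The sketch below does not talk to Neo4j at all: `lazy_rows` is a hypothetical stand-in (a plain Python generator) for a result whose rows are only computed when they are read, and `timed` is a small helper made up for this illustration. As far as I understand, the driver streams records lazily in a similar way, but treat that as an assumption rather than a statement about its internals.

```python
import time


def timed(label, fn, timings):
    """Call fn(), store the elapsed wall-clock seconds under label, return fn's result."""
    start = time.perf_counter()
    value = fn()
    timings[label] = time.perf_counter() - start
    return value


def lazy_rows():
    """Hypothetical stand-in for a lazily streamed result: work happens on iteration."""
    time.sleep(0.2)  # simulates the expensive computation behind the first row
    yield {"gene_identifier": "APC", "score": 0.093877}


timings = {}
# Creating the generator returns immediately; nothing has been computed yet.
rows = timed("run", lazy_rows, timings)
# Reading the rows is where the full cost is paid.
data = timed("consume", lambda: list(rows), timings)

print(timings)
```

With the real session, the same helper could wrap `n4j.run(cypher_2, ...)` as the "run" phase and `result_2.to_df()` as the "consume" phase. If the second number dominates, the time is being spent producing the rows, not building the DataFrame.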
