Hi all,
I am interfacing with my Neo4j DBMS from Python using the GraphDatabase driver.
I am running a Cypher query that computes personalized PageRank on a graph projection and returns a single value. This runs fast enough (~0.8 s for a graph with about 12k nodes and 200k edges).
However, converting the result to a pandas DataFrame takes ~70 s (Neo4j v5.23.0).
This is my code snippet:
import neo4j
from neo4j import GraphDatabase
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
n4j = driver.session(database=NEO4J_DATABASE)
cypher_1 = """
MATCH (g:Gene{name:$excluded_source})
RETURN id(g) AS excluded_source_id
"""
cypher_2 = """
MATCH (crc_genes:Gene{source:'TCGA'}) WHERE crc_genes.name <> $excluded_source
WITH crc_genes
CALL gds.pageRank.stream($graph_proj_name, {
    dampingFactor: 0.85,
    maxIterations: 20,
    sourceNodes: [id(crc_genes)]
}) YIELD nodeId, score
WHERE nodeId = $excluded_source_id
WITH gds.util.asNode(nodeId) AS node, SUM(score) AS score
RETURN COALESCE(node.name, node.ENSP) AS gene_identifier, score
"""
# Cypher returning the node ID of the gene of interest
result_1 = n4j.run(cypher_1, excluded_source=EXCLUDED_SOURCE)
excl_source_id = result_1.to_df()
# Cast to a plain Python int so it can be passed back as a query parameter
EXCLUDED_SOURCE_ID = int(excl_source_id.loc[0, "excluded_source_id"])
# Cypher calculating personalized PageRank; cypher_2 also needs $excluded_source_id
result_2 = n4j.run(
    cypher_2,
    excluded_source=EXCLUDED_SOURCE,
    excluded_source_id=EXCLUDED_SOURCE_ID,
    graph_proj_name=PROJECTION_NAME_1,
)
# Line that takes over a minute to run!
df = result_2.to_df()
The output looks something like this:
print(df)
   gene_identifier     score
0              APC  0.093877
In the past, on a different graph, this took a fraction of a second:
Neo4j v5.21.0 and a smaller graph (1.3k nodes and 19k edges).
Edit: With the smaller graph, the line also runs quickly in v 5.23.0.
Why is it taking so long? I have tried returning just the score as a float, and I also tried filtering before the RETURN within the Cypher query, as you can see in the code, but neither changes anything.
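One way to narrow this down, as a minimal sketch rather than a confirmed diagnosis: as far as I understand, the driver streams records lazily, so part of the query's work may only happen once the result is consumed, which is what to_df() does. Timing the record fetch on its own would show whether the time goes into the query or into building the DataFrame. The names `fetch_timed` and `fake_result` are hypothetical and made up for this example; `fake_result` only stands in for the result of running cypher_2 so that the snippet runs without a database.

```python
import time


def fetch_timed(result):
    """Pull every record from an iterable result and time how long that takes.

    `result` is assumed to be iterable record by record, like a neo4j.Result.
    """
    start = time.perf_counter()
    records = list(result)  # forces all records to be fetched
    elapsed = time.perf_counter() - start
    return records, elapsed


def fake_result():
    """Hypothetical stand-in for a lazily streamed query result."""
    time.sleep(0.05)  # simulates work that only happens during consumption
    yield {"gene_identifier": "APC", "score": 0.093877}


records, elapsed = fetch_timed(fake_result())
print(f"fetched {len(records)} record(s) in {elapsed:.3f} s")
```

With the real session, the result of running cypher_2 would be passed to `fetch_timed` instead of `fake_result()`. If the fetch alone accounts for most of the ~70 s, the DataFrame conversion itself is probably not the bottleneck.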
Any help is appreciated.