PageRank on subgraph of large graph

apoc
procedure
(Anvvee) #1

I am working with a large graph of Wikipedia topics with more than 16 million nodes and 400 million edges. I am trying to find the important topics within a given set of topics, so I want to run PageRank on a subgraph of the Wikipedia graph. Running PageRank via a MATCH Cypher query takes a long time (> 5 minutes). How can it be optimized? See the first comment below.

Is there a way to get a subgraph of the whole graph? For example, if I have the titles of some topics, I want to get all matching nodes and the relationships among them.


(Anvvee) #2

Currently, for PageRank on a subgraph containing the topics Sport_utility_vehicle, Sports_car, and Luxury_vehicle, I am using the query below, which is very slow.

CALL algo.pageRank.stream(
  'MATCH (t:TOPIC) WHERE t.title = "Sport_utility_vehicle" OR t.title = "Sports_car" OR t.title = "Luxury_vehicle" RETURN id(t) AS id',
  'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC) RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher'}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score


(Morenobonaventura) #3

Have you tried using an index on the "title" property? Note that labels are case-sensitive, so the index must be created on :TOPIC (the label used in your queries), not :Topic:

CREATE INDEX ON :TOPIC(title)

You can check whether the index is ready or still populating by running the :schema command in the Neo4j Browser.
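
Depending on your Neo4j version, the index state can also be inspected from Cypher rather than the Browser. A sketch, assuming a 3.x server where the `db.indexes` procedure is available:

```cypher
// Lists all indexes; the state column reads ONLINE once an index is
// ready, or POPULATING while it is still being built.
CALL db.indexes()
```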


(Anvvee) #4

Yes, I've created an index on the "title" property, but PageRank still takes a lot of time. I don't know why.


(Morenobonaventura) #5

How many nodes and relationships are you working with? Is the data all cached in RAM? How big is the machine hosting the DB?

The first thing I would try to understand is whether the slow part is the data load, the actual execution of PageRank, or the sorting done at the end.

Can you check the neo4j.log and debug.log files and share the relevant sections here?


(Anvvee) #6

Hi, my whole graph has 16 million nodes and 500 million relationships, but for PageRank I need to query a subgraph consisting of fewer than 100 nodes and 10,000 relationships. The machine hosting the DB has 120 GB of RAM. Data loading is fast; the problem mainly lies in the execution of PageRank, or perhaps in that I am not able to extract the subgraph efficiently. Do you know of any way to get a subgraph of a few nodes efficiently?


(Morenobonaventura) #7

Ok, that makes things clearer.

I think you could speed things up by using the index on the title property in the second (relationship) query as well.

CALL algo.pageRank.stream(
  'MATCH (t:TOPIC) WHERE t.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] RETURN id(t) AS id',
  'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC) WHERE t1.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher'}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score

Two more suggestions:

  • Have a look at the file $NEO_HOME/logs/neo4j.log; it provides timings of the various phases of the execution (loading, computing, writing).
  • Check the query execution plan by running each query separately, prepended with the keyword EXPLAIN. It will show whether indexes are actually used during query execution.
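
For example, the node-loading query can be checked like this; if the index on :TOPIC(title) is being used, the plan should contain a NodeIndexSeek operator rather than a NodeByLabelScan:

```cypher
// EXPLAIN shows the planned operators without executing the query
EXPLAIN MATCH (t:TOPIC)
WHERE t.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]
RETURN id(t) AS id
```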

(Anvvee) #8

I did the same trick to speed up the query. Is there some other way, so that I don't have to repeat the same conditions on the edges (i.e. that they belong to the set of nodes)? I'll have a look at your suggestions tomorrow and get back to you.
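
One way to avoid repeating the literal list is to pass it once as a query parameter. This is a sketch, assuming your version of the graph algorithms library forwards the `params` map in the config to both Cypher projection queries (the `titles` name is illustrative):

```cypher
// Hypothetical: the title list is defined once in `params` and
// referenced as $titles in both the node and relationship queries.
CALL algo.pageRank.stream(
  'MATCH (t:TOPIC) WHERE t.title IN $titles RETURN id(t) AS id',
  'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC)
   WHERE t1.title IN $titles AND t2.title IN $titles
   RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher', params: {titles: ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]}}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score
```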


(Anvvee) #9

Do you know how this query can be modified to include not only the nodes with t.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] but also their adjacent nodes, i.e. those one hop away?
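
One possible sketch (not from the thread's answers): expand each seed topic by a variable-length pattern `*0..1`, so the node set contains the seeds themselves plus their direct neighbours, and drive the relationship query from the same expanded set:

```cypher
// Sketch: seeds plus their 1-hop :hasLink neighbours form the node set;
// the relationship query then expands edges from that same set. Whether
// edges pointing to nodes outside the set are skipped or cause an error
// depends on the library version, so check the loading phase in neo4j.log.
CALL algo.pageRank.stream(
  'MATCH (seed:TOPIC)
   WHERE seed.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]
   MATCH (seed)-[:hasLink*0..1]-(t:TOPIC)
   RETURN DISTINCT id(t) AS id',
  'MATCH (seed:TOPIC)
   WHERE seed.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]
   MATCH (seed)-[:hasLink*0..1]-(t1:TOPIC)
   WITH DISTINCT t1
   MATCH (t1)-[:hasLink]-(t2:TOPIC)
   RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher'}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score
```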
