PageRank on subgraph of large graph

apoc
procedure
(Anvvee) #1

I am working with a large graph of Wikipedia topics with more than 16 million nodes and 400 million edges. I am trying to find the important topics within a given set of topics, so I want to run PageRank on a subgraph of the Wikipedia graph. Running PageRank via a MATCH Cypher query takes a long time (> 5 minutes). How can it be optimized? See the first comment below.

Is there a way to get a subgraph of the whole graph? For example, if I have the titles of some topics, I want to get all matching nodes and the relationships among them.


(Anvvee) #2

Currently, for PageRank on a subgraph containing the topics Sport_utility_vehicle, Sports_car, and Luxury_vehicle, I am using the query below, which is very slow.

CALL algo.pageRank.stream(
  'MATCH (t:TOPIC) WHERE t.title = "Sport_utility_vehicle" OR t.title = "Sports_car" OR t.title = "Luxury_vehicle" RETURN id(t) AS id',
  'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC) RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher'}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score


(Morenobonaventura) #3

Have you tried using an index on the "title" property? Note that labels are case-sensitive, so the index must be created on :TOPIC (the label used in your queries), not :Topic:

CREATE INDEX ON :TOPIC(title)

You can check whether the index is ready or still populating by running the :schema command in the Neo4j Browser.
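
Depending on your Neo4j version, the index state can also be inspected from Cypher rather than the Browser. A sketch, assuming a 3.x server where the `db.indexes` procedure is available:

```cypher
// Lists all indexes; the state column reads ONLINE once an index is
// ready, or POPULATING while it is still being built.
CALL db.indexes()
```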


(Anvvee) #4

Yes, I've created an index on the "title" property, but PageRank still takes a lot of time. I don't know why.


(Morenobonaventura) #5

How many nodes and relationships are you working with? Is the data all cached in RAM? How big is the machine hosting the DB?

The first thing I would try to understand is whether the slow part is the data load, the actual execution of PageRank, or the sorting done at the end.

Can you check the neo4j.log and debug.log files and share the relevant sections here?


(Anvvee) #6

Hi, my whole graph has 16 million nodes and 500 million relationships, but for PageRank I need to query a subgraph consisting of fewer than 100 nodes and 10,000 relationships. The machine hosting the DB has 120 GB of RAM. Data loading is fast; the problem mainly lies in the execution of PageRank, or perhaps in that I am not able to extract the subgraph efficiently. Do you know of any way to get a subgraph of a few nodes efficiently?


(Morenobonaventura) #7

Ok, that makes things clearer.

I think you could speed things up by using the index on the title property in the second (relationship) query as well.

CALL algo.pageRank.stream(
  'MATCH (t:TOPIC) WHERE t.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] RETURN id(t) AS id',
  'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC) WHERE t1.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher'}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score

Two more suggestions:

  • Have a look at the file $NEO_HOME/logs/neo4j.log; it provides timings of the various phases of the execution (loading, computing, writing).
  • Check the query execution plan by running each query separately, prepended with the keyword EXPLAIN. It will show whether indexes are actually used during query execution.
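
For example, the node-loading query can be checked like this; if the index on :TOPIC(title) is being used, the plan should contain a NodeIndexSeek operator rather than a NodeByLabelScan:

```cypher
// EXPLAIN shows the planned operators without executing the query
EXPLAIN MATCH (t:TOPIC)
WHERE t.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]
RETURN id(t) AS id
```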

(Anvvee) #8

I did the same trick to speed up the query. Is there some other way, so that I don't have to repeat the same conditions on the edges (i.e. that they belong to the set of nodes)? I'll have a look at your suggestions tomorrow and get back to you.
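
One way to avoid repeating the literal list is to pass it once as a query parameter. This is a sketch, assuming your version of the graph algorithms library forwards the `params` map in the config to both Cypher projection queries (the `titles` name is illustrative):

```cypher
// Hypothetical: the title list is defined once in `params` and
// referenced as $titles in both the node and relationship queries.
CALL algo.pageRank.stream(
  'MATCH (t:TOPIC) WHERE t.title IN $titles RETURN id(t) AS id',
  'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC)
   WHERE t1.title IN $titles AND t2.title IN $titles
   RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher', params: {titles: ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]}}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score
```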


(Anvvee) #9

Do you know how this query can be modified to include not only the nodes with t.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] but also their adjacent nodes, i.e. those one hop away?
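
One possible sketch (not from the thread's answers): expand each seed topic by a variable-length pattern `*0..1`, so the node set contains the seeds themselves plus their direct neighbours, and drive the relationship query from the same expanded set:

```cypher
// Sketch: seeds plus their 1-hop :hasLink neighbours form the node set;
// the relationship query then expands edges from that same set. Whether
// edges pointing to nodes outside the set are skipped or cause an error
// depends on the library version, so check the loading phase in neo4j.log.
CALL algo.pageRank.stream(
  'MATCH (seed:TOPIC)
   WHERE seed.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]
   MATCH (seed)-[:hasLink*0..1]-(t:TOPIC)
   RETURN DISTINCT id(t) AS id',
  'MATCH (seed:TOPIC)
   WHERE seed.title IN ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"]
   MATCH (seed)-[:hasLink*0..1]-(t1:TOPIC)
   WITH DISTINCT t1
   MATCH (t1)-[:hasLink]-(t2:TOPIC)
   RETURN id(t1) AS source, id(t2) AS target',
  {graph:'cypher'}
) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS node, score
ORDER BY score DESC LIMIT 10
RETURN node, score
```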
