Querying a large graphDb

Hi, great minds! I am new to neo4j and currently exploring an existing graph to extract data for downstream tasks.

I would like to get all pairs of nodes and their relationships from the graph.

MATCH (n)-[r]-(n1) WHERE n<>n1 AND n1>n RETURN *

This returns an estimated 12,726,288 rows.

Instead, I decided to extract the pairwise information between 2 node types:

MATCH (n:Node{type:nodetypeA})-[r]-(n1: Node{type:nodetypeB}) WHERE n<>n1 AND id(n)<id(n1) RETURN *

with an estimated 653,022 rows; sadly, Neo4j keeps timing out. I have increased the connection timeout (ms) in the Neo4j Browser, yet nothing changes.

Any suggestion will be highly appreciated.

Hello @wumirose !

Why do you think this may be a timeout problem? This may be a Desktop OOM rendering problem. You may be asking for too much info to be displayed. Have you tried a driver of your preference? Personally, I have used SDN6 with WebFlux without problems.

Bennu

I agree; probably the 653,022 rows are too much to extract.

Thanks a lot for your suggestion about SDN6 and WebFlux. It's my first time learning a bit about reactive programming. However, it appears the reactive clients provide no support for Python, and I currently run my queries in a Python environment and connect to Neo4j with py2neo.

Any further suggestions will be greatly appreciated.

Hi @wumirose

Have you tried the Python driver at https://neo4j.com/docs/api/python-driver/current/api.html#graphdatabase ? This one should stream the results, AFAIK.

Try something like

from neo4j import GraphDatabase

user = "yourUsername"
password = "yourPassword"
uri = "yourUri"
driver = GraphDatabase.driver(uri, auth=(user, password))
with driver.session() as session:
    result = session.run("MATCH (n:Node{type:nodetypeA})-[r]-(n1:Node{type:nodetypeB}) WHERE n<>n1 AND id(n)<id(n1) RETURN n as node1, n1 as node2, r as rel")
    # Records are consumed one at a time as the loop iterates over the result
    for record in result:
        print("node1 {}".format(record["node1"]))
# Release the driver's connections once done
driver.close()
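
If printing every row is too noisy, the same loop can write straight to a CSV file. The following is only a rough sketch under some assumptions: the function name `export_pairs`, the parameter names `$type_a` / `$type_b`, and the choice of returning `n.name` and `type(r)` are all hypothetical, so adjust them to your data. It takes any session-like object with a `run` method, so it does not open a connection by itself.

```python
import csv

# Hypothetical query: parameters replace the hard-coded node types,
# and only scalar values are returned so each row is easy to write out.
PAIR_QUERY = """
MATCH (n:Node {type: $type_a})-[r]-(n1:Node {type: $type_b})
WHERE id(n) < id(n1)
RETURN n.name AS name1, n1.name AS name2, type(r) AS rel_type
"""


def export_pairs(session, type_a, type_b, out):
    """Write one CSV row per record to `out`; return the number of rows written."""
    writer = csv.writer(out)
    writer.writerow(["name1", "name2", "rel_type"])
    count = 0
    # Iterate the result directly instead of collecting it into a list first.
    for record in session.run(PAIR_QUERY, type_a=type_a, type_b=type_b):
        writer.writerow([record["name1"], record["name2"], record["rel_type"]])
        count += 1
    return count


# Possible usage with a real driver (not executed here):
#   with driver.session() as session, open("pairs.csv", "w", newline="") as f:
#       export_pairs(session, "typeA", "typeB", f)
```

Passing the node types as parameters also avoids having to quote them inside the query string.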

Lemme know how it goes

The API does the trick! I'm so happy right now.

Thank you so much @bennu_neo, for your help. It means a lot!

@wumirose you are welcome! Enjoy it!

I noticed that the query gives only the first-order connection between 2 nodes; however, I will need at least the second-order relationship for my downstream application. I have tried:

MATCH (n)-[r*1..2]-(n1) 
WHERE n<>n1 AND id(n)<id(n1) 
WITH n.name as Name1, n1.name as Name2, r AS rel
UNWIND rel AS rl
RETURN Name1, Name2, rl.id AS relId

which is estimated to return about 2 million rows.

I would like to skip some rows so I can eventually end up with fewer than 1 million (~500,000). I have experimented with clauses like SKIP and LIMIT, but I can't seem to get a helpful result.

Your suggestions will be greatly appreciated.

Hi @wumirose !

This is technically another question, but let's do it :smile:

In general, I don't agree with streaming a whole db export like this, but if it works for you, that's fine. Can you try a query like this?

MATCH p = (n)-[*1..2]-(n1) 
WHERE id(n)<id(n1) 
WITH n.name as Name1, n1.name as Name2, relationships(p) as rel
SKIP 10
LIMIT 10
UNWIND rel AS rl
RETURN Name1, Name2, rl.id AS relId

Keep in mind that SKIP and LIMIT apply at the *WITH* step, so they count paths, not the rows that UNWIND produces afterwards.
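
To make that concrete, here is a small pure-Python imitation of the two orderings. It does not talk to Neo4j at all; the `paths` data and the helper names `unwind` and `skip_limit` are made up purely for illustration.

```python
from itertools import islice

# Hypothetical paths: (Name1, Name2, relationship ids along the path)
paths = [
    ("A", "B", [1]),
    ("A", "C", [2, 3]),
    ("B", "D", [4, 5]),
    ("C", "D", [6]),
]


def unwind(rows):
    """Mimic UNWIND: one output row per relationship id in each path."""
    for name1, name2, rel_ids in rows:
        for rel_id in rel_ids:
            yield (name1, name2, rel_id)


def skip_limit(rows, skip, limit):
    """Mimic SKIP/LIMIT on a stream of rows."""
    return islice(rows, skip, skip + limit)


# SKIP 1 LIMIT 2 at the WITH step: two paths are kept, then expanded.
paths_first = list(unwind(skip_limit(paths, 1, 2)))

# SKIP 1 LIMIT 2 after UNWIND: exactly two final rows are kept.
rows_first = list(skip_limit(unwind(paths), 1, 2))

print(len(paths_first), len(rows_first))  # 4 2
```

So with SKIP and LIMIT placed at the WITH step, LIMIT 10 can give you more than 10 rows in the end, because every two-hop path contributes two relationships.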

Actually, the match is between 2 node types, not the whole db😉.

MATCH p = (n:Node {type: 'typeA'})-[*1..2]-(n1:Node {type: 'typeB'})
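
Put together with the SKIP/LIMIT suggestion, fetching one window of paths from Python might look roughly like the sketch below. The name `fetch_batch` and the parameter names (`$type_a`, `$type_b`, `$skip`, `$limit`) are hypothetical. One caveat: without an ORDER BY, Cypher does not guarantee which paths land in a given window, so repeated runs may not line up exactly.

```python
# Hypothetical batch query: SKIP/LIMIT are applied to paths at the WITH step,
# then each kept path is expanded into one row per relationship.
BATCH_QUERY = """
MATCH p = (n:Node {type: $type_a})-[*1..2]-(n1:Node {type: $type_b})
WHERE id(n) < id(n1)
WITH n.name AS Name1, n1.name AS Name2, relationships(p) AS rel
SKIP $skip LIMIT $limit
UNWIND rel AS rl
RETURN Name1, Name2, rl.id AS relId
"""


def fetch_batch(session, type_a, type_b, skip, limit):
    """Run the batch query and return (Name1, Name2, relId) tuples."""
    result = session.run(
        BATCH_QUERY, type_a=type_a, type_b=type_b, skip=skip, limit=limit
    )
    return [(r["Name1"], r["Name2"], r["relId"]) for r in result]


# Possible usage with a real driver (not executed here):
#   with driver.session() as session:
#       rows = fetch_batch(session, "typeA", "typeB", skip=0, limit=500000)
```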

This is so helpful! I hadn't explored using SKIP and LIMIT before RETURN. Thanks a bunch for that.