Billion graph query optimize problem

Bao · September 20, 2024, 4:53pm

Hi everyone!

I need to fetch the graph structure (only the node types and edge types; no other node information is needed), which I use to reconstruct edge lists and train a machine learning model on top of them. Our graph database contains 16 billion nodes and 88 billion edges, and my query takes very long to run.
Here is my query:

match (n:Species)

where n.name= 'Pisum_sativum' or n.name= 'Pisum sativum'

optional match (n)-[r1]-(hop1)

where (hop1:Gene or hop1:Metabolite or hop1:Pathway or hop1:Protein)

optional match (hop1)-[r2]-(hop2)

where (hop2:Gene or hop2:Metabolite or hop2:Pathway or hop2:Protein)

optional match (hop2)-[r3]-(hop3)

where (hop3:Gene or hop3:Metabolite or hop3:Pathway or hop3:Protein)

optional match (hop3)-[r4]-(hop4)

where (hop4:Gene or hop4:Metabolite or hop4:Pathway or hop4:Protein)

RETURN n, hop1, hop2, hop3, hop4, r1, r2, r3, r4

skip {order * batch_size} limit {batch_size}

A quick explanation of my query: I want to get the subgraph formed by the 4-hop neighbourhood of the species with the given name. In addition, I only want nodes of type Gene, Metabolite, Pathway or Protein. I am using multiple threads to fetch data from the database in parallel. To do that, I use skip and limit to break the query results into smaller chunks, with each thread executing the query with a different value of order. I keep fetching until len(results) == 0, where results is the list of rows returned by a query. I am attaching the query plan below for further information:

*Please ignore node types other than Gene, Metabolite, Pathway or Protein.

I am trying to speed up the query by creating an index on the name of Species nodes and by using threads for parallel fetching. But there is a weird pattern in the results: as order increases (meaning I skip more rows that were already returned by previous queries), the total number of new relations (relations not returned by previous queries) returned by a batch of 30 queries (each executed by a different thread with a different value of order) drops. In other words, later queries return repeated results from previous queries. My fetching process has been running for more than 12 hours. I am not sure whether the problem is my query or my parallel approach. I would very much appreciate it if anyone could help me improve my query or correct my approach.
Thanks a lot in advance.

dana_canzano · September 20, 2024, 7:02pm

@Bao

Neo4j Version?

See MATCH in the Cypher Manual. I am not sure how much impact it will have, if any, but you can rewrite

optional match (n)-[r1]-(hop1)

where (hop1:Gene or hop1:Metabolite or hop1:Pathway or hop1:Protein)

to

optional match (n)-[r1]-(hop1:Gene|Metabolite|Pathway|Protein)
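As a rough sketch of that rewrite applied to all four hops, the Python below builds the query text with the label expression inlined into each pattern. The function name build_query and its arguments are made up for illustration; the labels and variable names come from the original query, and the species names are assumed to be passed as a $names parameter instead of being hard-coded.

```python
# Labels taken from the original query.
LABELS = ["Gene", "Metabolite", "Pathway", "Protein"]


def build_query(hops=4, labels=LABELS):
    """Return Cypher text for the n-hop query using label expressions (Neo4j 5 syntax).

    build_query is a hypothetical helper, not part of any Neo4j library.
    """
    label_expr = "|".join(labels)
    lines = ["MATCH (n:Species)", "WHERE n.name IN $names"]
    prev = "n"
    for i in range(1, hops + 1):
        # The label filter sits inside the pattern instead of a separate WHERE.
        lines.append(f"OPTIONAL MATCH ({prev})-[r{i}]-(hop{i}:{label_expr})")
        prev = f"hop{i}"
    returned = ["n"]
    returned += [f"hop{i}" for i in range(1, hops + 1)]
    returned += [f"r{i}" for i in range(1, hops + 1)]
    lines.append("RETURN " + ", ".join(returned))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_query())
```

This only changes how the label filter is written; the shape of the result rows is the same as in the original query.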

Bao · September 20, 2024, 8:09pm

Thanks for your response! I am currently using version 5.16.0.

glilienfield · September 21, 2024, 1:26am

I think you will be much better off with an APOC path method. apoc.path.subgraphAll will return the list of nodes and relationships, which is what you want.

The Cypher query is going to produce a massive number of rows containing duplicate data.
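A minimal sketch of what that could look like from Python, assuming the APOC plugin is installed. apoc.path.subgraphAll and its maxLevel and labelFilter options are part of APOC; the $names parameter, the helper to_edge_list and the sample rows are hypothetical. The query returns one row per relationship in the subgraph instead of one row per path, and the helper drops any relationship that still shows up more than once (for example when both species names match).

```python
# Cypher to run through the driver; $names is an assumed parameter,
# e.g. ["Pisum_sativum", "Pisum sativum"].
SUBGRAPH_QUERY = """
MATCH (n:Species)
WHERE n.name IN $names
CALL apoc.path.subgraphAll(n, {
  maxLevel: 4,
  labelFilter: '+Gene|+Metabolite|+Pathway|+Protein'
})
YIELD relationships
UNWIND relationships AS r
RETURN elementId(r) AS rel_id,
       elementId(startNode(r)) AS src_id, labels(startNode(r)) AS src_labels,
       type(r) AS rel,
       elementId(endNode(r)) AS dst_id, labels(endNode(r)) AS dst_labels
"""


def to_edge_list(rows):
    """Collapse result rows (dicts keyed by the RETURN aliases) into unique typed edges."""
    seen = set()
    edges = []
    for row in rows:
        if row["rel_id"] in seen:
            continue  # same relationship already emitted
        seen.add(row["rel_id"])
        edges.append((
            row["src_id"], tuple(row["src_labels"]),
            row["rel"],
            row["dst_id"], tuple(row["dst_labels"]),
        ))
    return edges


if __name__ == "__main__":
    # Made-up rows standing in for driver results.
    sample = [
        {"rel_id": "r1", "src_id": "a", "src_labels": ["Gene"], "rel": "ENCODES",
         "dst_id": "b", "dst_labels": ["Protein"]},
        {"rel_id": "r1", "src_id": "a", "src_labels": ["Gene"], "rel": "ENCODES",
         "dst_id": "b", "dst_labels": ["Protein"]},
    ]
    print(to_edge_list(sample))
```

Because the expansion visits each node once, there should be no need to page with skip and limit across threads just to work around repeated path rows.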

Bao · September 24, 2024, 2:45pm

Thank you so much for your help @dana_canzano and @glilienfield, the query is now much more efficient.
