Query optimization problem on a billion-scale graph

Hi everyone!

I need to fetch the graph structure (only the node types and edge types; no other node information is needed) so that I can reconstruct edge lists from it and train a machine learning model on top of them. Our graph database contains 16 billion nodes and 88 billion edges, and my query takes very long to run.
Here is my query:

MATCH (n:Species)
WHERE n.name = 'Pisum_sativum' OR n.name = 'Pisum sativum'
OPTIONAL MATCH (n)-[r1]-(hop1)
WHERE (hop1:Gene OR hop1:Metabolite OR hop1:Pathway OR hop1:Protein)
OPTIONAL MATCH (hop1)-[r2]-(hop2)
WHERE (hop2:Gene OR hop2:Metabolite OR hop2:Pathway OR hop2:Protein)
OPTIONAL MATCH (hop2)-[r3]-(hop3)
WHERE (hop3:Gene OR hop3:Metabolite OR hop3:Pathway OR hop3:Protein)
OPTIONAL MATCH (hop3)-[r4]-(hop4)
WHERE (hop4:Gene OR hop4:Metabolite OR hop4:Pathway OR hop4:Protein)
RETURN n, hop1, hop2, hop3, hop4, r1, r2, r3, r4
SKIP {order * batch_size} LIMIT {batch_size}

A quick explanation of my query: I want to get the subgraph formed by the 4-hop neighbourhood of the species with the given name. In addition, I only want nodes of type Gene, Metabolite, Pathway, or Protein. I am using multiple threads to fetch data from the database in parallel. To do that, I use skip and limit to break the query results into smaller chunks, with each thread executing the query with a different value of order. I keep fetching until len(results) == 0, where results is the list of results returned by a query. I am attaching the query plan below for further information:


*Please ignore node types other than Gene, Metabolite, Pathway, or Protein.

I am trying to speed up the query by creating an index on the name property of Species nodes and by using threads for parallel fetching. But there is a weird pattern in the results: as order increases (meaning I skip more of the data already returned by previous queries), the total number of new relations (relations not returned by previous queries) returned by 30 queries (each executed by a different thread with a different value of order) decreases. In other words, later queries return repeated results from previous queries. My fetching process has been running for more than 12 hours. I am not sure whether the problem is my query or my parallel approach. I would very much appreciate it if anyone could help me improve my query or correct my approach.
Thanks a lot in advance.

@Bao

Neo4j Version?

See MATCH in the Cypher Manual. I am not sure how much impact it will have, if any, but you can rewrite

OPTIONAL MATCH (n)-[r1]-(hop1)
WHERE (hop1:Gene OR hop1:Metabolite OR hop1:Pathway OR hop1:Protein)

to

OPTIONAL MATCH (n)-[r1]-(hop1:Gene|Metabolite|Pathway|Protein)

Thanks for your response! I am currently using version 5.16.0.

I think you will be much better off with an APOC path procedure. apoc.path.subgraphAll returns the lists of nodes and relationships in the subgraph, which is what you want.
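To make the suggestion concrete, here is a minimal sketch of how it could look from Python. It assumes APOC is installed on the server and that `session` behaves like a session from the official neo4j Python driver. The function name `fetch_edge_list` and the returned column names are my own choices for illustration, not something from the original post. Note that subgraphAll returns every relationship between the nodes it reaches, which may be slightly more than the hop-by-hop query would list.

```python
# Sketch only: the labels and the 4-hop depth are taken from the question above.
SUBGRAPH_QUERY = """
MATCH (n:Species)
WHERE n.name IN $names
CALL apoc.path.subgraphAll(n, {
    maxLevel: 4,
    labelFilter: '+Gene|+Metabolite|+Pathway|+Protein'
})
YIELD relationships
UNWIND relationships AS r
RETURN DISTINCT
    elementId(startNode(r)) AS src_id,
    labels(startNode(r))    AS src_labels,
    type(r)                 AS rel_type,
    elementId(endNode(r))   AS dst_id,
    labels(endNode(r))      AS dst_labels
"""


def fetch_edge_list(session, names):
    """Return (src_id, src_labels, rel_type, dst_id, dst_labels) tuples.

    `session` is assumed to expose run(query, **params) and to yield
    records that can be indexed by column name, as the neo4j driver does.
    """
    edges = []
    for record in session.run(SUBGRAPH_QUERY, names=names):
        edges.append((
            record["src_id"],
            tuple(record["src_labels"]),
            record["rel_type"],
            record["dst_id"],
            tuple(record["dst_labels"]),
        ))
    return edges
```

You would call it as `fetch_edge_list(session, ["Pisum_sativum", "Pisum sativum"])`. Since each relationship comes back once, there is no need to page through the result with skip and limit just to remove repeats.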

Your Cypher query is going to produce a massive number of rows containing duplicate data.
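A small toy example may help show where the duplication comes from. The query returns one row per path, so a relationship near the start node is repeated in every row whose path passes through it. The ids and relationship types below are made up for illustration, and `edges_from_path_rows` is a hypothetical helper, not part of any library.

```python
def edges_from_path_rows(rows):
    """Collapse path-shaped rows into unique (src, rel_type, dst) edges.

    Each row is a list of edge triples, one per hop, mirroring the
    r1..r4 columns of the RETURN clause. Returns (unique_edges, mentions).
    """
    unique = set()
    mentions = 0
    for row in rows:
        for edge in row:
            if edge is None:  # OPTIONAL MATCH yields null for a missing hop
                continue
            mentions += 1
            unique.add(edge)
    return unique, mentions


# Toy data: one species, two genes, each gene linked to the same two proteins.
rows = [
    [("s", "HAS", "g1"), ("g1", "ENCODES", "p1")],
    [("s", "HAS", "g1"), ("g1", "ENCODES", "p2")],
    [("s", "HAS", "g2"), ("g2", "ENCODES", "p1")],
    [("s", "HAS", "g2"), ("g2", "ENCODES", "p2")],
]
unique, mentions = edges_from_path_rows(rows)
print(f"{mentions} edge mentions, {len(unique)} unique edges")
# prints "8 edge mentions, 6 unique edges"
```

Even in this two-hop toy case a quarter of what is returned is repetition, and the ratio gets worse with each extra hop, because every additional branch repeats all the relationships before it.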

Thank you so much for your help @dana_canzano and @glilienfield, the query is now much more efficient.
