Query: Matching subgraph recursively

Hello dear community!

the graph of my database has a hierarchical tree-like structure with subgraphs connected to the leafs. Now I want to match all nodes and edges in one of these subgraphs, which have arbitrary depth, labels, properties and edge direction (but there is certainly no label equal to the leafs label). For this reason it is necessary to match such a subgraph recursively. The only information given is the leaf and the node connected to the leaf in the parent hierarchy level. So let's assume one path in the tree-like structure looks as following:

(:Project {name:"Project1"})-->(:Submission {name:"Submission"})-->......

where ..... denotes the whole subgraph I want to match. The submission node would be the leaf in this case.

I've tried a lot, but in the following I will show the two solutions which seem to me best by now with a few thoughts:

MATCH (:Project {name:"Project1"})-->(s:Submission {name:"Submission"})-[*]-(x)
WHERE NOT x:Project AND NOT x:Submission

the WHERE clause is necessary as (I guess) the query will otherwise return to the submission node and recursively search the whole graph. Nevertheless, this query does not terminate, although the subgraph has only about 200 graphs. Very interesting is, that if I specifiy a depth, such as [*1..12] the query will terminate and deliver the same result as for [*1..13]. I can't explain myself why the limitless recursion wouldn't work then.

The other query I tried uses a procedure:

'''MATCH (:Project {name:"Project1"})-->(s:Submission {name:"Submission"})
CALL apoc.path.subgraphAll(s, {labelFilter: "-Submission-Project"}) yield nodes
return s, nodes'''

This query won't terminate as well... Is there anything I'm missing out here?

Thanks a lot!

Best wishes,

  1. Negation in Cypher is particularly expensive, as it requires first load all potential matches, then removing the negatives.
  2. The tie-fighter )-[*]-( will attempt to load every possible path in your graph.

Designing a graph is a critical step in ensure usable and performant queries. In this case, I strongly advise following a standard for relationship directionality, and applying a generic label to nodes that are part of your, "subgraph." At the very least, we will need a much more clear picture of your data, and your desired results, to be more helpful. Additionally, please attempt to use valid cypher, within code-blocks (```) so your posts are more legible.

First, we'll clean up the graph a bit. Add a generic label, and fix the unknown directionality after :Submission.

WHERE NOT n :Project AND NOT n:Submission
SET n :SubGraph;

MATCH (a:Submission)-[r]-(b:SubGraph)
CREATE (a)-[rNew:NEWREL]->(b)
SET rNew = r

Now, let's try that query again:

MATCH (:Project {name:"Project1"})-[]->(s:Submission {name:"Submission"})-[*]->(x:SubGraph)

That should be much better, but won't quite get you your whole graph, so let's try for something a bit cleaner. Note, why are you matching by project-name and submission-name, and doing nothing with the project? I'll just work from project for now.

MATCH (p:Project {name:"Project1"})-[]->(s:Submission)
WITH p, s
MATCH path=(s)-[]->(x:SubGraph)-[*]-(y:SubGraph)
1 Like

Thank you very much for your reply!

Adding a new label to the whole subgraph has worked out, but seems rather unelegant to me... It's a pity if it wouldn't work out otherwise.
Unfortunately, I won't be able to change the unknown directionality in the subgraphs. These subgraphs are abstract syntax graphs of java code, which means I just use them without any modification.

I'm working with the project and submission name, as there may be many submissions in a certain project. For identifiying the graph of a particular abstract syntax graph it is therefore necessary to match both, project and submission. But your solution of not using the project for path matching seems more elegant and probably allows for faster processing.

Thanks again,

Code is always directional. (:Project)<-[:IN_PROJECT]-(:Component)<-[:DEPENDED_ON]-(:Class)<-[:CALLED_BY]-(:Function)
I'm guessing you mean the graph is automatically generated? If so, I'd love to see what's doing that, and apply it to my own code.

The key to efficient queries, is to reduce the results as early as possible, and avoid N:N paths.

Our whole product is based on this exact concept ( VeriPrism.com) but I can help with a few suggestions. We are pushing around some very large graphs, so we have some practice.

  1. Create the default NodeType. Its not obvious but you will need it later. Its going to get very useful. It also helps to have every node with at least one type defined.
  2. We added compileunit as an attribute so that we could subset the whole graph. At any given time we can have over a 1000 programs. that will cut your operating scope down to less nodes
  3. depending on how you generated your AST, you have directionality. Most produce a .dot file (Graphviz) that has a nodeid->nodeid type syntax. ASTs typically produce the symbol table first (nodes), then the heirarchy. What product did you use to do that.
  4. We added a release code, and state code for each root node. That allowed us to match two patterns. Assuming what you are doing is tracking pattern changes between releases you will need both of those.

To the extent my boss doesn't object, I can probably help with some other things. If you can share your node structure it would help. Or check our VeriPrism.com for some examples.

1 Like

You need to turn on certbot-auto, or setup a cron job.

You're the founder and CTO, you are the boss! ^_^

I seem to have expressed myself unclearly regarding the directionality. Of course there is always a direction between nodes, but which direction can be declared unknown while executing a query on the graph.

The abstract syntax graph is transformed using the AST according to this paper: 10.1007/978-3-319-08657-6_10
The used structure is called a TGraph. Unfortunately, I can't share the software used, as it is still in research of our university.

Thanks a lot for all the recommendations! What I've done is to actually use the software by our university, which generates such a TGraph from program code, and simply translate it into an appropriate structure for Neo4j. So the labels and attributes are already well-defined and sufficient for all kinds of queries. The only challenge is actually the performance optimization in Neo4j, in order to execute queries on arbitrary amounts of graph represantations of programs.
But thanks to your help I've already added an additional label to every node in a subgraph for more efficient matching. Thanks a lot!

You would think so but even I have people who chase me around. LOL

Fair enough. It is a space that interests me (obviously) and while we have a lot of options, people always come up with new ideas. We can create an amazon instance pretty quick or let you into the dev research db. Or I can dump one of the test programs we use so you can see the approach and borrow as you see fit. At any rate, drop me a line. Bill Curtis ( head of CISQ) uses that environment so you may end up with a decent reference as well ( Wiki Bill Curtis and OMG)

That should be much better, but won't quite get you your whole graph, so let's try for something a bit cleaner. Note, why are you matching by project-name and submission-name, and doing nothing with the project? I'll just work from project for now.

So I've now followed your approach and labeled every node in such a subgraph with TGraph. Unfortunately, the query still won't terminate:

MATCH (:Project {name:"Project1}) -->(s:Submission)-->(t:TypeDeclaration)
MATCH path = (s)-->(t) -[*]-(x:TGraph)

Where TypeDeclaration is just the root node of a class. The TypeDeclaration already posseses the TGraph label, while the Submission labeled nodes don't. Adding the TGraph label doesn't seem to have improved the processing time for such a recursive query.
Any other ideas why it may not have worked out as expected? The whole subgraph, which begins with the TypeDeclaratio node, consists of 200 nodes with (seemingly) arbitrary depth and edge direction in between the graph.

It's interesting, that the following query returns all nodes and paths contained in this subgraph (although still fairly inefficient):

MATCH (:Project {name:"Project1}) -->(s:Submission)-->(t:TypeDeclaration)
MATCH path = (s)-->(t) -[*1..11]-(x:TGraph)

So there are actually two questions to this:
Why doesn't [*] terminate, although [1..11] already delivers all possible paths? Meaning, why would the search with [] be more extensive?
And, how can the performance of such a search be optimized? When I want to query 1000s of such subgraphs it will result in an unbearable overhead...

Because you're asking for EVERY possible path. It's an N^n problem.


CREATE (a:Node {name:"A"})-[:REL]->(b:Node {name:"B"})-[:REL]->(c:Node {name:"C"}), (c)-[:REL]->(a);

MATCH p=()-[*]-() RETURN p;     # 18 possible paths

Now, one more node, two more rels...

CREATE (a:Exponential {name:"A"})-[:REL]->(b:Exponential {name:"B"})-[:REL]->(c:Exponential {name:"C"})-[:REL]->(d:Exponential {name:"D"}), (d)-[:REL]->(a), (a)-[:REL]->(c);

MATCH p=(:Exponential)-[*]-(:Exponential) RETURN p    # 86 results, 12 of them 5 hops long. 


You should write your query to ensure it returns all nodes and paths, with the least number of result rows, or at least minimizing duplicates in the results.

The tie-fighter )-[*]-( gives you every possible path, any number of hops, so with large graphs it blows-up quickly. What do you want to do with the result? Just see it in the browser, or do you want to do work?

An additional note, is that if you can specify the relationship name, do, it will make the query faster.

MATCH (:Project {name:"Project1"}) -->(s:Submission)-->(t:TypeDeclaration)
MATCH (t)-[1..11]-(x:TGraph)
WITH t, x
MATCH (t)-[tRx]-(x), (x)-[xRx]-(x)
RETURN t, x, tRx, xRx

How to design a graph for efficient queries

  1. You should write your query to ensure it returns all nodes and paths, with the least number of result rows, or at least minimizing duplicates in the results.
  2. You want the whole graph of all nodes below a :TypeDeclaration node.

Note: Wherever you can rely on directionality do, it's the first method of reducing results.

Since what you're trying to do is "find all the nodes beneath a :TypeDeclaration, and all the relationships between those nodes... build your graph to make the first part of that one hop.

MATCH (:Project {name:"Project1"}) -->(s:Submission)-->(t:TypeDeclaration)
MATCH (t)-[r]-(:TGraph)
SET r :DECLARATION; # preserve original hierarchy for later

MATCH (:Project {name:"Project1"}) -->(s:Submission)-->(t:TypeDeclaration)
MATCH (t)-[1..11]-(x:TGraph)
WITH t, x
CREATE (t)<-[:CLASS]-(x);

Now your queries can be fast

MATCH (:Project {name:"Project1"}) -->(s:Submission)-->(t:TypeDeclaration)<-[:CLASS]-(x:TGraph)
MATCH (x)-[r]-(x)

... and if you still really want all paths, you can do that... but with 200 nodes you're likely looking at millions of paths...

MATCH (:Project {name:"Project1"}) -->(s:Submission)-->(t:TypeDeclaration)<-[:CLASS]-(x:TGraph)
WITH t, x
MATCH p=(t)-[:DECLARATION]-(x)-[*]-(x)

Adding to this is at my own peril but...

I think part of the issue is that you need to step back to the use case a bit more. I read the paper, and the structure its producing ( Annotated AST) is pretty good. But I think you are making some things harder on Neo4j than they need to be, and a lot harder on yourself. Detecting pattern errors is a lot simpler after 20 iterations.... (groan)

I started to write up a lot more but realized it was probably of no interest to the community so if you send me your email, I'll send you some suggestions on design. We just finished a program that was just shy of 10M nodes and the performance was pretty acceptable.


1 Like