cancel
Showing results for 
Search instead for 
Did you mean: 

performance issues with simple match on small graph (lineage analysis, cypher doesn't terminate)

harmonie_shorts
Node Link

Hi, I've just started with a prestudy project for a data lineage use case.

Right now I'm using a rather simple graph of Data- and Process-Objects having impact on each other (the graph model consists of a node "Object" with one self-reference "has_impact_on").

There are only 1,548 Object-nodes and 3,352 has_impact_on-references in the graph (although with some cycles), but my queries don't terminate, e.g.:

MATCH p=(n)-[*]->()
where n.Obj_MEMBER = 'CT_MAPPING'
RETURN p

MATCH p=(n)-[*]->(m)
where n.Obj_MEMBER = 'CT_MAPPING'
RETURN m.Obj_TYPE_CAT, m.Obj_MEMBER, length(p), [x in nodes(p)|x.Obj_ID]

Could the cycles be the problem? Are there better alternatives, or is it because I'm currently testing on a free aura instance?

Thanks a lot for your help.

Best regards, Thomas

6 REPLIES 6

dana_canzano
Neo4j
Neo4j

@harmonie_shorts

given the usage of

where n.Obj_MEMBER = 'CT_MAPPING'

this would lend itself to indexing Obj_MEMBER. But to do so w would need to know the label this property is associated with. And for example if this property `Obj_MEMBER` and if asssociated with a label named 'Label101` then create an index as

create index :Label101(Obj_Member);

``and then also change the query to reference this label and thus

MATCH p=(n:Label101)-[*]->()
where n.Obj_MEMBER = 'CT_MAPPING'
RETURN p

your current cypher is going to walk the entire graph and expand all paths.

Thanks a lot, the Nodes' label is "Object", so I've created an Index using this statement:


CREATE INDEX IdxObjectMember
FOR (n:Object)
ON (n.Obj_MEMBER)

, then I slightly modified my query by specifying n, following your recommendation (I hope):

MATCH p=(n:Object)-[*]->()
where n.Obj_MEMBER = 'CT_MAPPING'
RETURN p LIMIT 1000

May I ask another Noob-Question, because I'm a little bit confused:
As far as I've understood, neo4j uses simple pointer operations for traversing the Graph-Relationships.
Why should an index on Nodes improve the execution, if - as in my case - a unique starting point for the traversal is provided (here: Object.Object_MEMBER = 'CT_MAPPING'?

Thanks a lot for your help and understanding 🙂

@harmonie_shorts the original

MATCH p=(n)-[*]->()
where n.Obj_MEMBER = 'CT_MAPPING'
RETURN p

effectively says `match p=(n) .....` i.e. look at every node. If your graph has 100 million nodes but only 20k are with label `:Object` then we look at all 100 million nodes. Whereas a `match p=(n:Object)....` says only look at the 20k nodes with a label of `:Object'.

Further if you create a index on `:Object', and as described in prior posts above, then we use the index to find which nodes have said property/value per index. And since this index is in some b-tree like form if you have 20k `:Object' nodes even traversing to the bottom leaf of the 20k index could take a significantly small number of comparisons (i.e with 1 comparison you can eliminate 10k entries, 2 comparisons you can eliminate an additional 5k, 3 comparisons, another 2.5k etc)

hmm... but I've only got 1,550 Object-Nodes and 3,350 relationships?

Here's the simple Graph Model with query and Index, it's been running for 4,5 hours now.
What's wrong?:
neo4j_lineage_performance_issue.jpg

Here's an explain query:

neo4j_lineage_performance_issue_explain.jpg

PS: it didn't quite work as expected, either my way is still incorrect, or something else is missing:
This cypher still hasn't terminated after more than 1 hour:

MATCH p=(n:Object)-[*]->(m)
where n.Obj_MEMBER = 'CT_MAPPING'
RETURN m.Obj_TYPE_CAT, m.Obj_MEMBER, length(p), [x in nodes(p)|x.Obj_ID] LIMIT 100000
?