## Find same sequences of nodes in a graph with multiple chains of nodes (Sequential Pattern Mining algorithm)

Graph Buddy

Given a graph containing several letters, with every letter modelled as a chain of word nodes connected by NEXT edges: is there a way to find sequences of nodes of a given length covering the same text, without specifying a start node for the query?
Best,
Andreas

31 REPLIES
Ninja

Hello @andreas.kuczera

Are you looking for a Sequential Pattern Mining algorithm?

Regards,
Cobra

Graph Buddy

Yes, sounds good to me.

Ninja

I'm also looking for one ...

I asked Neo4j whether they were going to add one to GDS in the future, but they told me it was not planned. I know there is a pattern mining library in R, but I am looking for a Cypher solution. I haven't started to think about how to do it yet, but maybe we can work out a Cypher solution here.
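Until such an algorithm lands in GDS, the core idea can be prototyped outside the database. The sketch below (plain Python, with made-up letter data) counts every contiguous n-gram across several token chains and keeps those occurring in more than one chain — essentially what a Cypher or APOC solution would need to compute:

```python
from collections import defaultdict

def frequent_ngrams(chains, n):
    """Count contiguous n-grams across several chains (one chain per letter).

    Returns a dict mapping each n-gram to the set of chain ids it occurs in.
    Only n-grams appearing in more than one chain are kept.
    """
    occurrences = defaultdict(set)
    for chain_id, tokens in chains.items():
        for i in range(len(tokens) - n + 1):
            occurrences[tuple(tokens[i:i + n])].add(chain_id)
    return {gram: ids for gram, ids in occurrences.items() if len(ids) > 1}

# Toy data: two letters sharing the phrase "dear sir and friend".
letters = {
    "brief1": ["dear", "sir", "and", "friend", "greetings"],
    "brief2": ["hello", "dear", "sir", "and", "friend"],
}
shared = frequent_ngrams(letters, 4)
```

The same grouping could later be expressed in Cypher by collecting the letter numbers per n-gram of token values.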

Graph Buddy

Yes, we can start here.
My first try was a query with a fixed path length.

Ninja

What is your current cypher query?

Graph Buddy
``````// 4 identical tokens in a row
match (t1:Token)-[:NEXT_TOKEN]->
(t2:Token)-[:NEXT_TOKEN]->(t3:Token)-[:NEXT_TOKEN]->(t4:Token)
match (b1:Token)-[:NEXT_TOKEN]->
(b2:Token)-[:NEXT_TOKEN]->(b3:Token)-[:NEXT_TOKEN]->(b4:Token)
where t1.text = b1.text
and t2.text = b2.text
and t3.text = b3.text
and t4.text = b4.text
and t1.Briefnr <> b1.Briefnr
return t1.Briefnr, b1.Briefnr, b1.text, b2.text, b3.text, b4.text;
``````

Runs forever ...

Graph Fellow

Is there any scheduler on the instance? It's not running right now

Bennu

Graph Buddy

No, I stopped it. Can you reach the DB?

Graph Buddy

Works again now here: http://sandstormserv.mni.thm.de:7474/browser/

Ninja

I found a few topics which could be interesting:

I understand the logic, but I wonder whether it is optimized for big graphs.

Graph Buddy

There was an approach by Stefan Armbruster, but it wasn't merged into APOC: Commits · sarmbruster/neo4j-apoc-procedures · GitHub

Ninja

I don't know if I'm ready to develop my own plugin, but it would definitely be quicker in Java than in pure Cypher.

Graph Buddy

Cypher might be too slow ... that's why APOC could be a good solution.

Graph Fellow

Hi all!

If you have a small portion of the data with some examples, I'd be happy to help with a Pregel implementation.

Bennu

Hello @Bennu

Thank you, I will try to create a good dataset this afternoon

Regards,
Cobra

Ninja

Looks like someone developed their own plugin with the Apriori algorithm.
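For reference, the Apriori algorithm such a plugin would be based on is simple enough to sketch in a few lines of Python. This is a minimal, illustrative implementation (frequent itemsets only, no rule generation), not the plugin's actual code:

```python
def apriori(transactions, min_support):
    """Minimal Apriori: find all itemsets appearing in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        # Count support: how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        current = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in current})
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        current = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
    return frequent

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
result = apriori(transactions, min_support=2)
```

Note that classic Apriori mines unordered itemsets; for ordered token chains, a sequential variant (e.g. GSP or PrefixSpan) would be the closer fit.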

Ninja

You can use this dataset from this source; it's a little social network dataset. Here are the Cypher queries to create it:

``````CREATE (nAlice:User {id:'Alice', name:'Node Alice'})
,(nBridget:User {id:'Bridget', name:'Node Bridget'})
,(nCharles:User {id:'Charles', name:'Node Charles'})
,(nDoug:User {id:'Doug', name:'Node Doug'})
,(nMark:User {id:'Mark', name:'Node Mark'})
,(nNancy:User {id:'Nancy', name:'Node Nancy'})
,(nOliver:User {id:'Oliver', name:'Node Oliver'})

CREATE (nAlice)-[:FOLLOWS]->(nBridget)
,(nAlice)-[:FOLLOWS]->(nCharles)
,(nAlice)-[:FOLLOWS]->(nDoug)
,(nBridget)-[:FOLLOWS]->(nAlice)
,(nBridget)-[:FOLLOWS]->(nCharles)
,(nBridget)-[:FOLLOWS]->(nDoug)
,(nCharles)-[:FOLLOWS]->(nAlice)
,(nCharles)-[:FOLLOWS]->(nBridget)
,(nCharles)-[:FOLLOWS]->(nDoug)
,(nDoug)-[:FOLLOWS]->(nAlice)
,(nDoug)-[:FOLLOWS]->(nBridget)
,(nDoug)-[:FOLLOWS]->(nCharles)
,(nMark)-[:FOLLOWS]->(nNancy)
,(nMark)-[:FOLLOWS]->(nOliver)
,(nNancy)-[:FOLLOWS]->(nMark)
,(nNancy)-[:FOLLOWS]->(nOliver)
,(nOliver)-[:FOLLOWS]->(nMark)
,(nOliver)-[:FOLLOWS]->(nNancy)
,(nCharles)-[:FOLLOWS]->(nMark)
,(nMark)-[:FOLLOWS]->(nCharles)

CREATE (msg1:Message {id:'1', contents:'This is a message', day_sent: 'Monday'})
,(msg2:Message {id:'2', contents:'This is a message', day_sent: 'Monday'})
,(msg3:Message {id:'3', contents:'This is a message', day_sent: 'Monday'})
,(msg4:Message {id:'4', contents:'This is a message', day_sent: 'Monday'})
,(msg5:Message {id:'5', contents:'This is a message', day_sent: 'Monday'})
,(msg6:Message {id:'6', contents:'This is a message', day_sent: 'Monday'})
,(msg7:Message {id:'7', contents:'This is a message', day_sent: 'Monday'})
,(msg8:Message {id:'8', contents:'This is a message', day_sent: 'Monday'})
,(msg9:Message {id:'9', contents:'This is a message', day_sent: 'Tuesday'})
,(msg10:Message {id:'10', contents:'This is a message', day_sent: 'Tuesday'})
,(msg11:Message {id:'11', contents:'This is a message', day_sent: 'Tuesday'})
,(msg12:Message {id:'12', contents:'This is a message', day_sent: 'Tuesday'})
,(msg13:Message {id:'13', contents:'This is a message', day_sent: 'Tuesday'})
,(msg14:Message {id:'14', contents:'This is a message', day_sent: 'Monday'})
,(msg15:Message {id:'15', contents:'This is a message', day_sent: 'Thursday'})
,(msg16:Message {id:'16', contents:'This is a message', day_sent: 'Thursday'})
,(msg17:Message {id:'17', contents:'This is a message', day_sent: 'Thursday'})
,(msg18:Message {id:'18', contents:'This is a message', day_sent: 'Thursday'})
,(msg19:Message {id:'19', contents:'This is a message', day_sent: 'Thursday'})
,(msg20:Message {id:'20', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg21:Message {id:'21', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg22:Message {id:'22', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg23:Message {id:'23', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg35:Message {id:'35', contents:'Conversation Starter', day_sent: 'Saturday'})
,(msg36:Message {id:'36', contents:'Conversation Starter', day_sent: 'Saturday'})
,(msg54:Message {id:'54', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg55:Message {id:'55', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg56:Message {id:'56', contents:'Conversation Starter', day_sent: 'Sunday'})
,(nAlice)-[:SENT]->(msg2)
,(nBridget)-[:SENT]->(msg3)
,(nBridget)-[:SENT]->(msg4)
,(nBridget)-[:SENT]->(msg5)
,(nCharles)-[:SENT]->(msg6)
,(nDoug)-[:SENT]->(msg7)
,(nDoug)-[:SENT]->(msg8)
,(nAlice)-[:SENT]->(msg9)
,(nBridget)-[:SENT]->(msg10)
,(nCharles)-[:SENT]->(msg11)
,(nDoug)-[:SENT]->(msg12)
,(nDoug)-[:SENT]->(msg13)
,(nAlice)-[:SENT]->(msg14)
,(nAlice)-[:SENT]->(msg15)
,(nAlice)-[:SENT]->(msg16)
,(nAlice)-[:SENT]->(msg17)
,(nBridget)-[:SENT]->(msg18)
,(nCharles)-[:SENT]->(msg19)
,(nBridget)-[:SENT]->(msg20)
,(nBridget)-[:SENT]->(msg21)
,(nNancy)-[:SENT]->(msg22)
,(nMark)-[:SENT]->(msg23)
,(nAlice)-[:SENT]->(msg24)
,(nBridget)-[:SENT]->(msg25)
,(nAlice)-[:SENT]->(msg26)
,(nBridget)-[:SENT]->(msg27)
,(nAlice)-[:SENT]->(msg28)
,(nNancy)-[:SENT]->(msg29)
,(nMark)-[:SENT]->(msg30)
,(nAlice)-[:SENT]->(msg31)
,(nBridget)-[:SENT]->(msg32)
,(nBridget)-[:SENT]->(msg33)
,(nMark)-[:SENT]->(msg34)
,(nMark)-[:SENT]->(msg35)
,(nAlice)-[:SENT]->(msg36)
,(nAlice)-[:SENT]->(msg37)
,(nBridget)-[:SENT]->(msg38)
,(nMark)-[:SENT]->(msg39)
,(nMark)-[:SENT]->(msg40)
,(nBridget)-[:SENT]->(msg41)
,(nCharles)-[:SENT]->(msg42)
,(nBridget)-[:SENT]->(msg43)
,(nAlice)-[:SENT]->(msg44)
,(nCharles)-[:SENT]->(msg45)
,(nDoug)-[:SENT]->(msg46)
,(nDoug)-[:SENT]->(msg47)
,(nMark)-[:SENT]->(msg48)
,(nAlice)-[:SENT]->(msg49)
,(nMark)-[:SENT]->(msg50)
,(nAlice)-[:SENT]->(msg51)
,(nBridget)-[:SENT]->(msg52)
,(nAlice)-[:SENT]->(msg53)
,(nAlice)-[:SENT]->(msg54)
,(nAlice)-[:SENT]->(msg55)
,(nAlice)-[:SENT]->(msg56)
,(nCharles)-[:SENT]->(msg57)
,(nAlice)-[:SENT]->(msg58)
,(nCharles)-[:SENT]->(msg59)
,(nAlice)-[:SENT]->(msg60)
,(nCharles)-[:SENT]->(msg61)
,(nCharles)-[:SENT]->(msg62)
,(nBridget)-[:SENT]->(msg63)
,(nCharles)-[:SENT]->(msg64)
,(nMark)-[:SENT]->(msg65)
,(nMark)-[:SENT]->(msg66)
,(nCharles)-[:SENT]->(msg67)
,(nBridget)-[:SENT]->(msg68)
,(nCharles)-[:SENT]->(msg69)
CREATE (msg5)-[:FORWARD]->(msg2)
,(msg6)-[:FORWARD]->(msg2)
,(msg7)-[:FORWARD]->(msg3)
,(msg8)-[:FORWARD]->(msg1)
,(msg11)-[:FORWARD]->(msg10)
,(msg12)-[:FORWARD]->(msg10)
,(msg13)-[:FORWARD]->(msg11)
,(msg14)-[:FORWARD]->(msg3)
,(msg15)-[:FORWARD]->(msg4)
,(msg16)-[:FORWARD]->(msg5)
,(msg17)-[:FORWARD]->(msg10)
,(msg18)-[:FORWARD]->(msg6)
,(msg46)-[:FORWARD]->(msg39)
,(msg47)-[:FORWARD]->(msg40)

``````

I think there are three output possibilities:

• most frequent node sequence
• most frequent sequence of relationships
• most frequent path sequence (nodes & relationships)
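The three variants can be illustrated with a toy count in Python; the path tuples below are hypothetical stand-ins for what a Cypher MATCH would return:

```python
from collections import Counter

# Each path is a pair: (node ids along the path, relationship types along it).
paths = [
    (("A", "B", "C"), ("KNOWS", "LIKES")),
    (("A", "B", "D"), ("KNOWS", "LIKES")),
    (("A", "B", "C"), ("KNOWS", "LIKES")),
]

node_seqs = Counter(nodes for nodes, _ in paths)   # most frequent node sequence
rel_seqs = Counter(rels for _, rels in paths)      # most frequent relationship sequence
path_seqs = Counter(paths)                         # nodes and relationships together
```

The three counters differ exactly as the bullets describe: a relationship sequence can be more frequent than any single node sequence that realizes it.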

Graph Buddy

I have a running graph DB here: http://jlu-buster.mni.thm.de:9580/browser/
There are chains of Token nodes which represent the words of the letters.
The task is now to find sequences in these chains with the same pattern in the text property.

Graph Buddy

Hi Andreas,

I suggest first preparing the data properly so that it can fully benefit from the power of the graph. At the moment you are mainly doing text searches.

My suggestions are:

1.) There are 13840 Token nodes whose text equals " " - if not needed, delete them

2.) All text properties I have seen end with a space - if not needed, strip those spaces

3.) There was no index on the text property, so I created one using CREATE INDEX ON :Token(text)

4.) The property BriefNr that is used in a condition in your Cypher doesn't exist on the Token nodes

5.) There are many duplicate texts (15611 unique texts, 11319 texts used multiple times, blown up to 145427 nodes). I would suggest not using duplicate Tokens but just one unique Token per "word". You could then connect those unique Tokens with relationships that store the source of each occurrence (BriefNr, startIndex, endIndex) on the relationship.

With the data structured like this, you could easily run a query that shows similar paths in seconds, because similar sentences will traverse exactly the same nodes in the path. You will also no longer rely on slow, heavy text searches.
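Suggestion 5 can be sketched in Python to show the intended shape of the data: one unique token per word, with the provenance (letter number and position) stored on the connecting relationship. The letter data here is made up:

```python
from collections import defaultdict

def build_token_graph(letters):
    """Collapse duplicate words into one unique token each; store provenance on edges.

    `letters` maps a letter number (BriefNr) to its word list. The result maps each
    (word, next_word) pair to the list of (brief_nr, position) occurrences.
    """
    edges = defaultdict(list)
    for brief_nr, words in letters.items():
        for i in range(len(words) - 1):
            edges[(words[i], words[i + 1])].append((brief_nr, i))
    return edges

letters = {1: ["to", "be", "or", "not", "to", "be"], 2: ["not", "to", "be"]}
graph = build_token_graph(letters)
```

Because identical word pairs now share one edge, identical sentences traverse identical node sequences, which is exactly what makes the path comparison fast.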

Best,
Reiner

Graph Buddy

OK, I changed the graph accordingly. Now every Token has a unique value property.
Does this work for your query?
Best,
Andreas

Graph Buddy

There was a Token with value "" that would interfere with the results. I just tried to delete it with

`match (t:Token) where coalesce(t.value,"") = "" detach delete t`

but either that crashed the DB or someone stopped it.

My approach would then have been something like this, but I couldn't test it any more:

``````MATCH (t1:Token)-[r1:NEXT]->(t2:Token)-[r2:NEXT]->(t3:Token)-[r3:NEXT]->(t4:Token)
where r1.Briefnr = r2.Briefnr and r2.Briefnr = r3.Briefnr
with t1.value as v1, t2.value as v2, t3.value as v3, t4.value as v4, collect(r1.Briefnr) as Briefe
return v1, v2, v3, v4, Briefe
order by v1, v2, v3, v4
``````

Best,
Reiner

Graph Buddy

A more graph-like approach would be something like this:

``````MATCH path = ((n:Token)-[:NEXT*4]-(m:Token))
WHERE ALL(x IN relationships(path) WHERE x.Briefnr = head(relationships(path)).Briefnr)
WITH nodes(path) as n, head(relationships(path)).Briefnr as Brief
WITH n, COLLECT(Brief) as Briefe
WITH [x in n | x.value] as x, Briefe
RETURN *
``````

Graph Buddy

I transferred the DB to another server:
http://sandstormserv.mni.thm.de:7474/browser/
The login is the same. Maybe you can test it here?
Best
Andreas

The DB now contains only the Text and the Token Nodes.

Node Clone

Hi Andreas,
I just quickly looked into your sample data and discovered that you have multiple redundant NEXT relationships between the same nodes. I suggest distilling them down to just one.

Graph Buddy

It's not a bug, it's a feature. Every edge carries a property with the letter number so that the textual context can be recovered.

Graph Buddy

The following Cypher works well for the first 7 letters (limit 7), but when the eighth is included, it produces 7 million additional paths and takes too long to return a result.

Also that node with "" is still included.

``````match (:Token)-[r:NEXT]->(:Token)
with distinct r.Briefnr as Brief
limit 7
MATCH path = ((n:Token)-[:NEXT*3 {Briefnr:Brief}]-(m:Token))
with Brief, path
WITH distinct nodes(path) as n, Brief
WITH n, COLLECT(Brief) as Briefe
WHERE size(Briefe) > 1
WITH  [x in n | x.value] as x, Briefe
RETURN x, Briefe
``````

I'm sorry I'm not able to invest more time here.

Best,
Reiner

Graph Buddy

The "" seems to be a linebreak in our data and we have to adress this. Thanks for this starting point !

Node Clone

OK. But you don't need a relationship for every context. You could have one relationship with an array of contexts. The sheer number of relationships leads to a combinatorial explosion, with the effect that Reiner's Cypher query times out.
By aggregating the contexts into one relationship and using a BFS-style query, you should be able to achieve what you want.
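That aggregated model also makes the search itself cheap to sketch. The Python below is illustrative only (the edge data is made up): it walks the graph depth-first, intersecting the letter sets stored on the edges, so any chain no longer shared by at least two letters is pruned immediately:

```python
def shared_chains(edges, length):
    """Find token chains spanning `length` edges that are shared by >= 2 letters.

    `edges` maps (word, next_word) to the set of letter numbers using that transition.
    """
    succ = {}
    for (a, b), briefe in edges.items():
        succ.setdefault(a, []).append((b, briefe))
    results = []

    def walk(chain, briefe):
        if len(chain) == length + 1:
            results.append((tuple(chain), briefe))
            return
        for nxt, edge_briefe in succ.get(chain[-1], []):
            common = briefe & edge_briefe
            if len(common) > 1:      # prune: the chain must stay shared
                walk(chain + [nxt], common)

    for (a, b), briefe in edges.items():
        if len(briefe) > 1:
            walk([a, b], set(briefe))
    return results

edges = {
    ("wie", "geht"): {1, 2},
    ("geht", "es"): {1, 2},
    ("es", "dir"): {1},
}
result = shared_chains(edges, 2)   # chains spanning 2 NEXT edges (3 tokens)
```

The early pruning on the intersected letter sets is what a BFS-style query over aggregated relationships would buy you: the combinatorial explosion is cut off as soon as a path stops being shared.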

Graph Buddy

OK. With this query I create the tokens:

``````MATCH (t:Text)
WITH t.Value as sentence, t.Name as b
WITH split(sentence," ") as words, b
FOREACH ( idx IN range(0,size(words)-2) |
MERGE (w1:Token {value:trim(words[idx])})
MERGE (w2:Token {value:trim(words[idx+1])})
CREATE (w1)-[r:NEXT {Briefnr:b}]->(w2));
``````

I changed it this way:

``````// Store letter numbers as an array on a single relationship
MATCH (t:Text)
WITH t.Value as sentence, t.Name as b
WITH split(sentence," ") as words, b
FOREACH ( idx IN range(0,size(words)-2) |
MERGE (w1:Token {value:trim(words[idx])})
MERGE (w2:Token {value:trim(words[idx+1])})
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.list = [b]
ON MATCH SET r.list = r.list + b);
``````

Now we have a list of letter numbers on the relationships. But how can I find common chains of Tokens of length 5 shared by two letters?
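As long as the original texts are still at hand, the question reduces to intersecting the two letters' sets of 5-grams. A minimal Python sketch with made-up letter texts:

```python
def common_chains(words_a, words_b, n=5):
    """Return the chains of n consecutive tokens that appear in both letters."""
    def grams(ws):
        return {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}
    return grams(words_a) & grams(words_b)

brief_a = "mein lieber freund und kupferstecher ich gruesse dich".split()
brief_b = "hallo mein lieber freund und kupferstecher ich".split()
result = common_chains(brief_a, brief_b)
```

In the graph, the equivalent would be matching paths of four NEXT relationships whose letter lists all contain both letter numbers.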

My colleague @liebig figured out that with the above model it is not possible to recreate the letters' texts from the graph, because outgoing NEXT edges can be ambiguous. We will now try to create an explicit model of the text, with one node for every Token occurrence in a letter.
