cancel
Showing results for 
Search instead for 
Did you mean: 

Join the community at Nodes 2022, our free virtual event on November 16 - 17.

Find same sequences of nodes in a graph with multiple chains of nodes (Sequential Pattern Mining algorithm)

andreas_kuczera
Graph Buddy

Given a graph with several letters with every letter modelled as a chain of word nodes connected with NEXT-edges. Is there a way to find sequences of nodes of a given length covering the same text without setting any start node for the query ?
Bests
Andreas

31 REPLIES 31

Cobra
Ninja
Ninja

Hello @andreas.kuczera

Are you looking for a Sequential Pattern Mining algorithm?

Regards,
Cobra

andreas_kuczera
Graph Buddy

Yes, sounds good to me 😉

Cobra
Ninja
Ninja

I'm also looking for one ...

I asked Neo4j if they were going to add any in GDS in the future but they told me it was not planned. I know there is a pattern mining library in R but I am looking for a Cypher solution. I haven't started to think about how to do it, but maybe we can start to find a Cypher solution here.

Yes, we can start here.
My first try was to query with fixes path length.

Cobra
Ninja
Ninja

What is your current cypher query?

// 4 Gleiche
match (t1:Token)-[:NEXT_TOKEN]->
(t2:Token)-[:NEXT_TOKEN]->(t3:Token)-[:NEXT_TOKEN]->(t4:Token)
match (b1:Token)-[:NEXT_TOKEN]->
(b2:Token)-[:NEXT_TOKEN]->(b3:Token)-[:NEXT_TOKEN]->(b4:Token)
where t1.text = b1.text
and t2.text = b2.text
and t3.text = b3.text
and t4.text = b4.text
and t1.Briefnr <> b1.Briefnr
return t1.Briefnr, b2.Briefnr, b1.text, b2.text, b3.text, b4.text;

Runs forever ...

Hi @andreas.kuczera !

Is there any scheduler on the instance? It's not running right now

Bennu

No i stopped it. Can you reach the DB ?

Works again now here: http://sandstormserv.mni.thm.de:7474/browser/
Login is the same 😉

andreas_kuczera
Graph Buddy

There was an approach by Stefan Armbruster but it wasn't merged to apoc: Commits · sarmbruster/neo4j-apoc-procedures · GitHub

Cobra
Ninja
Ninja

I don't know if I'm ready to develop my own plugin but it will definitely be quicker in java than in full cypher.

Cypher might be to slow ... thats why apoc could be a good solution.

Hi all!

If you have a small portion of the data with some examples I may like helping with a Pregel Implementation.

Bennu

Hello @Bennu

Thank you, I will try to create a good dataset this afternoon

Regards,
Cobra

Cobra
Ninja
Ninja

Looks like someone developed its own plugin with the Apriori algorithm.

Cobra
Ninja
Ninja

You can use this dataset from this source, it's a little social network dataset. Here is the cypher queries to create it:

CREATE (nAlice:User {id:'Alice', name:'Node Alice'})
,(nBridget:User {id:'Bridget', name:'Node Bridget'})
,(nCharles:User {id:'Charles', name:'Node Charles'})
,(nDoug:User {id:'Doug', name:'Node Doug'})
,(nMark:User {id:'Mark', name:'Node Mark'})
,(nNancy:User {id:'Nancy', name:'Node Nancy'})
,(nOliver:User {id:'Oliver', name:'Node Oliver'})

CREATE (nAlice)-[:FOLLOWS]->(nBridget)
,(nAlice)-[:FOLLOWS]->(nCharles)
,(nAlice)-[:FOLLOWS]->(nDoug)
,(nBridget)-[:FOLLOWS]->(nAlice)
,(nBridget)-[:FOLLOWS]->(nCharles)
,(nBridget)-[:FOLLOWS]->(nDoug)
,(nCharles)-[:FOLLOWS]->(nAlice)
,(nCharles)-[:FOLLOWS]->(nBridget)
,(nCharles)-[:FOLLOWS]->(nDoug)
,(nDoug)-[:FOLLOWS]->(nAlice)
,(nDoug)-[:FOLLOWS]->(nBridget)
,(nDoug)-[:FOLLOWS]->(nCharles)
,(nMark)-[:FOLLOWS]->(nNancy)
,(nMark)-[:FOLLOWS]->(nOliver)
,(nNancy)-[:FOLLOWS]->(nMark)
,(nNancy)-[:FOLLOWS]->(nOliver)
,(nOliver)-[:FOLLOWS]->(nMark)
,(nOliver)-[:FOLLOWS]->(nNancy)
,(nCharles)-[:FOLLOWS]->(nMark)
,(nMark)-[:FOLLOWS]->(nCharles)

CREATE (msg1:Message {id:'1', contents:'This is a message', day_sent: 'Monday'})
,(msg2:Message {id:'2', contents:'This is a message', day_sent: 'Monday'})
,(msg3:Message {id:'3', contents:'This is a message', day_sent: 'Monday'})
,(msg4:Message {id:'4', contents:'This is a message', day_sent: 'Monday'})
,(msg5:Message {id:'5', contents:'This is a message', day_sent: 'Monday'})
,(msg6:Message {id:'6', contents:'This is a message', day_sent: 'Monday'})
,(msg7:Message {id:'7', contents:'This is a message', day_sent: 'Monday'})
,(msg8:Message {id:'8', contents:'This is a message', day_sent: 'Monday'})
,(msg9:Message {id:'9', contents:'This is a message', day_sent: 'Tuesday'})
,(msg10:Message {id:'10', contents:'This is a message', day_sent: 'Tuesday'})
,(msg11:Message {id:'11', contents:'This is a message', day_sent: 'Tuesday'})
,(msg12:Message {id:'12', contents:'This is a message', day_sent: 'Tuesday'})
,(msg13:Message {id:'13', contents:'This is a message', day_sent: 'Tuesday'})
,(msg14:Message {id:'14', contents:'This is a message', day_sent: 'Monday'})
,(msg15:Message {id:'15', contents:'This is a message', day_sent: 'Thursday'})
,(msg16:Message {id:'16', contents:'This is a message', day_sent: 'Thursday'})
,(msg17:Message {id:'17', contents:'This is a message', day_sent: 'Thursday'})
,(msg18:Message {id:'18', contents:'This is a message', day_sent: 'Thursday'})
,(msg19:Message {id:'19', contents:'This is a message', day_sent: 'Thursday'})
,(msg20:Message {id:'20', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg21:Message {id:'21', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg22:Message {id:'22', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg23:Message {id:'23', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg24:Message {id:'24', contents:'Reply', day_sent: 'Friday'})
,(msg25:Message {id:'25', contents:'Reply', day_sent: 'Friday'})
,(msg26:Message {id:'26', contents:'Reply', day_sent: 'Friday'})
,(msg27:Message {id:'27', contents:'Reply', day_sent: 'Friday'})
,(msg28:Message {id:'28', contents:'Reply', day_sent: 'Friday'})
,(msg29:Message {id:'29', contents:'Reply', day_sent: 'Friday'})
,(msg30:Message {id:'30', contents:'Reply', day_sent: 'Friday'})
,(msg31:Message {id:'31', contents:'Reply', day_sent: 'Friday'})
,(msg32:Message {id:'32', contents:'Reply', day_sent: 'Friday'})
,(msg33:Message {id:'33', contents:'Reply', day_sent: 'Friday'})
,(msg34:Message {id:'34', contents:'Reply', day_sent: 'Friday'})
,(msg35:Message {id:'35', contents:'Conversation Starter', day_sent: 'Saturday'})
,(msg36:Message {id:'36', contents:'Conversation Starter', day_sent: 'Saturday'})
,(msg37:Message {id:'37', contents:'Reply', day_sent: 'Saturday'})
,(msg38:Message {id:'38', contents:'Reply', day_sent: 'Saturday'})
,(msg39:Message {id:'39', contents:'Reply', day_sent: 'Saturday'})
,(msg40:Message {id:'40', contents:'Reply', day_sent: 'Saturday'})
,(msg41:Message {id:'41', contents:'Reply', day_sent: 'Saturday'})
,(msg42:Message {id:'42', contents:'Reply', day_sent: 'Saturday'})
,(msg43:Message {id:'43', contents:'Reply', day_sent: 'Saturday'})
,(msg44:Message {id:'44', contents:'Reply', day_sent: 'Saturday'})
,(msg45:Message {id:'45', contents:'Reply', day_sent: 'Saturday'})
,(msg46:Message {id:'46', contents:'Reply', day_sent: 'Saturday'})
,(msg47:Message {id:'47', contents:'Reply', day_sent: 'Saturday'})
,(msg48:Message {id:'48', contents:'Reply', day_sent: 'Saturday'})
,(msg49:Message {id:'49', contents:'Reply', day_sent: 'Sunday'})
,(msg50:Message {id:'50', contents:'Reply', day_sent: 'Sunday'})
,(msg51:Message {id:'51', contents:'Reply', day_sent: 'Sunday'})
,(msg52:Message {id:'52', contents:'Reply', day_sent: 'Sunday'})
,(msg53:Message {id:'53', contents:'Reply', day_sent: 'Sunday'})
,(msg54:Message {id:'54', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg55:Message {id:'55', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg56:Message {id:'56', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg57:Message {id:'57', contents:'Reply', day_sent: 'Sunday'})
,(msg58:Message {id:'58', contents:'Reply', day_sent: 'Sunday'})
,(msg59:Message {id:'59', contents:'Reply', day_sent: 'Sunday'})
,(msg60:Message {id:'60', contents:'Reply', day_sent: 'Sunday'})
,(msg61:Message {id:'61', contents:'Reply', day_sent: 'Sunday'})
,(msg62:Message {id:'62', contents:'Reply', day_sent: 'Sunday'})
,(msg63:Message {id:'63', contents:'Reply', day_sent: 'Sunday'})
,(msg64:Message {id:'64', contents:'Reply', day_sent: 'Sunday'})
,(msg65:Message {id:'65', contents:'Reply', day_sent: 'Sunday'})
,(msg66:Message {id:'66', contents:'Reply', day_sent: 'Sunday'})
,(msg67:Message {id:'67', contents:'Reply', day_sent: 'Sunday'})
,(msg68:Message {id:'68', contents:'Reply', day_sent: 'Sunday'})
,(msg69:Message {id:'69', contents:'Reply', day_sent: 'Sunday'})
,(nAlice)-[:SENT]->(msg2)
,(nBridget)-[:SENT]->(msg3)
,(nBridget)-[:SENT]->(msg4)
,(nBridget)-[:SENT]->(msg5)
,(nCharles)-[:SENT]->(msg6)
,(nDoug)-[:SENT]->(msg7)
,(nDoug)-[:SENT]->(msg8)
,(nAlice)-[:SENT]->(msg9)
,(nBridget)-[:SENT]->(msg10)
,(nCharles)-[:SENT]->(msg11)
,(nDoug)-[:SENT]->(msg12)
,(nDoug)-[:SENT]->(msg13)
,(nAlice)-[:SENT]->(msg14)
,(nAlice)-[:SENT]->(msg15)
,(nAlice)-[:SENT]->(msg16)
,(nAlice)-[:SENT]->(msg17)
,(nBridget)-[:SENT]->(msg18)
,(nCharles)-[:SENT]->(msg19)
,(nBridget)-[:SENT]->(msg20)
,(nBridget)-[:SENT]->(msg21)
,(nNancy)-[:SENT]->(msg22)
,(nMark)-[:SENT]->(msg23)
,(nAlice)-[:SENT]->(msg24)
,(nBridget)-[:SENT]->(msg25)
,(nAlice)-[:SENT]->(msg26)
,(nBridget)-[:SENT]->(msg27)
,(nAlice)-[:SENT]->(msg28)
,(nNancy)-[:SENT]->(msg29)
,(nMark)-[:SENT]->(msg30)
,(nAlice)-[:SENT]->(msg31)
,(nBridget)-[:SENT]->(msg32)
,(nBridget)-[:SENT]->(msg33)
,(nMark)-[:SENT]->(msg34)
,(nMark)-[:SENT]->(msg35)
,(nAlice)-[:SENT]->(msg36)
,(nAlice)-[:SENT]->(msg37)
,(nBridget)-[:SENT]->(msg38)
,(nMark)-[:SENT]->(msg39)
,(nMark)-[:SENT]->(msg40)
,(nBridget)-[:SENT]->(msg41)
,(nCharles)-[:SENT]->(msg42)
,(nBridget)-[:SENT]->(msg43)
,(nAlice )-[:SENT]->(msg44)
,(nCharles)-[:SENT]->(msg45)
,(nDoug)-[:SENT]->(msg46)
,(nDoug)-[:SENT]->(msg47)
,(nMark)-[:SENT]->(msg48)
,(nAlice)-[:SENT]->(msg49)
,(nMark)-[:SENT]->(msg50)
,(nAlice)-[:SENT]->(msg51)
,(nBridget)-[:SENT]->(msg52)
,(nAlice)-[:SENT]->(msg53)
,(nAlice)-[:SENT]->(msg54)
,(nAlice)-[:SENT]->(msg55)
,(nAlice)-[:SENT]->(msg56)
,(nCharles)-[:SENT]->(msg57)
,(nAlice)-[:SENT]->(msg58)
,(nCharles)-[:SENT]->(msg59)
,(nAlice)-[:SENT]->(msg60)
,(nCharles)-[:SENT]->(msg61)
,(nCharles)-[:SENT]->(msg62)
,(nBridget)-[:SENT]->(msg63)
,(nCharles)-[:SENT]->(msg64)
,(nMark)-[:SENT]->(msg65)
,(nMark)-[:SENT]->(msg66)
,(nCharles)-[:SENT]->(msg67)
,(nBridget)-[:SENT]->(msg68)
,(nCharles)-[:SENT]->(msg69)
CREATE (msg5)-[:FORWARD]->(msg2)
,(msg6)-[:FORWARD]->(msg2)
,(msg7)-[:FORWARD]->(msg3)
,(msg8)-[:FORWARD]->(msg1)
,(msg11)-[:FORWARD]->(msg10)
,(msg12)-[:FORWARD]->(msg10)
,(msg13)-[:FORWARD]->(msg11)
,(msg14)-[:FORWARD]->(msg3)
,(msg15)-[:FORWARD]->(msg4)
,(msg16)-[:FORWARD]->(msg5)
,(msg17)-[:FORWARD]->(msg10)
,(msg18)-[:FORWARD]->(msg6)
,(msg46)-[:FORWARD]->(msg39)
,(msg47)-[:FORWARD]->(msg40)

CREATE (msg24)-[:REPLY_TO]->(msg20)
,(msg25)-[:REPLY_TO]->(msg24)
,(msg26)-[:REPLY_TO]->(msg25)
,(msg27)-[:REPLY_TO]->(msg26)
,(msg28)-[:REPLY_TO]->(msg27)
,(msg29)-[:REPLY_TO]->(msg21)
,(msg30)-[:REPLY_TO]->(msg21)
,(msg31)-[:REPLY_TO]->(msg21)
,(msg32)-[:REPLY_TO]->(msg28)
,(msg33)-[:REPLY_TO]->(msg30)
,(msg34)-[:REPLY_TO]->(msg33)
,(msg37)-[:REPLY_TO]->(msg35)
,(msg38)-[:REPLY_TO]->(msg35)
,(msg39)-[:REPLY_TO]->(msg37)
,(msg40)-[:REPLY_TO]->(msg38)
,(msg41)-[:REPLY_TO]->(msg38)
,(msg42)-[:REPLY_TO]->(msg37)
,(msg43)-[:REPLY_TO]->(msg37)
,(msg44)-[:REPLY_TO]->(msg43)
,(msg45)-[:REPLY_TO]->(msg42)
,(msg48)-[:REPLY_TO]->(msg36)
,(msg49)-[:REPLY_TO]->(msg48)
,(msg50)-[:REPLY_TO]->(msg49)
,(msg51)-[:REPLY_TO]->(msg41)
,(msg52)-[:REPLY_TO]->(msg50)
,(msg53)-[:REPLY_TO]->(msg52)
,(msg57)-[:REPLY_TO]->(msg54)
,(msg58)-[:REPLY_TO]->(msg57)
,(msg59)-[:REPLY_TO]->(msg58)
,(msg60)-[:REPLY_TO]->(msg59)
,(msg61)-[:REPLY_TO]->(msg60)
,(msg62)-[:REPLY_TO]->(msg55)
,(msg63)-[:REPLY_TO]->(msg62)
,(msg64)-[:REPLY_TO]->(msg63)
,(msg65)-[:REPLY_TO]->(msg64)
,(msg66)-[:REPLY_TO]->(msg56)
,(msg67)-[:REPLY_TO]->(msg66)
,(msg68)-[:REPLY_TO]->(msg67)
,(msg69)-[:REPLY_TO]->(msg68)

I think there are three output possibilities:

  • most frequent node sequence
  • most frequent sequence of relationships
  • most frequent path sequence (nodes & relations)

andreas_kuczera
Graph Buddy

I have a running graphDB here: http://jlu-buster.mni.thm.de:9580/browser/
Login is neo4j/1234
There are chains of Token-Nodes which show the Words of letters.
The task is now to show sequences in these chains of the same pattern in the property text.

Reiner
Graph Buddy

Hi Andreas,

I suggest to first prepare the data properly so that it fully can benefit from the power of the graph. Actually you do mainly text searches.

My suggestions are:

1.) There are 13840 Token with text equals " " - if not needed, delete them

2.) All text properties I have seen, ends with a space - if not needed, strip those spaces

3.) There was no index on the text property, I created one using CREATE INDEX ON :Token(text)

4.) The property BriefNr that is used in a condition in your cypher doesn't exist in the Token nodes

5.) There are many duplicates texts (15611 unique texts, 11319 multiple used texts blown up to 145427 nodes). I would suggest not using duplicate Tokens but just one 1 unique Token per "word". Then you could relate those unique Tokens with relations storing the source of the relation (BriefNr, startIndex, endIndex) on the relation.

With data structured like this, you could easily run a query that shows similar paths in seconds, because similar senctences will traverse exactly the same nodes in the path. Also you will not longer rely on slow heavy text searches.

Best,
Reiner

Ok, i changed the Graph accordingly. Now every Token has a unique value-Property.
Does this work for your query ?
Best,
Andreas

Reiner
Graph Buddy

There was a Token with value "" that would interfere the results. Just tried to delete it with

match (t:Token) where coalesce(t.value,"") = "" detach delete t

but either that crashed the db or someone stopped it.

My approach then would have been something like this, but I couldn't test it anymore

MATCH (t1:Token)-[r1:NEXT]->(t2:Token)-[r2:NEXT]->(t3:Token)-[r3:NEXT]->(t4:Token)
where r1.Briefnr = r2.Briefnr = r3.Briefnr 
with t1.value as v1, t2.value as v2, t3.value as v3, t4.value as v4, collect(r1.Briefnr) as Briefe
return v1, v2, v3, v4, Briefe
order by v1, v2, v3, v4

Best,
Reiner

A more graph-like approach would be something like this:

MATCH path = ((n:Token)-[:NEXT*4]-(m:Token)) 
  WHERE [x IN relationships(path) | x.Briefnr] = head(relationships(path)).Briefnr
WITH nodes(path) as n, head(relationships(path)).Briefnr as Brief
WITH n, COLLECT(Brief) as Briefe
WITH [x in n | x.value] as x, Briefe
RETURN *
``

andreas_kuczera
Graph Buddy

I transfered the db to another server:
http://sandstormserv.mni.thm.de:7474/browser/
login is the same. Maybe you can test it here ?
Best
Andreas

The DB now contains only the Text and the Token Nodes.

liebig
Node Clone

Hi Andreas,
I just quickly looked into your sample data and discovered, that you have multiple NEXT relationship which are redundant. I suggest to distill them down to just one.

Its not a bug, its a feature. Every edge has a property with the letternumber to get the textual contexts.

The following cypher working well for the first 7 letters (limit 7), but when the eights is included, this results in 7 million additional paths that will take too long to come a result.

Also that node with "" is still included.

match (:Token)-[r:NEXT]->(:Token)
with distinct r.Briefnr as Brief
limit 7
MATCH path = ((n:Token)-[:NEXT*3 {Briefnr:Brief}]-(m:Token)) 
with Brief, path
WITH distinct nodes(path) as n, Brief
WITH n, COLLECT(Brief) as Briefe
WHERE size(Briefe) > 1
WITH  [x in n | x.value] as x, Briefe
RETURN x, Briefe

I'm sorry not beeing able to invest more time here.

Best,
Reiner

The "" seems to be a linebreak in our data and we have to adress this. Thanks for this starting point !

Ok. But you don't need a relationship for every context. You could have one relationship with an array of contexts. The sheer number of relationships leads to a combinatorial explosion with the effect that Reiner's cypher query times out.
When aggregating the context in one relationship and with a BFS-style query you should be able to achieve what you want.

Ok. With this query i create the tokens:

MATCH (t:Text)
WITH t.Value as sentence, t.Name as b
WITH split(sentence," ") as words, b
FOREACH ( idx IN range(0,size(words)-2) |
MERGE (w1:Token {value:trim(words[idx])})
MERGE (w2:Token {value:trim(words[idx+1])})
CREATE (w1)-[r:NEXT {Briefnr:b}]->(w2));

I changed it this way:

// Property as Array with one Relation
MATCH (t:Text)
WITH t.Value as sentence, t.Name as b
WITH split(sentence," ") as words, b
FOREACH ( idx IN range(0,size(words)-2) |
MERGE (w1:Token {value:trim(words[idx])})
MERGE (w2:Token {value:trim(words[idx+1])})
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.list = b
ON MATCH SET r.list = r.list + ", " + b);

Now we have a list of letter numbers on the relationships. But how can i find common chains of Tokens of length 5 of two letters ?

My colleague @liebig figured out that with the above modell its not possible to recreate the letters texts from the graph as outgoing NEXT-edges could be ambigue. We will now try to create an explicit modell of the text with one node for every Token from a letter.