I am trying to do some text analysis in Neo4j and I want to write a query where it sorts the number of results in a descending order. My data is structured: (Word)->[next]->(Word)->[Next] etc, I want to write a query which says which are the most popular 3 word combinations, 4 word combinations, etc. I tried this but it always gives a count of one for word combinations:
MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word)
WITH [w1.name,w2.name,w3.name] AS word_pair
RETURN COUNT(word_pair) as frequency, word_pair
ORDER BY frequency DESC LIMIT 50
I think it's your data
This should actually work if you have the w1-w2-w3 pattern appearing in your graph more than once.
Can you share your data or an example?
This is how I loaded the data, so there is only one node for a unique word, but the relationships repeat. I thought there was a way to take an output of a query and count how many times it repeats within the dataset.
WITH split(tolower("My cat eats fish on Saturday")," ") as text
Unwind range(0,size(text)-2) as i
MERGE (w1:Word {name: text[i]})
ON CREATE SET w1.count = 1 ON MATCH SET w1.count=w1.count+1
MERGE (w2:Word {name: text[i+1]})
ON CREATE SET w2.count = 1 ON MATCH SET w2.count=w2.count+1
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count=r.count+1
RETURN w1,r, w2
second statement
WITH split(tolower("My cat eats cat food on Mondays")," ") as text
Unwind range(0,size(text)-2) as i
MERGE (w1:Word {name: text[i]})
ON CREATE SET w1.count = 1 ON MATCH SET w1.count=w1.count+1
MERGE (w2:Word {name: text[i+1]})
ON CREATE SET w2.count = 1 ON MATCH SET w2.count=w2.count+1
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count=r.count+1
You're both right and wrong.
- if there were duplicate nodes/paths in the graph your query should work
- but your import statements make sure that there are no duplicate nodes and rels in the graph
So what you need to do is to take either the word or relationship frequencies into account and sum them up.
e.g.
MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word)
RETURN [w1.name,w2.name,w3.name] AS word_pair,
sum(r.count + r2.count) as frequency
ORDER BY frequency DESC LIMIT 50
If you want to you can additionally sum up the counts of the words, or actually use a proper formula for computing a relevance score.
Thanks, that's really helpful!