Newbie question: How to aggregate and count identical query results?


(Arunas Kraujutis) #1

I am trying to do some text analysis in Neo4j and I want to write a query where it sorts the number of results in a descending order. My data is structured: (Word)->[next]->(Word)->[Next] etc, I want to write a query which says which are the most popular 3 word combinations, 4 word combinations, etc. I tried this but it always gives a count of one for word combinations:

MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word) 
WITH [w1.name,w2.name,w3.name] AS word_pair 
RETURN COUNT(word_pair) as frequency, word_pair 
ORDER BY frequency DESC LIMIT 50

(Michael Hunger) #2

I think it's your data

This should actually work if you have the w1-w2-w3 pattern appearing in your graph more than once.

Can you share your data or an example?


(Arunas Kraujutis) #3

This is how I loaded the data, so there is only one node for a unique word, but the relationships repeat. I thought there was a way to take an output of a query and count how many times it repeats within the dataset.

WITH split(tolower("My cat eats fish on Saturday")," ") as text
Unwind range(0,size(text)-2) as i
MERGE (w1:Word {name: text[i]})
ON CREATE SET w1.count = 1 ON MATCH SET w1.count=w1.count+1
MERGE (w2:Word {name: text[i+1]})
ON CREATE SET w2.count = 1 ON MATCH SET w2.count=w2.count+1
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count=r.count+1
RETURN w1,r, w2

second statement

WITH split(tolower("My cat eats cat food on Mondays")," ") as text
Unwind range(0,size(text)-2) as i
MERGE (w1:Word {name: text[i]})
ON CREATE SET w1.count = 1 ON MATCH SET w1.count=w1.count+1
MERGE (w2:Word {name: text[i+1]})
ON CREATE SET w2.count = 1 ON MATCH SET w2.count=w2.count+1
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count=r.count+1

(Michael Hunger) #4

You're both right and wrong.

  1. if there were duplicate nodes/paths in the graph your query should work
  2. but your import statements make sure that there are no duplicate nodes and rels in the graph :slight_smile:

So what you need to do is to take either the word or relationship frequencies into account and sum them up.

e.g.

MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word)
RETURN [w1.name,w2.name,w3.name] AS word_pair,
sum(r.count + r2.count) as frequency
ORDER BY frequency DESC LIMIT 50

If you want to you can additionally sum up the counts of the words, or actually use a proper formula for computing a relevance score.


(Arunas Kraujutis) #5

Thanks, that's really helpful!