The relationship [:Is_in] has a count property that records how many times a specific word appears in a patent.
As a first step, I want to store the TF (term frequency) part as a property of the relationship.
I am testing things out with a very simple query:
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN a.term, r.count as Num,1.0*r.count/sum(r.count)
There are 14 relationships in this query, with a total of 26 words (the sum of r.count).
The issue is that sum(r.count) is not going over all the relationships and only sees a single relationship (1 or 2). It looks like I am running into the issue of the RETURN statement having both a grouping key and an aggregating function. So how do I get the aggregating function resolved before the grouping takes effect? How do I get or pass a global sum (26) for the division?
Andy
The objective here is to implement TF-IDF (Term Frequency, Inverse Document Frequency), where the document group will be defined later by a query. The first part is to calculate all the term frequencies (the TF part), since they will not change as part of later queries.
The data model contains two node types, Document and Word, with one relationship, [:Is_in], since not all words are in all documents and most words will be in multiple documents.
The relationship [:Is_in] has a property, count, that defines how many times a word is in a given document. To calculate the term frequency I need to know two factors: the total number of words in a document and how many times that specific word occurs. So, for the example of a single document (I will need to expand it later to run through all documents):
MATCH (:document{num:7863179})<-[r:Is_in]-(a:Word)
RETURN a.term,r.count, 1.0*r.count/sum(r.count)
This returns a value of 1.0 for every ratio, which is not the intent. It appears that sum(r.count) is being grouped by a and r and does not reflect the global sum.
a.term        r.count   1.0*r.count/sum(r.count)
"improved"    2         1.0
"produce"     2         1.0
"decrease"    2         1.0
"thickness"   1         1.0
"film"        2         1.0
If I try to calculate the total earlier in the query, I cannot propagate the a and r variables:
MATCH (:document{num:7863179})<-[r:Is_in]-(a:Word)
WITH sum(r.count) as total
RETURN a.term,r.count,1.0*r.count/total
This returns an error about a and r not being defined.
If I include a and r in the WITH clause, I get the same result as the first attempt:
a.term        r.count   1.0*r.count/total
"improved"    2         1.0
"produce"     2         1.0
"decrease"    2         1.0
"thickness"   1         1.0
"film"        2         1.0
This query returns the correct total number of words:
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
WITH sum(r.count) as total
RETURN total
You have to collect() things if you want to propagate them:
MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect(w) AS words, collect(r) AS relations
RETURN total, words, relations
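For background on why the earlier attempts returned 1.0: as far as I understand Cypher's aggregation rules, every non-aggregated expression in a RETURN or WITH acts as a grouping key, so sum() is evaluated once per distinct combination of those keys. A minimal sketch with made-up inline rows (the terms and counts are illustrative stand-ins for the real graph, not your data) shows the difference:

```cypher
// Hypothetical inline rows standing in for the Is_in relationships.
UNWIND [{term: 'improved', count: 2},
        {term: 'thickness', count: 1},
        {term: 'film', count: 2}] AS row
// row.term and row.count are grouping keys here, so each row is its own
// group and sum(row.count) equals row.count: every ratio comes out as 1.0.
RETURN row.term AS term, row.count AS count,
       1.0 * row.count / sum(row.count) AS ratio
```

With no grouping keys in the projection, the aggregate covers every row instead:

```cypher
UNWIND [{term: 'improved', count: 2},
        {term: 'thickness', count: 1},
        {term: 'film', count: 2}] AS row
// No non-aggregated expression is projected, so there is a single group.
RETURN sum(row.count) AS total  // 2 + 1 + 2 = 5
```

That is why the total has to be computed in its own WITH step, and why anything you still need afterwards has to be carried through that step in aggregated form.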
That sort of worked, but not quite, since I cannot access the individual count values in the RETURN statement.
I modified it a bit, using your suggestion of collect() and then adding an UNWIND:
MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect(r) AS texts
UNWIND texts as target
RETURN target.count, 1.0*target.count/total as TF
It returns values that make sense on the face of it, though I don't know how to also get the word property value w.term.
MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect({r:r, w:w}) AS texts
UNWIND texts AS target
RETURN target.w.term AS term, target.r.count AS count, 1.0*target.r.count/total AS TF
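Since the stated goal is to store TF on the relationship, one possible follow-up is to compute the per-document total first and then match the relationships again. This is a sketch rather than a tested solution: the property name tf is an assumption on my part, and the label and property names are taken from the queries above.

```cypher
// Step 1: one total per patent (d is the only grouping key).
MATCH (d:patent {num: 7863179})<-[r:Is_in]-(:Word)
WITH d, sum(r.count) AS total
// Step 2: match the same relationships again, now with total in scope.
MATCH (d)<-[r:Is_in]-(:Word)
// tf is a hypothetical property name, not part of the original model.
SET r.tf = 1.0 * r.count / total
RETURN count(r) AS updated
```

Dropping the {num: 7863179} filter should apply the same calculation to every patent, since keeping d as a grouping key keeps the totals separate per document.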
Thank you. That collect notation is definitely new to me and not directly clear from the documentation. The argument of the collect function is only listed as "expression", and the expression page in the documentation does not clarify much, since an expression can be basically anything.
One slight tweak in the code provided:
1.0*target.count/total AS TF
should be
1.0*target.r.count/total AS TF