Propagating sum() for a TF-IDF calculation: help please

Hi,

I have a relatively simple graph.


The relationship [:Is_in] contains a count variable that has how many times a specific word shows up in a patent.

I am first looking to include the TF part (Term frequency) into a property of the relationship.
And I am testing things out with a very simple query.

MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN a.term, r.count AS Num, 1.0*r.count/sum(r.count)

There are 14 relationships in this query, with a total of 26 words (the sum of r.count).

The issue is that sum(r.count) is not going over all the relationships; it only sees a single relationship (1 or 2). It looks like I am running into the issue of the RETURN statement having both a grouping key and an aggregating function. So how do I get the aggregating function resolved before the grouping? How do I get/pass a global sum (26) for the division?
Andy

Haven't tested this, but have you looked into using the collect function in between your MATCH and RETURN statements, e.g. WITH ..., collect(...)?

Hi,

Not sure how collect, which is an aggregating function similar to sum, would be used.
If I do this:

MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN sum(r.count)

It returns a value of 26 which is correct.
If I do this:

MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN r.count,sum(r.count)

It groups by the possible values of r.count and gives this.

r.count sum(r.count)
2 24
1 2

If I try to include an aggregating step

MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
WITH r,sum(r.count) AS fred
RETURN r.count, fred

It still groups, now by each relationship r, so every sum covers a single relationship:

r.count fred
2 2
2 2
2 2
1 1
2 2
2 2

If I remove the r in the WITH clause, I get an error.
Andy

Bumping this.

ANY HELP?
How do I get both the individual value and the sum of the values in the RETURN statement so I can calculate a simple ratio?

Andy

Hello @andy_hegedus :slight_smile:

Please give an example of the calculation and the expected result, and I will try to write the query for you.

Regards,
Cobra

Hi,

The objective here is to implement TF-IDF (Term Frequency, Inverse Document Frequency), where the document group will be defined later by a query. The first part is to calculate all the term frequencies (the TF part), since they will not change as part of later queries.
The data model contains two node types, Document and Word, with one relationship [:Is_in], since not all words are in all documents and most words will be in multiple documents.
The relationship [:Is_in] has a property, count, that defines how many times a word is in a given document. To calculate the term frequency I need to know two factors: the total number of words in a document and how many times that specific word is included. So for the example of a single document (I will need to expand it later to run through all documents):

MATCH (:document{num:7863179})<-[r:Is_in]-(a:Word)
RETURN a.term,r.count, 1.0*r.count/sum(r.count)

Returns a value of 1.0 for the ratio which is not the intent. It appears the function sum(r.count) is being segmented by a and r and does not reflect the global sum count.

a.term r.count 1.0*r.count/sum(r.count)
"improved" 2 1.0
"produce" 2 1.0
"decrease" 2 1.0
"thickness" 1 1.0
"film" 2 1.0

If I try to calculate the sum earlier in the query, I cannot propagate the a and r variables:

MATCH (:document{num:7863179})<-[r:Is_in]-(a:Word)
WITH sum(r.count) as total
RETURN a.term,r.count,1.0*r.count/total

Returns an error about a and r not being defined.

If I include a and r in the WITH clause, I get the same result as the first attempt.

a.term r.count 1.0*r.count/total
"improved" 2 1.0
"produce" 2 1.0
"decrease" 2 1.0
"thickness" 1 1.0
"film" 2 1.0

This query returns the correct number of total words

MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
WITH sum(r.count) as total
RETURN total

total
26

Andy

You have to collect() things if you want to propagate them:

MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect(w) AS words, collect(r) AS relations
RETURN total, words, relations

Regards,
Cobra

Hi Cobra,

Sort of worked but not quite since I cannot access the individual count values in the return statement.
I did modify it a bit using your suggestion of collection, but then also adding an unwind.

MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect(r) AS texts
UNWIND texts as target
RETURN target.count, 1.0*target.count/total as TF

It does return values that make sense on the face of it, though I don't know how to also get the word property value w.term.

target.count TF
2 0.07692307692307693
2 0.07692307692307693
2 0.07692307692307693
1 0.038461538461538464

MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect({r:r, w:w}) AS texts
UNWIND texts AS target
RETURN target.w.term AS term, target.r.count AS count, 1.0*target.r.count/total AS TF
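
An alternative that avoids collect and UNWIND, shown only as an untested sketch: keep the patent node as the sole grouping key so that sum() covers all of its relationships, then match the relationships a second time. It assumes the same labels and properties as the queries above; the variable name p is introduced here purely for illustration.

```cypher
// Step 1: aggregate with the patent as the only grouping key,
// so sum() runs over every Is_in relationship of that patent.
MATCH (p:patent {num: 7863179})<-[r:Is_in]-(:Word)
WITH p, sum(r.count) AS total
// Step 2: match the same relationships again; total is still in scope.
MATCH (p)<-[r:Is_in]-(w:Word)
RETURN w.term AS term, r.count AS count, 1.0 * r.count / total AS TF
```

Because total is carried through WITH next to p, each row of the second MATCH can divide its own r.count by the patent-wide sum.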

Hi Cobra,

Thank you. That collect notation is definitely new to me and not directly clear from the documentation. The argument of the collect function is only listed as an expression, and the expression page in the documentation does not clarify much, since an expression can be basically anything.

One slight tweak in the code provided:
1.0*target.count/total AS TF
should be
1.0*target.r.count/total AS TF

Thank you again.
Andy

No problem, I corrected the query :slight_smile: You can collect and build a map (dict) at the same time, which is very practical :slight_smile:

Hope this helped you solve your problem.

Regards,
Cobra
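
For the original goal of storing the TF value on the relationship for every patent, one possible extension is sketched below. It is untested against this data set and rests on assumptions: the property name tf is hypothetical, and the labels and the count property are taken from the queries in this thread. On a large graph the update may need to be run in batches.

```cypher
// Total word count per patent: p is the grouping key.
MATCH (p:patent)<-[r:Is_in]-(:Word)
WITH p, sum(r.count) AS total
// Re-match each patent's relationships and store the ratio.
// "tf" is a hypothetical property name, not part of the original model.
MATCH (p)<-[r:Is_in]-(:Word)
SET r.tf = 1.0 * r.count / total
```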