Count Function in Return versus Set Statement

andy_hegedus · October 7, 2020, 6:24pm

Hi,

Working on a TF-IDF graph with a bit of dynamic scoring. Testing out some functionality with this code:

Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
WITH count(DISTINCT a) as Total_Doc, Collect({r1:r1, r2:r2}) as Texts
Unwind Texts as test
RETURN test.r1,test.r1.TF,count(test.r2), Total_Doc, log(1.0*Total_Doc/count(test.r2))  as idf, log(1.0*Total_Doc/count(test.r2))*test.r1.TF as TFidf

So I can look at the data and hand calculate to verify that code is correct It looks like the final TFidf is correct and now I want to set it as property key into the first relationship.

Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
WITH count(DISTINCT a) as Total_Doc, Collect({r1:r1, r2:r2}) as Texts
Unwind Texts as test
SET test.r1.TFidf= log(1.0*Total_Doc/count(test.r2))*test.r1.TF

And it is returning an error complaining about the count function. How do I work around this?
Andy

andrew_bowman · October 8, 2020, 12:47am

Use a WITH clause to project the result of the count() to a variable, then use it in a SET afterwards.

Typically you'll only want to use aggregation functions within WITH or RETURN clauses.

andy_hegedus · October 8, 2020, 5:54pm

Hi,

I have scoured the documentation many a time and it is really deficient in describing how the aggregation process with return and with work. Is there another document you could point to?

Basically I need three levels of aggregation:

The total number of documents.
Within each document there are n number of words associated with it.
For each word in a document there x number of other documents associated.

And to calculate for each word in a document a metric that relates it to the number of other documents and the total number.

In my query with return that did seem to give the correct result uses a collect statement to give the correct group. I cannot include a count(r2) in the collect statement because an aggregation function cannot be embedded in aggregation function.

Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
WITH count(DISTINCT a) as Total_Doc, Collect({r1:r1, r2:r2}) as Texts
Unwind Texts as test
RETURN test.r1,test.r1.TF,count(test.r2), Total_Doc, log(1.0*Total_Doc/count(test.r2))  as idf, log(1.0*Total_Doc/count(test.r2))*test.r1.TF as TFidf

I have tried this and it does NOT give the results needed. Total_Docs and Other _Docs are just the total number of relationships.

Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
With count(a) as Total_Docs, Collect(r1) as Texts, count(r2) as Other_Docs
Unwind Texts as test
RETURN test.r1, Total_Docs, Other_Docs

Andy

andrew_bowman · October 8, 2020, 6:33pm

Here's the documentation on aggregations:

The most important thing to be aware of is that when you aggregate, the non-aggregation variables present become the grouping key, which is the context for the aggregation. If you don't have your grouping key right, then you will be counting the wrong things.

Thanks for the detailed description of what you need. Let's see how to get all 3.

First, the total number of documents. We should get that separately from the rest, and if we keep it simple it will leverage the counts store and get a quick count for us:

MATCH (a:patent)
WITH count(a) as Total_Doc
...

There. That will be faster than your current approach.

Now to get the number of words per document. If we're just counting the :Is_in relationships per document, we can use the size(<pattern>) which will get the degree of that relationship from the node, which is pretty quick (note that size() isn't an aggregation function):

MATCH (a:patent)
WITH count(a) as Total_Doc
MATCH (a:patent)
WITH Total_Doc, a, size((a)-[:Is_in]-()) as Words_per_Doc
...

If you need the count of distinct words in the doc then that's different, we can't use the degree approach. We would have to MATCH out and count () the distinct words per document:

...
MATCH (a:patent)-[:Is_in]-(b:Word)
WITH Total_Doc, a, count(DISTINCT b) as Distinct_Words_per_Doc
...

Note that a has to be part of the grouping key so the words counted are per document, and not just the total words across all documents.

That said, I'm guessing you already handled this so that :Is_in only occurs once between a document and a distinct word. That may make the last part easier. We can use the degree of the :Word node -1 to get the number of other documents the word occurs in (we're not counting the current document).

MATCH (a:patent)
WITH count(a) as Total_Doc
MATCH (a:patent)
WITH Total_Doc, a, size((a)-[:Is_in]-()) as Words_per_Doc
MATCH (a)-[r1:Is_in]-(b:Word)
WITH Total_Doc, a, Words_per_Doc, r1, size((b)-[:Is_in]-()) - 1 as Other_Docs
WITH Total_Doc, r1, r1.TF as TF, Other_Docs, log(1.0*Total_Doc/Other_Docs) as idf
RETURN Total_Doc, r1, TF, idf, idf * TF as TFidf

andy_hegedus · October 8, 2020, 7:56pm

Hi Andrew,

Thank you very much for your very clear assistance. I have gotten it to work and able to set the values in the relationship.

Thank you again

Andy

Topic		Replies	Views
Propagating Sum() for a TFidf calculation Help Please Cypher	10	429	October 3, 2020
SET Property Cypher	2	302	April 8, 2022
Using WITH clause along with an aggregating function like count() Cypher	3	1803	May 28, 2021
How to add a degree property to a node Cypher	3	439	November 22, 2021
Aggregation, Returns, & Functions Neo4j Website	0	1633	August 5, 2020

July Summer Fun!

Count Function in Return versus Set Statement

Related topics