Here's the documentation on aggregations:
The most important thing to be aware of is that when you aggregate, the non-aggregated variables in the same WITH or RETURN become the grouping key, which is the context for the aggregation. If the grouping key isn't right, you will be counting the wrong things.
Thanks for the detailed description of what you need. Let's see how to get all 3.
First, the total number of documents. We should get that separately from the rest, and if we keep the query simple it can use the count store and give us a quick count:
MATCH (a:patent)
WITH count(a) as Total_Doc
...
There. That will be faster than your current approach.
Now to get the number of words per document. If we're just counting the :Is_in relationships per document, we can use size(<pattern>), which gets the degree of that relationship type from the node and is pretty quick (note that size() isn't an aggregation function):
MATCH (a:patent)
WITH count(a) as Total_Doc
MATCH (a:patent)
WITH Total_Doc, a, size((a)-[:Is_in]-()) as Words_per_Doc
...
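A side note for newer versions: as far as I know, size() on a bare pattern was removed in Neo4j 5 in favor of COUNT subqueries. Here is a minimal sketch of the same step, assuming Neo4j 5 or later. I haven't checked whether it gets the same degree shortcut on your data, so it's worth confirming with PROFILE:

```cypher
// Total number of documents (kept simple so the count store can answer it)
MATCH (a:patent)
WITH count(a) AS Total_Doc
// Per document: count the :Is_in relationships with a COUNT subquery
MATCH (a:patent)
RETURN Total_Doc, a, COUNT { (a)-[:Is_in]-() } AS Words_per_Doc
```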
If you need the count of distinct words in the doc, then that's different and we can't use the degree approach. We would have to MATCH out and count() the distinct words per document:
...
MATCH (a:patent)-[:Is_in]-(b:Word)
WITH Total_Doc, a, count(DISTINCT b) as Distinct_Words_per_Doc
...
Note that a has to be part of the grouping key so the words counted are per document, and not just the total words across all documents.
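To make the grouping key point concrete, here is a small sketch contrasting the two cases, using the same labels as above (Total_Words is just an illustrative name, not something from your model):

```cypher
// No non-aggregated variable: the grouping key is empty, so this returns
// a single row counting the distinct words across all documents
MATCH (a:patent)-[:Is_in]-(b:Word)
RETURN count(DISTINCT b) AS Total_Words;

// `a` is the grouping key: one row per document,
// counting only that document's distinct words
MATCH (a:patent)-[:Is_in]-(b:Word)
RETURN a, count(DISTINCT b) AS Distinct_Words_per_Doc;
```

The only difference between the two is whether a appears next to the aggregation, and that alone decides what gets counted.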
That said, I'm guessing you already handled this so that :Is_in only occurs once between a document and a distinct word. That may make the last part easier. We can use the degree of the :Word node, minus 1, to get the number of other documents the word occurs in (we're not counting the current document):
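// Total number of documents
MATCH (a:patent)
WITH count(a) as Total_Doc
// Words per document, from the degree of :Is_in on each document
MATCH (a:patent)
WITH Total_Doc, a, size((a)-[:Is_in]-()) as Words_per_Doc
// For each word in the document, the number of other documents it occurs in
MATCH (a)-[r1:Is_in]-(b:Word)
WITH Total_Doc, a, Words_per_Doc, r1, size((b)-[:Is_in]-()) - 1 as Other_Docs
// idf from the document counts; TF is read from the relationship
WITH Total_Doc, r1, r1.TF as TF, Other_Docs, log(1.0*Total_Doc/Other_Docs) as idf
RETURN Total_Doc, r1, TF, idf, idf * TF as TFidf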
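One edge case to watch: a word that appears only in the current document has Other_Docs = 0, so Total_Doc/Other_Docs divides by zero. If that can happen in your data, one option is to divide by the word's full degree instead, i.e. the number of documents containing the word including the current one. This is my assumption about what you'd want, not the only way to define idf, and Doc_Freq is a name I made up for illustration. It still relies on :Is_in occurring only once per document and word:

```cypher
MATCH (a:patent)
WITH count(a) AS Total_Doc
MATCH (a:patent)-[r1:Is_in]-(b:Word)
// Doc_Freq: documents containing the word, including the current one,
// so it is at least 1 for every matched word
WITH Total_Doc, r1, size((b)-[:Is_in]-()) AS Doc_Freq
WITH Total_Doc, r1, r1.TF AS TF, log(1.0 * Total_Doc / Doc_Freq) AS idf
RETURN Total_Doc, r1, TF, idf, idf * TF AS TFidf
```

With this variant a word that occurs in every document gets an idf of 0, and rarer words get larger values.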