Getting a total, maximum and ratio in query

david.carter · August 8, 2020, 8:25am

Hi,

I’m trying to get a ratio of relationship count of each node, to the highest number of relationships for any more. I’ve tried a few things, but the closest I’ve got is this:

MATCH (n:Allele)
WITH n, size((n)<-[:HETEROZYGOUS]-()) AS hetCount, 2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
(size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-()))) as totCount
where totCount > 0
RETURN n, hetCount, homCount, totCount, max(totCount) as outOf, totCount / max(totCount) as ratio ORDER BY totCount ASC LIMIT 100

This is giving me everything I want, except the maximum, and therefore the correct ratio (I imagine it could be optimised as well!). My plan is to use this to set the ratio as a property on :Allele which will then show the rarity of the allele, by number of relationships and can be rerun and indexed everytime new Allele nodes are added.

I have ~1.5m nodes and ~800m relationships that need to be counted so this is quite a large query, so am not sure if a gds algorithm would be a suitable alternative. Any help would be much appreciated.

Thanks

tony.chiboucas · August 13, 2020, 10:37pm

One of these will probably do it:

Cypher max()
Combine Cypher collect() with apoc.coll.max

...but first, I need to reformat your Cypher so I can make better sense of it.

MATCH (n:Allele)
WITH
    n,
    size((n)<-[:HETEROZYGOUS]-()) AS hetCount,
    2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
    (size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-()))) AS totCount
WHERE
    totCount > 0
RETURN
    n, hetCount, homCount, totCount,
    max(totCount) AS outOf,
    totCount / max(totCount) AS ratio
ORDER BY totCount ASC
LIMIT 100

Your root issue here is the max in the return clause has an implicit GROUP BY each row. To fix that, I'm sad to say, you're gonna have to run the query twice. Once to get the real totals, and a second time to get the values you're looking for, that needs the total and max to do the math.

MATCH (n:Allele)
WITH max((size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-())))) AS totCountMax
MATCH (n:Allele)
WITH
    n, totCountMax,
    size((n)<-[:HETEROZYGOUS]-()) AS hetCount,
    2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
    hetCount + homCount AS totCount
WHERE
    totCount > 0
RETURN
    n, hetCount, homCount, totCount,
    totCountMax AS outOf,
    totCount / totCountMax AS ratio
ORDER BY totCount ASC
LIMIT 100

A better way.

Rather than doing all the path matching like this, it might be worth mutating the data a little, to make it all run more efficiently.

MATCH (n:Allele)
SET n.hetCount = size((n)<-[:HETEROZYGOUS]-())
SET n.homCount= 2*size((n)<-[:HOMOZYGOUS]-())
;

# Then, this should run better
MATCH (n:Allele)
WITH n, max(n.hetCount + n.homCount) as totCountMax
. . .

Lastly, a few references that might help with similar problems:

https://neo4j.com/developer/cypher/aggregation-returns-functions/

david.carter · August 14, 2020, 9:09am

Thanks, @tony.chiboucas. I'll try that later once my current query is done - using a desktop so having to use swap for my queries on such a large dataset is seriously impacting performance. Might need to take some lessons on APOC soon as well!

david.carter · August 14, 2020, 12:49pm

Hi @tony.chiboucas. Just to clarify, the data transformation would set properties on the nodes, right? This is fine, but I was initially worried about the increased storage required across the large number of nodes (although I realise there aren't actually that many relative to relationships, which I had issues with properties on previously). I assume if this is the case, I could set a clause in another query, just to specifically remove these if I want to just set the ratio and not the counts?

tony.chiboucas · August 14, 2020, 5:12pm

Yes, that would add properties to :Allele nodes. Total increase in size-on-disk will be less than 5MB.

Absolutely. I would caution that it might be worth keeping the data, just to make recalculation much faster and easier.

One additional note: pulling all the relationships and counting them, will take a lot of time with such a large number of rels. A little apoc.periodic.iterate could go a long way here. Once you've counted the relationships, subsequent queries should be much faster.

david.carter · August 19, 2020, 9:02am

Hi @tony.chiboucas, just about to try this now. Is it worth indexing the hetCount and homCount properties for this?

david.carter · August 19, 2020, 12:19pm

Hi @tony.chiboucas. I've set the hetCount and homCount now, and thanks, even without indexing this made the second query much quicker. However, I've used the following query and the outOf column is just showing the max for n, not the max for all of n and making ratio always 1:

MATCH (n:Allele)
WITH n, max(n.hetCount + n.homCount as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC

What am I doing wrong?

tony.chiboucas · August 19, 2020, 4:57pm

MATCH (n:Allele)

# HERE is your trouble. Let me explain...
# WITH n, max(n.hetCount + n.homCount as totCountMax
WITH max(n.hetCount + n.homCount as totCountMax

MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC

If you think about Cypher MATCH results as a spreadsheet (you should check the sheet results in these cases), you'll get one row for every MATCH. Aggregate functions will have an inherent GROUP BY on any variables you're including.

I've built a little NeoConsole to play with: Neo4j Console

x.val = 1
y.val = 2
z.val = 4

Now, what happens below, is that Cypher is trying to keep the b contents in the result, so the SUM aggregate function won't collapse the values of l.val beyond that. Essentially turning whatever you KEEP IN CONTEXT (b here) as an implicit GROUP BY.

MATCH (b:Box)-[]-(l:Label)
RETURN b, SUM(l.val) as sum;

However, if you omit that variable to get an aggregate of all the things...

MATCH (b:Box)-[]-(l:Label)
RETURN SUM(l.val) as sum;

You get the aggregate you need, the trick is then in using it across all the b nodes here...

MATCH (b:Box)-[]-(l:Label)
WITH SUM(l.val) as sum
MATCH (b:Box)-[]-(l:Label)
# now you have SUM and b, and l in context...
RETURN b, l, sum

david.carter · August 23, 2020, 2:29pm

Thanks. I went for this in the end, although don't think I necessarily needed the return:

MATCH (n:Allele)
WITH MAX(n.hetCount + n.homCount) as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
SET n.frequency = ((totCount * 100.0000) / totCountMax)
RETURN n, n.hetCount, n.homCount, totCount, totCountMax n.frequency AS frequency
ORDER BY totCount ASC

Topic		Replies	Views
Return how many specific relationship there are for a specific node Cypher	4	475	April 29, 2020
Limiting the amount of edges specifically Cypher cypher	0	874	March 29, 2019
Counting Number of Relationships Per Grouped Nodes Cypher apoc , performance , cypher , operations , relationship , index	2	1185	November 21, 2020
Cypher Query - Multiple Counts based on relationship properties to determine percentages Cypher cypher	0	1348	June 27, 2019
Cypher Query - Count function Cypher cypher , neo4j , neo4j-desktop	4	273	November 22, 2021

Getting a total, maximum and ratio in query

A better way.

Lastly, a few references that might help with similar problems:

Related topics