Getting a total, maximum and ratio in query

Hi,

I’m trying to get a ratio of relationship count of each node, to the highest number of relationships for any more. I’ve tried a few things, but the closest I’ve got is this:

MATCH (n:Allele)
WITH n, size((n)<-[:HETEROZYGOUS]-()) AS hetCount, 2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
(size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-()))) as totCount
where totCount > 0
RETURN n, hetCount, homCount, totCount, max(totCount) as outOf, totCount / max(totCount) as ratio ORDER BY totCount ASC LIMIT 100

This is giving me everything I want, except the maximum, and therefore the correct ratio (I imagine it could be optimised as well!). My plan is to use this to set the ratio as a property on :Allele which will then show the rarity of the allele, by number of relationships and can be rerun and indexed everytime new Allele nodes are added.

I have ~1.5m nodes and ~800m relationships that need to be counted so this is quite a large query, so am not sure if a gds algorithm would be a suitable alternative. Any help would be much appreciated.

Thanks

One of these will probably do it:

...but first, I need to reformat your Cypher so I can make better sense of it.

MATCH (n:Allele)
WITH
    n,
    size((n)<-[:HETEROZYGOUS]-()) AS hetCount,
    2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
    (size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-()))) AS totCount
WHERE
    totCount > 0
RETURN
    n, hetCount, homCount, totCount,
    max(totCount) AS outOf,
    totCount / max(totCount) AS ratio
ORDER BY totCount ASC
LIMIT 100

Your root issue here is the max in the return clause has an implicit GROUP BY each row. To fix that, I'm sad to say, you're gonna have to run the query twice. Once to get the real totals, and a second time to get the values you're looking for, that needs the total and max to do the math.

MATCH (n:Allele)
WITH max((size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-())))) AS totCountMax
MATCH (n:Allele)
WITH
    n, totCountMax,
    size((n)<-[:HETEROZYGOUS]-()) AS hetCount,
    2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
    hetCount + homCount AS totCount
WHERE
    totCount > 0
RETURN
    n, hetCount, homCount, totCount,
    totCountMax AS outOf,
    totCount / totCountMax AS ratio
ORDER BY totCount ASC
LIMIT 100

A better way.

Rather than doing all the path matching like this, it might be worth mutating the data a little, to make it all run more efficiently.

MATCH (n:Allele)
SET n.hetCount = size((n)<-[:HETEROZYGOUS]-())
SET n.homCount= 2*size((n)<-[:HOMOZYGOUS]-())
;

# Then, this should run better
MATCH (n:Allele)
WITH n, max(n.hetCount + n.homCount) as totCountMax
. . .

Lastly, a few references that might help with similar problems:

2 Likes

Thanks, @tony.chiboucas. I'll try that later once my current query is done - using a desktop so having to use swap for my queries on such a large dataset is seriously impacting performance. Might need to take some lessons on APOC soon as well!

Hi @tony.chiboucas. Just to clarify, the data transformation would set properties on the nodes, right? This is fine, but I was initially worried about the increased storage required across the large number of nodes (although I realise there aren't actually that many relative to relationships, which I had issues with properties on previously). I assume if this is the case, I could set a clause in another query, just to specifically remove these if I want to just set the ratio and not the counts?

Yes, that would add properties to :Allele nodes. Total increase in size-on-disk will be less than 5MB.

Absolutely. I would caution that it might be worth keeping the data, just to make recalculation much faster and easier.

One additional note: pulling all the relationships and counting them, will take a lot of time with such a large number of rels. A little apoc.periodic.iterate could go a long way here. Once you've counted the relationships, subsequent queries should be much faster.

Hi @tony.chiboucas, just about to try this now. Is it worth indexing the hetCount and homCount properties for this?

Hi @tony.chiboucas. I've set the hetCount and homCount now, and thanks, even without indexing this made the second query much quicker. However, I've used the following query and the outOf column is just showing the max for n, not the max for all of n and making ratio always 1:

MATCH (n:Allele)
WITH n, max(n.hetCount + n.homCount as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC

What am I doing wrong?

MATCH (n:Allele)

# HERE is your trouble. Let me explain...
# WITH n, max(n.hetCount + n.homCount as totCountMax
WITH max(n.hetCount + n.homCount as totCountMax

MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC

If you think about Cypher MATCH results as a spreadsheet (you should check the sheet results in these cases), you'll get one row for every MATCH. Aggregate functions will have an inherent GROUP BY on any variables you're including.

I've built a little NeoConsole to play with: http://console.neo4j.org/r/3bsdhd
image

x.val = 1
y.val = 2
z.val = 4

Now, what happens below, is that Cypher is trying to keep the b contents in the result, so the SUM aggregate function won't collapse the values of l.val beyond that. Essentially turning whatever you KEEP IN CONTEXT (b here) as an implicit GROUP BY.

MATCH (b:Box)-[]-(l:Label)
RETURN b, SUM(l.val) as sum;

image

However, if you omit that variable to get an aggregate of all the things...

MATCH (b:Box)-[]-(l:Label)
RETURN SUM(l.val) as sum;

image

You get the aggregate you need, the trick is then in using it across all the b nodes here...

MATCH (b:Box)-[]-(l:Label)
WITH SUM(l.val) as sum
MATCH (b:Box)-[]-(l:Label)
# now you have SUM and b, and l in context...
RETURN b, l, sum

image

1 Like

Thanks. I went for this in the end, although don't think I necessarily needed the return:

MATCH (n:Allele)
WITH MAX(n.hetCount + n.homCount) as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
SET n.frequency = ((totCount * 100.0000) / totCountMax)
RETURN n, n.hetCount, n.homCount, totCount, totCountMax n.frequency AS frequency
ORDER BY totCount ASC

1 Like