Fisher's Exact Test

Hi,

I've been working on a project with Neo4j involving genetics. My current data model is set up as follows:

(:Allele)<-[:HETEROZYGOUS|:HOMOZYGOUS]-(:Subject)-[:had_result]->(:Phenotype)

Where each :HOMOZYGOUS relationship type represents 2 relationships with the same :Allele, whereas a :HETEROZYGOUS is a single relationship to two different :Alleles.

I've created additional labels on :Allele, to identify those nodes which are shared across less than 1% of the population as :Rare. I've also created :High and :Low labels on :Phenotypes to identify where :Phenotype {value} are higher, or lower than two standard deviations of the mean for each different :Phenotype (each different :Phenotype also has its own specific label type, eg :BMI, :Insulin, etc).

I'm now trying to use Cypher to evaluate probabilities using Fisher's Exact Test. This is a standard statistical test which assesses the different possibilities. What I'm looking to do is identify the following:

Subject has High/Low Phenotype Subject has Normal Phenotype
Subject has the Allele: b a
Sibject doesn't have the Allele: d c

This is the Cypher that I have pulled together:

MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v:Rare)<--(a:Subject)-->(p:Phenotype)
WHERE a<>b AND n:High OR n:Low
MATCH (n)<--(d:Subject)-[r]->(u:Rare)
MATCH (u:Rare)<--(c:Subject)-->(p:Phenotype)
WHERE v<>u AND c<>d AND n:High OR n:Low
RETURN labels(v), labels(u), v.pos, v.bp, v.SNPid. u.pos, u.bp, u.SNPid,
CASE WHEN r.type='HOMOZYGOUS' THEN (2*count(r))
WHEN r.type='HETEROZYGOUS' THEN count(r) END AS alleleFreq,
a, b, c, d, labels(p)
ORDER BY alleleFreq LIMIT 1000

There are somewhere around 1.5m :Allele nodes, ~1800 :Subject nodes, with ~800m relationships connecting them all together. This is an intensive query, which assesses each node pairing individually, so appreciate there's a lot to work through.

I'm only working on a desktop with 8GB RAM, so have had to use considerable swap space and so far this has been running for about a week!... I was hoping to get some advice on the following:

  1. Does this query create the result I'm looking for?
  2. Is there a way to produce either the results of v or u instead of having to request both?
  3. Is there a way to optimise this, possibly using APOC procedures, with the intent to either speed up the operation or reduce the amount of memory required?
  4. Is this a very ineffective way to do this - I appreciate most people would use the query results and manipulate them outside of Neo4j, but I'm interested in whether Neo4j can be used effectively for statistical analysis like this.

Any help would be much appreciated!

Dave

Please post your question(s). Thanks.

1 Like

Sorry @ameyasoft, I'm too used to pressing Shift + Enter when typing these days! The question is ready now!

Thanks for your reply. Few questions:

Rare, High, and Low are these node labels or node properties?. Also, you mentioned High and Low as labels for Phenotype. Please help me understand your data model.
Thanks

1 Like

Sure, I have a lot of multi-labelled entities. Rare is an additional label for Allele nodes, and High and Low are additional labels for Phenotype nodes. I made this decision because I thought it would be easier to manage with such a large dataset, but perhaps this isn't an ideal way of managing the data and it would have been better represented with relationship taxonomies.

In summary, Alleles have the following properties:

pos - represents the position of the allele on each chromosome
bp - represents the base pair of that specific allele/SNP (Single Nucleotide Polymorphism)
SNPid - if this is a know SNP, it gives the reference for it

Each Allele node can have the following additional labels:

chromosome_n - where n represents the number of the chromosome the allele is a part of. Ideally, I would have created a taxonomy through relationships, but for this, I thought it would be easier to use labels.

Reference or Variant - this represents whether the allele is part of the genome reference, or is a variant from this (perhaps it is a less common allele than most of the population). There is also a relationship of alternative_of between each Reference and Variant for the same.

Rare - if the allele has a relationship with fewer than 1% of the population (of Subjects) then it will have this label.

Each subject has two alleles at each position. Therefore Allele nodes are connected to Subject nodes by two different types of relationship:

HETEROZYGOUS - represents a single connection, when the two alleles are different at each position on a chromosome.

HOMOZYGOUS is a single relationship between a Subject and Allele and a single position on a chromosome but represents two of the same relationship.

Subject nodes have one property:

subject_id - represents the anonymised unique identifier for each sample subject.

Subject nodes also have additional labels:

cohort - To enable stratification of different cohorts, each cohort will have its own label.
sex - signifies the sex of the subject

Subject nodes are connected to Phenotype nodes by only one relationship type;

had_result - represents the results of phenotype tests for each Subject (where available)

This had_result relationship also has a property of at_age, which shows the age that the measurement/value was taken where available.

Phenotype nodes have a single property:

value - this is a single value figure, representing whichever measure is taken for that phenotype.

Phenotypes have a large variety of additional labels which specify the phenotype taxonomy:

Phenotype Group - This might be Anthropometric Measurement or Haemotology and represents the grouping of the phenotype.

Phenotype Type - This might be Total Lean Mass or Haemoglobin and is the specific phenotype that the value property represents. There are approximately 50 different phenotype types in this dataset

In addition, there is also the High or Low label on Phenotypes with extreme values outside of 2sd of the mean for each Phenotype Type and across 4 different age ranges. Those that are higher than 2sds are given the label High, and those lower than 2sds are given the label Low.

Hopefully, that gives you a good idea of the data model I'm working with. Please feel free to ask any questions.

Thanks for the detailed explanation. This is very helpful.

1 Like

Please let me know which version of Neo4j you are using.

Try this:

OPTIONAL MATCH (n1:High)
WITH collect(distinct n1) as h1

OPTIONAL MATCH (n2:Low) 
WITH collect(distinct n2) + h1 as h2

UNWIND h2 as n

MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v)<--(a:Subject)-->(p:Phenotype)
WHERE id(b) <> id(a)

WITH p, n, a, b, v, r

MATCH (n)<--(d:Subject)-[r]->(u:Rare)
MATCH (u)<--(c:Subject)-->(p)
WHERE id(d) <> id(c)

WITH p, n, a, b, v, u, d, c, r

RETURN labels(v), labels(u), v.pos, v.bp, v.SNPid. u.pos, u.bp, u.SNPid,
CASE WHEN r.type='HOMOZYGOUS' THEN (2*count(r))
WHEN r.type='HETEROZYGOUS' THEN count(r) END AS alleleFreq, labels(p)
ORDER BY alleleFreq LIMIT 100
 
a, b, c, d represent nodes and your return is a table with node properties. RETURN a, b, c, d will result in a graph with those nodes.

Also try each portion of the code to see if you are getting the expected results.
Like after the OPTIONAL MATCH/ UNWIND h2 as n RETURN(n). Check the output.

Thanks @ameyasoft. I’m using Neo4j 3.5.14.

Will this be beneficial on memory usage or performance, or both? I only ask so I know whether to reduce the heap size to something nearer my actual memory size, or if it will still require a similar amount of memory?

Memory and performance both. AND n:High OR n:Low I think this does all nodes scan and that's why I used two optional match each one with different label and collected the nodes. In my opinion this should increase the performance

1 Like

Great, I’ll stop the previous query (it’s been going for about about 10 days now), reset the heap to something smaller and let you know how I get on!

Hi @ameyasoft, this is still taking quite a bit of time. Is there a feasible way to use APOC with this to make it a lot quicker, or doesn't this query lend itself to that?

OPTIONAL MATCH (n1:High)
WITH collect(distinct n1) as h1

OPTIONAL MATCH (n2:Low) 
WITH collect(distinct n2) + h1 as h2

UNWIND h2 as n

//RETURN COUNT(n)
//time to get the count..


MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v)<--(a:Subject)-->(p:Phenotype)
WHERE id(b) <> id(a)

WITH p, n, a, b, v, r

//RETURN p, n, a, b, v, r LIMIT 40

Run these checks and let me know. Thanks.

Hi @ameyasoft,
The first count came back as 2,093 and returned within 108ms.
The second query returned 40 records within 1ms and completed after 1813ms.

I guess it's the sheer volume of records being queried that is making the query hang (that and the restricted amount of memory my computer has). I guess the issue is that there are around 200k :Rare nodes, so it's doing ~400,000,000 possible queries?

I guess what I need is counts for a, b, c and d for each :Subject, but not sure how to get that instead of just showing the specific nodes, which is probably a much bigger query!

Change this return: //RETURN p, n, a, b, v, r LIMIT 40

to RETURN count(distinct a) as Cnta, count(distinct b) as Cntb.

Check if this helps.
1 Like

This is still running 4 days on @ameyasoft. Perhaps there’s not a good way to do this in cypher and it just needs to be done more statistically outside of Neo4j?

Sorry to hear that. Is it possible to send me some sample data so that I can run on it my machine. You can send the file to ameyasoft@gmail.com.

Interesting read, thank you for posting the question, and information on the dataset. I agree with ameyasoft, the query challenge is complicated enough that having access to the dataset (or subset) maybe required to give meaningful help.

Regarding query construction: This may be apparent now, but I didn't see it actually stated.

I recommend always (carefully) using LIMIT statements to return just a few and verify the results first before running queries wide open on all data.

I feel this is good advice for new neo4j users, but also for everyone with non-trivial data sets.

Also, if a query takes more than a few minutes to run, I'd likely kill the query and look at the query PROFILE, and redesign the query. Once sure the query does what you want, you could estimate the time to complete by extrapolation from timing smaller result returns (e.g. LIMIT 100 and LIMIT 1000) Note: this may be an exponential curve

Thanks @Joel , I haven't had time to review the dataset, but I'll have to review as it is genomic and therefore sensitive data. It's still running 6 days in, so I guess I should kill it. I'll run an explain on it afterwards.

It is likely to be exponential, as it was ok with a limit of ~40. I'll try and run it again with a limit of 40 subjects and profile it, to see how that fairs.

The plan is below; I imagine the 50 trillion rows is the issue!...