Hi,
I've been working on a project with Neo4j involving genetics. My current data model is set up as follows:
(:Allele)<-[:HETEROZYGOUS|:HOMOZYGOUS]-(:Subject)-[:had_result]->(:Phenotype)
Here, each :HOMOZYGOUS relationship type represents 2 relationships with the same :Allele, whereas a :HETEROZYGOUS is a single relationship to each of two different :Allele nodes.
I've created an additional :Rare label on :Allele to identify those nodes which are shared across less than 1% of the population. I've also created :High and :Low labels on :Phenotype nodes to identify where the :Phenotype {value} is more than two standard deviations above or below the mean for that :Phenotype (each different :Phenotype also has its own specific label, e.g. :BMI, :Insulin, etc.).
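To make the :High / :Low rule concrete, here is a minimal Python sketch of the two-standard-deviation cut-off applied to one phenotype's values. The function name `classify_values` is hypothetical, and using the sample standard deviation (rather than the population one) is an assumption; "Normal" simply means neither :High nor :Low.

```python
from statistics import mean, stdev

def classify_values(values, n_sd=2.0):
    """Label each value 'High', 'Low' or 'Normal' relative to mean +/- n_sd * SD.

    Assumption: the sample standard deviation is used; swap in
    statistics.pstdev for the population version.
    """
    mu = mean(values)
    sd = stdev(values)
    upper = mu + n_sd * sd
    lower = mu - n_sd * sd
    labels = []
    for v in values:
        if v > upper:
            labels.append("High")
        elif v < lower:
            labels.append("Low")
        else:
            labels.append("Normal")
    return labels
```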
I'm now trying to use Cypher to evaluate probabilities using Fisher's Exact Test, a standard statistical test of association between the rows and columns of a 2x2 contingency table. What I'm looking to do is identify the following counts:
| | Subject has High/Low Phenotype | Subject has Normal Phenotype |
|---|---|---|
| Subject has the Allele: | b | a |
| Subject doesn't have the Allele: | d | c |
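For reference, this is the calculation the four counts would feed into, as a minimal Python sketch outside of Neo4j. The function name `fisher_exact_two_sided` and its argument order (reading the table left to right, top to bottom: b, a, d, c) are hypothetical, and the two-sided definition used here (summing every table no more probable than the observed one) is an assumption about which variant of the test is wanted.

```python
from math import comb

def fisher_exact_two_sided(b, a, d, c):
    """Two-sided Fisher's exact test p-value for the 2x2 table

                        High/Low   Normal
        has allele         b         a
        no allele          d         c
    """
    row1 = b + a          # subjects with the allele
    row2 = d + c          # subjects without the allele
    col1 = b + d          # subjects with a High/Low phenotype
    n = row1 + row2
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(b)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed
    # one; the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(x_min, x_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )
```

For example, `fisher_exact_two_sided(3, 1, 1, 3)` evaluates the table with b=3, a=1, d=1, c=3.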
This is the Cypher that I have pulled together:
MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v:Rare)<--(a:Subject)-->(p:Phenotype)
WHERE a<>b AND n:High OR n:Low
MATCH (n)<--(d:Subject)-[r]->(u:Rare)
MATCH (u:Rare)<--(c:Subject)-->(p:Phenotype)
WHERE v<>u AND c<>d AND n:High OR n:Low
RETURN labels(v), labels(u), v.pos, v.bp, v.SNPid, u.pos, u.bp, u.SNPid,
CASE WHEN r.type='HOMOZYGOUS' THEN (2*count(r))
WHEN r.type='HETEROZYGOUS' THEN count(r) END AS alleleFreq,
a, b, c, d, labels(p)
ORDER BY alleleFreq LIMIT 1000
There are somewhere around 1.5m :Allele nodes and ~1800 :Subject nodes, with ~800m relationships connecting them all together. This is an intensive query, which assesses each node pairing individually, so I appreciate there's a lot to work through.
I'm only working on a desktop with 8GB RAM, so I have had to use considerable swap space, and so far this has been running for about a week! I was hoping to get some advice on the following:
- Does this query create the result I'm looking for?
- Is there a way to produce either the results of v or u instead of having to request both?
- Is there a way to optimise this, possibly using APOC procedures, with the intent to either speed up the operation or reduce the amount of memory required?
- Is this a very ineffective way to do this? I appreciate most people would take the query results and manipulate them outside of Neo4j, but I'm interested in whether Neo4j can be used effectively for statistical analysis like this.
Any help would be much appreciated!
Dave