Cypher query validation for genome analysis

davemate23 · January 28, 2019, 8:02pm

Hi,

I'm in the last couple of days before my dissertation is due and after a long few weeks of manipulating a large amount of genomic and phenotype data, it should be ready to import tomorrow... I know I'm cutting it very fine - so apologies for this, but any help in the next day or so would be hugely appreciated!

I'm after validation of my queries before my database is ready, as I won't have long to test it and it's a big database on a not-so-powerful computer. I'm using Neo4j 3.5.1 in the Desktop version on Linux Mint. My data has spent a few days being curated on my (relatively slow) computer and I hope to load it via neo4j-import tomorrow, so this is pre-emptive.

I have the following labels:

(:Chromosome) <-[:BELONGS_TO]-(:Allele)<-[:CARRIES]-(:Sample)-[r]->(:Phenotype)

Where [r] has a variety of types between samples and phenotypes.

There is also a relationship on certain alleles in the same position on the chromosome of (:Allele)-[:ALTERNATIVE_OF]-(:Allele)

I first want to filter out unwanted samples and my initial thoughts are:

MATCH (s:SAMPLE) WHERE NOT s.id = 'sample1' AND 'sample2' AND 'sample3'.... etc
RETURN s

In the hope that this would return a list of all Samples except those specified? Would this be right, or is there a better way to filter out nodes?

Next, I intend to find all Alleles with fewer than 7 [:CARRIES] relationships with Samples, but there's another factor. I have three different names on these relationships depending on their type: 'homozygous','heterozygous - haplotype A' and 'heterozygous - haplotype B'. I need to double the count of the homozygous relationships, but count the other two only once each. My guess is as follows:

MATCH (a:Allele)<-[r1:CARRIES, r.name = “homozygous”]-(s:Sample) AND (a:Allele)<-r2[:CARRIES, r.name =~ “heterozygous.*”]

WITH a, count(r1 * 2) AS homozygous_count AND r2 AS heterozygous_count

WHERE allele_count <= 7

RETURN a

I'm not sure about the use of the AND operators though. Which should hopefully leave a list of Alleles with fewer than 7 relationships to Samples, including double those of homozygous? Does that make any sense, and how would I include the initial filter in the query?

Following that, the next stage is to take each of these Alleles individually and identify Phenotypes linked to them through Samples. This one I'm assuming as:

MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r]->(p:Phenotype)

RETURN p

COLLECT(a) as alleles

ORDER BY SIZE(alleles) DESC

This will hopefully generate a list of linked phenotypes, but is there a way to also get the properties of the relationships between Samples and Phenotypes? It might be even better if it were possible to get a mean or modal average value on certain properties, is that possible within a query?

An alternative option is to use something like below and analyse each different type of relationship between Sample and Phenotype nodes within specific parameters:

MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[:LDL]->(:Phenotype {name: cholesterol})

WHERE value:LDL > 7.0

RETURN phenotype,

COLLECT(a) as alleles

ORDER BY SIZE(alleles) DESC

If anyone can validate (or invalidate!), improve and help combine some of these queries then I would really appreciate the support. Thanks, Dave.

michael.hunger · January 29, 2019, 10:05am

Hey that's quite a number of questions.

I'll try to answer them in order:

For excluding things it often helps to tag them with a label such as :Excluded, or to tag the positive samples instead.
Also make sure not to misspell labels: you had Sample vs. SAMPLE, and labels are case-sensitive.

create constraint on (s:Sample) assert s.id is unique;

MATCH (s:Sample) 
WHERE NOT s.id IN $params
SET s:Excluded
RETURN count(*);
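If the aim is only to leave a handful of samples out of one query, without tagging anything, a list with IN is enough. The sketch below assumes the sample ids are the placeholder strings from the question; a chain of bare strings joined with AND is not a valid predicate.

```cypher
// Hypothetical ids copied from the question; replace with the real sample ids.
MATCH (s:Sample)
WHERE NOT s.id IN ['sample1', 'sample2', 'sample3']
RETURN s;
```

In the tagging query above, NOT s.id IN $params labels every sample outside the list, so $params would need to hold the ids to keep; if it holds the ids to exclude, drop the NOT.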

For degrees you can use WHERE size( (a)<-[:CARRIES]-() ) < 7 (note the direction: in your model the Sample points to the Allele).

For your more complex expression you can use:

MATCH (a:Allele)<-[rel:CARRIES]-(:Sample)
WITH a, sum(CASE rel.name WHEN 'homozygous' THEN 2 ELSE 1 END) AS alleleCount
WHERE alleleCount < 7
....
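To fold the sample filter into the weighted count, one possible sketch is below. It assumes the unwanted samples already carry the :Excluded label and that the zygosity is stored in a relationship property called name, as described in the question; adjust both if the import uses different names.

```cypher
// Count each homozygous CARRIES twice and each heterozygous one once,
// ignoring samples tagged :Excluded (assumed label, see above).
MATCH (a:Allele)<-[rel:CARRIES]-(s:Sample)
WHERE NOT s:Excluded
WITH a, sum(CASE rel.name WHEN 'homozygous' THEN 2 ELSE 1 END) AS alleleCount
WHERE alleleCount < 7
RETURN a, alleleCount
ORDER BY alleleCount DESC;
```

Alleles whose only carriers are excluded samples produce no rows at all here, so they will not show up with a count of 0.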

In the phenotype query you only missed a comma after RETURN p, and if you have multiple phenotypes a DISTINCT helps.
You might also want to add a limit:

MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r]->(p:Phenotype)

RETURN p, COLLECT(distinct a) as alleles

ORDER BY SIZE(alleles) DESC LIMIT 100

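Sure, you can also build up more complex structures, including the relationship properties (bind the CARRIES relationship to a variable, e.g. carries, first):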

RETURN p, COLLECT({allele: a, value1: s.foo, value2: carries.value, value3: r.value, ...}) AS alleles

which gives you a more complex list of maps/dictionaries as result.
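For the question about averages: Cypher has a built-in avg() aggregate, so a mean can be computed in the same query. The sketch below assumes the measurement sits in a numeric relationship property called value (a hypothetical name); if the import stored it as a string, wrap it in toFloat().

```cypher
// Mean of r.value per allele, phenotype and relationship type.
// r.value is an assumed property name; avg() skips null values.
MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r]->(p:Phenotype)
RETURN a, p, type(r) AS relType,
       count(DISTINCT s) AS samples,
       avg(r.value) AS meanValue
ORDER BY samples DESC
LIMIT 100;
```

Grouping by allele keeps each sample's measurement from being counted once per allele it carries, which would skew a phenotype-wide mean.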

Filtering on value.LDL > 7 is also possible (dot, not colon), as long as the variable you filter on is bound in the MATCH.
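Applied to the LDL query from the question, that could look like the sketch below. It assumes the reading is stored in a property called value on the LDL relationship (a hypothetical name; use whatever the import actually creates) and that the phenotype name is the string 'cholesterol', which needs quotes.

```cypher
// Bind the relationship and the phenotype so they can be filtered and returned.
MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[ldl:LDL]->(p:Phenotype {name: 'cholesterol'})
WHERE ldl.value > 7.0
RETURN p, collect(DISTINCT a) AS alleles
ORDER BY size(alleles) DESC;
```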
