Query optimization on cyclical patterns in chemical reaction network

Hi all, I'm a newbie with Neo4j trying to do some pattern matching on a weird structure. I was able to get results for some of the query, but I had to go into my DB settings and increase the max heap size to ~18G in order for the query to run in a reasonable time. Now that I have added more features to the query (the MATCH clause below with autocatPath), it simply takes too much time to run and I don't physically have more memory to allocate...

Do you have any ideas to optimize the query for my desktop? At what point would you trade off between optimization and hosting on the cloud?

The schema of my graph is a single label graph where the label is "Molecule", and single edge where the edge is "FORMS". So (:Molecule)-[:FORMS]->(:Molecule). (I would include pictures but I am limited since I'm a new user...)

Database instance statistics: version 3.5.12 Enterprise, 6928 nodes, 22025 relationships.

Here is the full query:

// Match the major structures
// 0. Ring structure
MATCH ringPath=(beginMol:Molecule)-[:FORMS*3..5]->(beginMol:Molecule)
UNWIND nodes(ringPath) as ringMol // Create the iterator ringMol to iterate over all the molecules in the ringPath, assuring that only the distinct smiles strings are counted in ringMols (otherwise results will duplicate the beginMol because the ring query starts and ends at beginMol)
WITH collect(distinct ringMol.smiles_str) AS ringMols, ringPath, relationships(ringPath) as ringRels, beginMol
// Filter the query by several conditions
WHERE size(ringMols) > 2 // controls number of molecules in the ring
AND size(ringRels) = size(ringMols) // asserts that the number of molecules must equal the number of relationships in the ring (so the relationships don't hop molecules more than once and create a collapsed ring, i.e. a figure-8-looking structure)

// 1. Molecule in ring splitting to form the beginMol (autocatalytic structure)
MATCH autocatPath=(catMolInRing:Molecule)-[:FORMS]->(beginMol:Molecule)
WHERE catMolInRing.smiles_str IN ringMols
WITH ringMols, ringPath, ringRels, beginMol, catMolInRing, relationships(autocatPath) as autocatPathRels, nodes(autocatPath) as autocatPathNodes

// 2. Ring consumer structure
MATCH branchedBeginMolPath=(beginMol)-[:FORMS]->(beginMolConsumer:Molecule)
WHERE beginMol <> beginMolConsumer
WITH branchedBeginMolPath, beginMol, beginMolConsumer, ringMols as ringMols, ringPath as ringPath, ringRels as ringRels, catMolInRing, autocatPathRels, autocatPathNodes

// 3. Feeder structure (with additional consumer distinct from the consumer in the above step)
MATCH attachedPath=(feederMol:Molecule)-[:FORMS]->(intermediateMol:Molecule)-[:FORMS]->(consumerMol:Molecule)
WITH feederMol, intermediateMol, consumerMol, ringMols as ringMols, ringPath as ringPath, ringRels as ringRels, beginMol as beginMol, beginMolConsumer as beginMolConsumer, branchedBeginMolPath as branchedBeginMolPath, attachedPath as attachedPath, catMolInRing, autocatPathRels, autocatPathNodes
WHERE NOT beginMolConsumer.smiles_str IN ringMols
AND beginMolConsumer <> feederMol
AND beginMol <> feederMol
AND feederMol <> consumerMol
AND beginMol <> consumerMol
AND NOT feederMol.smiles_str IN ringMols
AND NOT consumerMol.smiles_str IN ringMols
AND intermediateMol.smiles_str IN ringMols
AND beginMolConsumer <> consumerMol
AND intermediateMol <> beginMol
AND NOT (beginMol:Molecule)-[:FORMS]->(beginMol:Molecule)<-[:FORMS]-(beginMol:Molecule) // Assert that the relationships in the ringPath must travel all in the same direction
// control the generation range at which the cycle is formed by assuming it can't exist until after the feederMol is formed
//{{COMMENT_OUT_FEEDER_GEN_LOGIC}}AND feederMol.generation_formed >= {{MIN_FEEDER_GENERATION}} 
//{{COMMENT_OUT_FEEDER_GEN_LOGIC}}AND feederMol.generation_formed <= {{MAX_FEEDER_GENERATION}}


// Finally, return results
RETURN ringMols, size(ringMols) as countMolsInRing, ringRels, relationships(branchedBeginMolPath) as branchedBeginMolPathRels, relationships(attachedPath) as attachedPathRels, beginMol, beginMolConsumer, feederMol, intermediateMol, consumerMol, nodes(ringPath) as ringPathNodes, nodes(branchedBeginMolPath) as branchedBeginMolPathNodes, nodes(attachedPath) as attachedPathNodes, catMolInRing, autocatPathRels, autocatPathNodes // ringPath, branchedBeginMolPath, attachedPath
LIMIT 1000

Here is a visualization of the target pattern I'm trying to match with this query:

So far, the closest I've gotten is something like this (which is missing the autocatPath pattern):

Some rules about the query if you have tips for restructuring:

  • The number of nodes in the ringPath can be 3..20 nodes long (I was doing 3..5 to limit the run time)
  • ringPath edges must all travel in the same direction
  • number of nodes and number of edges in ringPath must equal each other (so a ring structure is created rather than a pinched looking one where edges travel the same node twice)
  • beginMolConsumer, feederMol, and consumerMol must all be distinct from one another
  • intermediateMol can land anywhere in the ringPath, but it must be in the ringPath
  • catMolInRing can also land anywhere in the ringPath as long as it is more than 1 hop away from the beginMol. Another way to phrase this rule is that the edge between catMolInRing and beginMol must be distinct from the ringPath edges (i.e. catMolInRing can't be the same as the node in the ringPath which has an edge going to beginMol)
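
As a rough illustration of the ring and catMolInRing rules above (not a drop-in replacement for the full query), a sketch could look like the following. It assumes each molecule is stored as exactly one node, so it compares nodes directly instead of comparing smiles_str, and it reads "more than 1 hop away" as also excluding beginMol itself. The name ringNodes is introduced here only for the sketch, and nothing in it has been profiled against the real data.

```cypher
// Sketch only: ring of 3..5 hops plus the autocatalytic edge into beginMol.
MATCH ringPath = (beginMol:Molecule)-[:FORMS*3..5]->(beginMol)
// Drop the repeated beginMol at the end of the path.
WITH beginMol, ringPath, nodes(ringPath)[..-1] AS ringNodes
// Rule: as many distinct nodes as edges, i.e. no node is visited twice.
WHERE all(n IN ringNodes WHERE single(m IN ringNodes WHERE m = n))
MATCH (catMolInRing:Molecule)-[:FORMS]->(beginMol)
WHERE catMolInRing IN ringNodes
  // Rule: not the ring node whose edge already closes the ring into beginMol.
  AND catMolInRing <> ringNodes[-1]
  // Assumption: beginMol itself does not count as catMolInRing.
  AND catMolInRing <> beginMol
RETURN [m IN ringNodes | m.smiles_str] AS ringMols,
       catMolInRing.smiles_str AS catMol
LIMIT 1000
```

Checking node uniqueness with list predicates keeps one row per ringPath, which avoids the UNWIND and collect step of the full query.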

Thank you in advance for reading this far!
-J

Hi J,

Do you have a Sandbox or a CSV I could use to import dummy data? I wanted to create a testable scenario, but I had no idea what smiles_str contains (even though it doesn't look too relevant).

Thanks,

H

Hi Harold, thank you for the response!

Here are the import text files, split into nodes/edges: https://github.com/Reaction-Space-Explorer/reac-space-exp/tree/cycle-queries-and-network-analysis/neo4j_loader_and_queries/mock_data/exported

-J

Hi J,

I have a couple of things to comment on. I suggest you create a new relationship (I called it JUMPS) that merges all the parallel FORMS relationships between two Molecules (different rules, reaction ids...) into a single one.
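
Something along these lines would create it. This is a minimal sketch rather than necessarily the exact statement I ran, and it assumes the relationship name JUMPS used in the query below and that none of the FORMS properties need to be carried over:

```cypher
// Collapse parallel FORMS relationships into one JUMPS per ordered node pair.
MATCH (a:Molecule)-[:FORMS]->(b:Molecule)
// One row per (a, b) pair, however many FORMS relationships connect them.
WITH DISTINCT a, b
// MERGE keeps the statement safe to re-run: existing JUMPS are not duplicated.
MERGE (a)-[:JUMPS]->(b)
```

With one JUMPS per pair of molecules, a variable-length expansion no longer has to walk every parallel FORMS relationship separately.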

So far I've been testing just the ring detection with Oxygen (O) as a reference, but I can't find a ring with more than 3 molecules (so I can't find your example):

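// Start from the reference molecule
MATCH (n:Molecule)
WHERE n.smiles_str = 'O'
// Molecules with an edge back into n can close a ring
MATCH (terminator:Molecule)-[:JUMPS]->(n)
WITH collect(terminator) AS terminatorNodes, n
// Expand outgoing JUMPS from n and stop at any terminator
CALL apoc.path.expandConfig(n, {
  relationshipFilter: 'JUMPS>',
  uniqueness: 'NODE_GLOBAL',
  minLevel: 2,
  terminatorNodes: terminatorNodes
}) YIELD path
// The closing edge is not part of path, so ring size = length(path) + 1
RETURN n, length(path) AS l
ORDER BY l DESC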

Can you check if something similar in your data set gives you the ring?

Can you share the smiles_str of the nodes in the example?

Edit:
Are catMolInRing, beginMolConsumer, feederMol, and consumerMol mere classifications? I mean, is catMolInRing unique per ring, or is it just a way to classify every mol inside the ring that creates smaller rings? Same question for the rest.

Thanks,

H