Hi all,
I am struggling with an optimization of my Graph Model and I was hoping you could help me!
All Cypher queries and profiles were run on Neo4j Desktop 1.2.7 with Neo4j 4.0.3.
I have a graph model in which I label my Gene nodes like this:
CREATE (g:Symbol:Gene {gname:'Gene1'})
As some of you may know, genes can have quite a large number of aliases, and I am on the fence about which of the following ways I should use to model them:
(1) (g:Symbol:Gene)-[:HAS_ALIASES]->(a:Aliases {synonyms:['Alias1', 'Alias2', ...]})
(2) (g:Symbol:Gene)-[:HAS_ALIAS]->(a1:Alias {synonym:'Alias1'}), (g:Symbol:Gene)-[:HAS_ALIAS]->(a2:Alias {synonym:'Alias2'}), ...
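To make the two options concrete, here is a minimal sketch of how a test gene with four aliases could be created under each model. The alias values are placeholders, and attaching both models to the same gene node is an assumption made only to keep the example short.

```cypher
// Approach (1): a single Aliases node holding all synonyms as a list property
CREATE (g:Symbol:Gene {gname: 'Gene1'})
CREATE (g)-[:HAS_ALIASES]->(:Aliases {synonyms: ['Alias1', 'Alias2', 'Alias3', 'Alias4']});

// Approach (2): one Alias node per synonym, each linked to the gene
MATCH (g:Symbol:Gene {gname: 'Gene1'})
UNWIND ['Alias1', 'Alias2', 'Alias3', 'Alias4'] AS s
CREATE (g)-[:HAS_ALIAS]->(:Alias {synonym: s});
```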
It is important for me that these lookups are as fast as possible, because you never know beforehand whether an incoming dataset contains only official symbols (e.g. HGNC), only synonyms, or a combination of both.
I have run a quick profile using both approaches and the results were a bit surprising to me. For the test I created one gene with four aliases using approach (1) and approach (2). When I profiled my query, approach (2) had fewer db hits but took longer than approach (1). And when I indexed the 'synonym' property, it took even longer with even fewer db hits?
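For reference, a sketch of what the profiled lookups could look like for each approach. These are illustrative queries built from the patterns above, not necessarily the exact ones I ran; 'Alias1' is a placeholder and the index name alias_synonym is hypothetical.

```cypher
// Approach (1): scan the list property on the Aliases node
PROFILE
MATCH (g:Gene)-[:HAS_ALIASES]->(a:Aliases)
WHERE 'Alias1' IN a.synonyms
RETURN g.gname;

// Approach (2): match a single Alias node by its synonym property
PROFILE
MATCH (g:Gene)-[:HAS_ALIAS]->(a:Alias {synonym: 'Alias1'})
RETURN g.gname;

// Index on the synonym property (Neo4j 4.0 syntax), then re-run the approach (2) query
CREATE INDEX alias_synonym FOR (a:Alias) ON (a.synonym);
```

With only one gene and four aliases, the timings are likely dominated by query planning and caching rather than by the data model, so running each query several times and comparing on a larger dataset would probably give a fairer picture.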
I thought approach (2) would win for sure, because Neo4j is optimized for traversals and not for the retrieval of a long list of properties. Can someone explain to me why this is happening, or suggest a better way of modelling this? The same problem also applies to other IDs, especially Ensembl gene and protein IDs.
Thanks in advance for your feedback!