Inconsistent run time for query

py2neo
jupyter

(Christine N Buckler) #1

My query's run time varies between runs, anywhere from 1 min to 5 min to 20 min. Is there a reason for this inconsistency? Is it because of the rand() in the first MATCH?

```
MATCH (a:Tops) WITH a ORDER BY rand() LIMIT 1
MATCH (a) -[s1:CC_SCORE]- (b:Bottoms),
      (a) -[s2:CC_SCORE]- (c:Shoes),
      (a) -[s3:CC_SCORE]- (d:Bags),
      (a) -[s4:CC_SCORE]- (e:Jewelry),
      (b:Bottoms) -[s5:CC_SCORE]- (c:Shoes),
      (b:Bottoms) -[s6:CC_SCORE]- (d:Bags),
      (b:Bottoms) -[s7:CC_SCORE]- (e:Jewelry),
      (c:Shoes) -[s8:CC_SCORE]- (d:Bags),
      (c:Shoes) -[s9:CC_SCORE]- (e:Jewelry),
      (d:Bags) -[s10:CC_SCORE]- (e:Jewelry)
RETURN toFloat(s1.score) + toFloat(s2.score) + toFloat(s3.score) + toFloat(s4.score)
     + toFloat(s5.score) + toFloat(s6.score) + toFloat(s7.score) + toFloat(s8.score)
     + toFloat(s9.score) + toFloat(s10.score) AS totalScore, a, b, c, d, e
ORDER BY totalScore DESC LIMIT 1;
```

- neo4j version 3.3.4
- using py2neo in jupyter notebook
- `PROFILE` image attached

(Michael Hunger) #2

Can you share the PROFILE output of your query? It seems that the attachment didn't make it.
I also don't see a rand() in your query, but yes, it could totally affect the volume of data processed, depending on how you use the results.

You should add relationship directions
and possibly a label for a.

What is CC score?

This is a global query. So it might depend also on your configured memory and graph size.
What is your heap/page-cache config?
And are you using Community or Enterprise (which comes with Neo4j Desktop)?


(Christine N Buckler) #3

Sorry, not sure why the PROFILE plan image didn't come through the first time... Also, I edited the query text; for some reason the first line with the rand() was hidden. CC_SCORE is a pairwise score (0-1) between nodes. Only relationships with a score higher than 0.99 were added to the graph. Where can I find the heap/page-cache config? I am using the Community edition, but I believe my company does have Enterprise.


(Michael Hunger) #4

Page-cache and heap are configured in neo4j.conf, which, depending on your system, is either in $NEO4J_HOME/conf or at /etc/neo4j/neo4j.conf.
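
To check which memory settings are actually active in that file, a small script can separate the active `dbms.memory.*` lines from the commented-out ones. This is only a sketch: the function name `read_memory_settings` and the sample values are made up for illustration, and it assumes the Neo4j 3.x setting-name prefix `dbms.memory.`.

```python
# Prefix of the memory-related settings (as named in Neo4j 3.x).
MEMORY_PREFIX = "dbms.memory."


def read_memory_settings(conf_text):
    """Return (active, commented_out) dicts of dbms.memory.* settings."""
    active, commented_out = {}, {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        is_commented = line.startswith("#")
        line = line.lstrip("#").strip()
        # Skip everything that is not a "dbms.memory.<name>=<value>" line.
        if not line.startswith(MEMORY_PREFIX) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        target = commented_out if is_commented else active
        target[key.strip()] = value.strip()
    return active, commented_out


# Illustrative sample only; for a real check, pass the file contents instead,
# e.g. open("/etc/neo4j/neo4j.conf").read()
sample = """\
#dbms.memory.pagecache.size=10g
dbms.memory.heap.max_size=4g
dbms.jvm.additional=-XX:+AlwaysPreTouch
"""
active, commented_out = read_memory_settings(sample)
print("active:", active)
print("commented out (defaults apply):", commented_out)
```

A setting that only shows up in the commented-out group is not in effect, so the server falls back to its default for it.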

As you can see in your profile, it touches quite a lot of data.

There might be several ways to optimize the query. One could be to limit cardinality earlier by aggregating duplicate values.

Another one could be moving from a single MATCH statement to one MATCH per pair, so relationship uniqueness doesn't have to be computed.
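
One way to try the one-MATCH-per-pair idea from a notebook is to generate the query text rather than type all ten clauses by hand. The sketch below is not a tested rewrite of the original query: `build_pairwise_query` is a made-up helper, `$topId` is a hypothetical parameter holding the id of an already chosen Tops node, and it assumes every node carries exactly one of the five labels (so dropping relationship uniqueness across clauses does not change the result).

```python
from itertools import combinations

# Variable names and labels as used in the original query.
NODES = [("a", "Tops"), ("b", "Bottoms"), ("c", "Shoes"),
         ("d", "Bags"), ("e", "Jewelry")]


def build_pairwise_query(nodes=NODES, rel_type="CC_SCORE"):
    """Build a Cypher query with one MATCH clause per pair of node variables."""
    first_var, first_label = nodes[0]
    # $topId is a hypothetical parameter: the id of the starting Tops node.
    lines = [f"MATCH ({first_var}:{first_label}) WHERE id({first_var}) = $topId"]
    seen = {first_var}
    score_terms = []
    for n, ((v1, l1), (v2, l2)) in enumerate(combinations(nodes, 2), start=1):
        # Attach the label only the first time a variable appears.
        left = v1 if v1 in seen else f"{v1}:{l1}"
        right = v2 if v2 in seen else f"{v2}:{l2}"
        seen.update((v1, v2))
        lines.append(f"MATCH ({left})-[s{n}:{rel_type}]-({right})")
        score_terms.append(f"toFloat(s{n}.score)")
    variables = ", ".join(v for v, _ in nodes)
    lines.append(f"RETURN {' + '.join(score_terms)} AS totalScore, {variables}")
    lines.append("ORDER BY totalScore DESC LIMIT 1")
    return "\n".join(lines)


print(build_pairwise_query())
```

Whether this actually reduces the work depends on the plan the planner picks, so it is worth running the generated query with PROFILE and comparing db hits against the single-MATCH version.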

Picking the first value is probably easier done by returning the ids of the Tops nodes and then picking one out of that list on the client.
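
A minimal sketch of that client-side pick, so the database no longer has to sort every Tops node by rand(). The names `pick_random_top_id` and `run_query` are made up here: `run_query` stands for any callable that takes a Cypher string and returns a list of row dicts. With py2neo that could presumably be something like `lambda q: graph.run(q).data()`, but that wiring is an assumption and is not exercised below.

```python
import random

# Return only the internal ids, not the full nodes.
TOP_IDS_QUERY = "MATCH (a:Tops) RETURN id(a) AS id"


def pick_random_top_id(run_query, rng=random):
    """Fetch all Tops ids and choose one at random on the client.

    run_query is a hypothetical callable: Cypher string in, list of dicts out.
    """
    rows = run_query(TOP_IDS_QUERY)
    if not rows:
        raise ValueError("no Tops nodes found")
    return rng.choice([row["id"] for row in rows])


# Stand-in for a real database call so the sketch runs without a server.
def fake_run_query(cypher):
    return [{"id": 11}, {"id": 42}, {"id": 97}]


top_id = pick_random_top_id(fake_run_query)
print(top_id)
```

The chosen id can then be passed to the main query as a parameter, which also makes a slow run reproducible: rerunning with the same id shows whether the variation comes from the starting node or from caching.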


(Christine N Buckler) #5

From the conf file:
#dbms.memory.pagecache.size=10g (currently commented out)
dbms.jvm.additional=-XX:+AlwaysPreTouch
Are these the settings you were looking for?

Can you expand on what you mean by limit cardinality? I'm not seeing where the duplicate values are.
I am also thinking about trying a single MATCH per pair; not sure if this could cause dead ends later on.
Why would you return a property instead of the node itself?


(Michael Hunger) #6

Yes, that page-cache setting,
and there are heap settings too (dbms.memory.heap.initial_size and dbms.memory.heap.max_size).

What I meant is: if you do a 3-hop expand, then at the first hop all neighbours are unique, but at hops 2 and 3 you revisit certain nodes multiple times (they are reachable via different paths), and then those have to be expanded again. So aggregating them back down to a minimal set between hops is beneficial.
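
The effect can be seen on a toy graph. The sketch below is plain Python, not Neo4j: the graph and the `expand` helper are made up for illustration. It counts how many node expansions a 3-hop traversal performs with and without collapsing the frontier to distinct nodes after each hop (roughly what a `WITH DISTINCT` between hops does in Cypher).

```python
def expand(graph, start, hops, dedupe):
    """Expand `hops` steps from `start`; return (frontier, expansions done)."""
    frontier = [start]
    expansions = 0
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            expansions += 1  # one expansion per frontier entry, duplicates included
            next_frontier.extend(graph[node])
        # Collapse duplicates so each node is expanded once at the next hop.
        frontier = sorted(set(next_frontier)) if dedupe else next_frontier
    return frontier, expansions


# Made-up toy graph: c1 and c2 are each reachable via three different paths.
toy = {
    "a": ["b1", "b2", "b3"],
    "b1": ["c1", "c2"],
    "b2": ["c1", "c2"],
    "b3": ["c1", "c2"],
    "c1": ["d"],
    "c2": ["d"],
    "d": [],
}

print(expand(toy, "a", 3, dedupe=False))
print(expand(toy, "a", 3, dedupe=True))
```

On this toy graph the plain traversal performs 10 expansions and ends with six copies of `d`, while the deduplicated one performs 6 and ends with a single `d`. Collapsing discards how many paths led to each node, so it only helps where the individual paths are not needed afterwards.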

I'm still pondering how to best rewrite your query.


(Michael Hunger) #7

Would it by chance be possible to share your graph database with me?


(Christine N Buckler) #8

Is there an easy way to share the graph db with you directly?


(Michael Hunger) #9

You can PM me a dropbox/drive/s3 link of the zipped graph.db folder.

Thank you.


(Christine N Buckler) #10

I'm not seeing an option to PM you. Is this feature not available for all users? Perhaps through another platform?