How to optimize queries on array properties?

For example,

MATCH (m:Product) 
WHERE 'moisturize' IN m._effect
RETURN m

In this query, '_effect' is a property of an array. When the query above returns non-empty results, it takes only 2ms, shown in the browser. And when the result set is empty, it takes over 100ms to complete. I already created on the array property '_effect'. Why doesn't the index have an effect on the IN keyword in the query, when the query returns nothing?

However, Profile shows that:
In my database, when result is empty, it did 84409 db hits for NodeByLabelScan for 84408 estimated row, then another 84408 hits for the IN filter.
When the result is not empty, the number of db hits are the same, 84409 and 84408. The difference is this number:

Cypher version: CYPHER 4.0, planner: COST, runtime: INTERPRETED. 168817 total db hits in 8 ms.
Cypher version: CYPHER 4.0, planner: COST, runtime: INTERPRETED. 168817 total db hits in 109 ms

So, the number of DB hits are the same, but the time to complete is very different and this is shown in the browser: 8 VS 109. Why is that?

this is effectively the same as Index on array which is to say that at this time we do not support indexing on array properties

@dana_canzano I have a general question on cypher query and times spent on them. In my application, the time allocated to neo4j queries is only <20 milliseconds per request. However, it seems I saw a lot of cyphers when it is a little complex, it takes more than 100 mill-seconds. Is this the reality for db query or my model isn't well designed? Generally, for is it common for a 100 millsecond query to occur in real-time application? Real-time application I mean production database. Clusters don't help with single request performance. Right?

I believe neo4j have been used by a lot of people for production system, how do they deal with performance, latency issue? Is it a realistic expectation to optimize most queries to take less than 50 or even less milliseconds for neo4j?

What are your requirements? Is 100ms too long? Must all queries return in < 2ms ??

But also performance is a large domain not just of Neo4j but any database. Are you queries written in a manner that is to achieve performance? Do you have indexes were they would prove beneficial?
Also, performance is related to hardware. Is your storage using SSDs or spinning disks? Does the size of your graph exceed total RAM?

For instance, if my QPS requirement is 500, does that mean in every millisecond, it has to complete 2 requests?

1 Like

yes from a simple mathematical perspective thats correct.

That seems too harsh, but it is a real requirement. DB query alone may exceed that limit. @dana_canzano any suggestion to achieve that? Assuming that my hardware and other software is optimized to the best extent, neo4j can also make sure most queries to complete in a few milliseconds?

That seems too harsh, but it is a real requirement

thats not a question i can answer. The requirement would need to be set by the end users. i.e. if for example you are doing real time stock trading analysis, maybe a 2ms response is required. Again its not something I can answer.

also regarding

Assuming that my hardware and other software is optimized to the best extent, neo4j can also make sure most queries to complete in a few milliseconds?

its not as if Neo4j can just make queries finish in a few milliseconds. again there are lots of variables in play here. i.e. the size of the graph, the complexity of the queries, he amount of data to be returned by a given query, the number of concurrent queries, as well as hardware

@dana_canzano. This is my code to execute the query for users to call. Regarding 'concurrent queries', you mean clients write concurrent code to make requests, or I need to write concurrent cypher/driver code for neo4j?

 with self.neo4j_driver.session() as session:
            return session.read_transaction(self._query_by_node_property, properties, match_mode, name_expansion, full_node, node_type, channel, limit)

By concurrent queries Im referring to the number of queries in flight/running at the same time regardless of if those queries are reads or writes.

Let's say everyone ( where everyone = 50 people) arrives at the office at 9a and everyone logs in and everyone all at the same time issues a query to report the number of orders received yesterday broken down by hour. Those 50 queries come in all at the same time. Thus 50 concurrent queries

1 Like

So clients send concurrent requests to neo4j to get results. For the code example above, do I need to do something special to support concurrent requests either by coding or DB configuration, or it's inherently supported?

1 Like

Nothing special. Are you seeing issues ???

I haven't tested concurrent queries yet. Will do that later when I see normal requests are optimized to satisfactory level.

For QPS=500, it means as long as you complete 500 requests in one second, you are fine. It is NOT necessarily mean on average each query need to take 2 milliseconds. For concurrent queries, 500 QPS can be easily achieved than non-concurrent queries. If that' s the case, if one query takes 100 milliseconds, and the overall QPS=500, it still meet the requirements. Is that right? I haven't considered about relationship between concurrency and QPS.

1 Like

וואלה חשבתי שאנחנו חברים ומיסתבר לי עכשיו שדווקא אתם שגרמתם לי לתקלות. אני נפתלי סרוסי שמפרגן לכם גם בכול מקום איפשרי. סך הכול אמרתי לכם שאיפשר ליהיות לשלום עם כולם. גם אם לא הישתמשתי באיזשהי אפליקציה של מיתחרים אחרים. זה לא אומר שהייתם צריכים לעשות ככה. אמי היקרה והקדושה הלכה לעולמה ליפני שבוע וחצי ואפילו שיתפתי אותכם בכאב שלי. אתם גוגל ואתם במיוחד עלי שאישרתי לכם ליפתוח הכול מימפות ועד אתם יודעים עד כמה. אז למה ככה? וואלה תקופה לא פשוטה גם ככה. המקרר שלי ריק. ואני נילחם שנים על חיי והתקופה האחרונה יש החמרה קצת. בכול מצב תעשו מה שבא לכם. בא לכם לנתק לי הכול תנתקו כי היתעייפתי מהסברים וחוסר חמלה מצידכם על אדם שמיתוך כול מה שאני עובר תמיד מישתדל לפרגן. גם אם היה קשה ולא היה לי ראש לכלום ולאף אחד. רוב הפעמים בהיתאמצות לא קטנה אבל סבבה הכול טוב ויהיה עוד יותר טוב
לילה טוב ותודה באמת תודה

I don't know if you're willing to redo your schema (i.e. not use lists in a property), but I believe you'd get a speed up if you created separate properties for each item instead of having a list of items.

E.g. m.effect_moisturize = true OR have no property effect_moisturize (Null). (I have the prefix "effect_" so that you can search on property names in case you need to, which I believe is possible...)

I think it might be faster because I think the lookup to see if a property exists or not is faster than see if a string is in a list.

Of course, this may not be what you want to do. (You might consider keeping both a list of strings and properties representing the strings in the list.)