Did you try adding an index on the location property, as I asked about above? I believe this will cut down a lot on the number of calculations made.
Yes, I had added an index with:
CREATE INDEX index_user_location FOR (n:GraphUser) ON (n.location)
Do I need to create it with different options?
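For example, I'm wondering whether a dedicated point index would behave differently. If this is Neo4j 5.x, here's a sketch of what I could try (the index name is just illustrative):

// Sketch, assuming Neo4j 5.x syntax: a POINT index is the index type the
// planner can use for distance and bounding-box predicates on point values.
CREATE POINT INDEX index_user_location_point IF NOT EXISTS
FOR (n:GraphUser) ON (n.location)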
Hi @aross, what is your SLA on running this query? How long did it take before? What is the performance baseline? Is your objective to decrease the DB hits? Which other RDBMS are you comparing it to? After clearing the cache, how long did the query take when you re-ran it?
Remember, having an index doesn't solve everything (it will cost you performance when writing data). According to RDBMS theory, if the optimizer has to scan a large fraction of the data, it will switch from an index scan to a full table scan once the statistics suggest roughly 25-35% of the rows are involved, because it doesn't want to walk through all the index leaf nodes.
You mentioned an RDBMS: which RDBMS, and what is the hardware spec? What is your macOS spec? Is this part of an enterprise production database farm, or is it just a local PoC?
Are you migrating the data off the RDBMS for reasons such as cost, licensing, etc.?
Can you verify whether you are running the Enterprise or Community edition?
Apologies for asking so many questions. I just want to understand the objective, both technical and non-technical.
I haven't done anything with geospatial before, but with my experience and book knowledge I can ask sanity questions to help diagnose the problem...
In the very first PROFILE you showed in this thread, the distance query resulted in 6M DB hits. How many GraphUser nodes do you have in total? If it's around 6M, then the original query probably had to do a linear search (painfully slow!) over the entire set of GraphUser nodes.
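A quick way to check the total:

// A bare label count is served from Neo4j's count store, so it should
// return almost instantly regardless of graph size.
MATCH (n:GraphUser) RETURN count(n)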
In theory, a geospatial index should help reduce the search space, since the query should be able to quickly exclude points outside the bounding box using the index (it would gather the nodes that fall within the lat/long bounding box corresponding to a distance of 100000). I'm not sure whether you had the geospatial index on location for that first attempt or not.
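To make that concrete, here's roughly the shape of query the index should accelerate. This is only a sketch, since I don't have your exact query in front of me; I'm assuming the 4.x distance() function (point.distance() on 5.x) and hypothetical $lat/$lon parameters:

// Sketch: with a usable index on :GraphUser(location), the planner should
// prune to the bounding box implied by the 100000m radius before applying
// the exact distance check, instead of scanning every node.
MATCH (n:GraphUser)
WHERE distance(n.location, point({latitude: $lat, longitude: $lon})) < 100000
RETURN n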
Are you using PROFILE or EXPLAIN? I think PROFILE might be better, as it tells you what actually occurred.
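To be explicit, you just prepend the keyword to the statement (reusing the hypothetical sketch from above):

// PROFILE actually executes the query and reports the real DB hits per
// operator; EXPLAIN only shows the estimated plan without running it.
PROFILE
MATCH (n:GraphUser)
WHERE distance(n.location, point({latitude: $lat, longitude: $lon})) < 100000
RETURN count(n)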
One thing to try is running the query with and without the index on location. If it doesn't improve, then the geospatial indexing is not behaving the way I expect and you'll have to ask someone else... (You could duplicate the DB, drop the geospatial index in the duplicate, and rerun the query there.)
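For reference, dropping it in the duplicate should just be (assuming the index name from earlier in the thread):

// Removes the index by name so the same PROFILE can be re-run without it.
DROP INDEX index_user_location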
I am somewhat puzzled by the PROFILE you did based on my subquery suggestion. The subquery took 91,872 DB hits, but the geospatial query that follows has almost exactly twice as many: 183k. This suggests to me that the geospatial search over the subquery's results has to do a linear search twice (once for lat and once for long). I guess this makes sense, but it's why the query is still slow.
The other thing: are there any other properties you can use to narrow down the search before passing it on to the geospatial step?
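For example, something shaped like this, where active is a purely hypothetical property I'm making up for illustration:

// Hypothetical: filter on a selective (ideally indexed) property first so
// far fewer nodes ever reach the distance predicate.
MATCH (n:GraphUser)
WHERE n.active = true
  AND distance(n.location, point({latitude: $lat, longitude: $lon})) < 100000
RETURN n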
Lastly, what if you make the distance something smaller (e.g., 1000 instead of 100000)? If the indexing is working, then I believe the PROFILE might show far fewer DB hits. If it doesn't, then something isn't working the way I expect.
I need to think about this some more...
Hi @dominicvivek06 , thank you for your help.
I'll take a step back and explain my motivation. I have a social app currently live in production that uses Postgres as its primary datastore (AWS/RDS). Data is spread across other databases, but currently the tables storing social data are in Postgres. This decision was made for familiarity and to speed up development, and also because some early business decisions did not obviously point to a graph as the best option.
We have a relatively small userbase, but due to the way social data grows, we already have some large tables (20M+ rows) that we run multiple joins against for some queries. While the current infrastructure is fine at this level, this data will only keep growing rapidly, and we are of course hoping for many more users when we enter our growth phase later in the year.
So I am exploring moving the social graph out of Postgres and into an actual graph database for long-term growth. Exporting/importing these rows and relationships was quite easy, and certain queries already make much more sense in a graph (those dealing with established graph relationships).
But I am somewhat confused by the query I've been asking about in this thread: a simple geospatial query that is performing much slower than something similar in Postgres. And this is just the first step of a much larger query that will look for nodes with a relationship to a given user and union those with other nodes that lack a specific relationship (which will bring cartesian products into the query). If I can't get it to use the right index for this one part, I'm feeling fairly hopeless about getting the rest of it right.
Right now, I'm just testing and profiling these queries on my local Mac (it is a very good machine, though, and Postgres runs very fast on it with our full production database). I am just running Neo4j Desktop but can try the Enterprise or Community edition if you think it would be useful (the Desktop version has been convenient).
Anyway, that's the high-level view. I'd love for Neo4j to work, because it does make sense for this collection of data.