I'm transitioning from SQL to Neo4j and am accustomed to indexing foreign key columns and filtering on them in relational databases. In SQL, if a table has two foreign keys, I can filter on both, and thanks to indexing, this filtering happens very quickly.
Now, I'm working with a graph database for the first time and have encountered a scenario where my initial filtering feels like it's performing a full table scan, which is obviously very inefficient.
I want to find all Person nodes who work at a specific Company and live in a specific City. I know I can start the MATCH pattern from either the Company or City nodes and then filter down to the Person nodes, but the number of Person nodes is still very large, making this approach slow.
Question:
Is there a technique in Neo4j to index these relationships to make the query more efficient? Any advice or best practices would be greatly appreciated.
In Neo4j, traversing relationships (e.g., from a Company to a Person, or a City to a Person) is essentially a pointer lookup, which is even faster than an index lookup. You can create an index on City (say, on the name property) and also on Company to help Neo4j find the starting point quickly. Have you tried using EXPLAIN or PROFILE to see how Neo4j is actually processing your query?
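For example, assuming both Company and City carry a name property (as the queries below do), the anchor indexes could be created like this, and PROFILE used to inspect the resulting plan - a sketch, not a definitive setup:

```cypher
// Index the anchor properties so the planner can locate the
// starting nodes without scanning all Company/City nodes.
CREATE INDEX company_name IF NOT EXISTS FOR (c:Company) ON (c.name);
CREATE INDEX city_name IF NOT EXISTS FOR (c:City) ON (c.name);

// Then inspect how the optimizer actually executes a query:
PROFILE
MATCH (p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
RETURN count(p);
```

With the index in place, the plan should show an index seek on Company followed by a relationship expansion, rather than a scan over Person.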
Hi John, thanks for your reply. Indexing Company or City isn't the problem; it's the case where I want to filter Person based on its relationships with both Company and City. As far as I can tell, this would require testing every Person node to see whether such relationships exist, i.e., something similar to performing a full table scan. That is the scenario I want to avoid. Thank you.
I get your point, but that's likely not how Neo4j will execute the query. If you start from, say, a Company, Neo4j can traverse the [:WORKS_AT] relationships to find the people. Even though relationships are directed, Neo4j can traverse them in both directions. Maybe you can try a query (with EXPLAIN) and check the results.
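As an illustration (a sketch reusing the labels and relationship types from this thread), anchoring the pattern on the Company node lets the planner expand only that node's WORKS_AT relationships instead of scanning Person:

```cypher
// The planner resolves the Company via an index seek, then expands
// its incoming WORKS_AT relationships - no scan over Person nodes.
EXPLAIN
MATCH (c:Company {name: $company_name})<-[:WORKS_AT]-(p:Person)
RETURN p.name;
```

The EXPLAIN output should show the expansion starting from the Company side, which is the behavior described above.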
Thanks for your insights. I understand that Neo4j’s relationship traversal is more efficient than an index lookup in a relational database. However, my concern isn't about the traversal itself, but rather about the potential inefficiency when filtering Person nodes by both Company and City relationships.
To clarify, my worry is that in order to find Person nodes that both WORK_AT a specific Company and LIVE_AT a specific City, Neo4j might need to check all Person nodes to see if they have these relationships, which seems analogous to performing a full table scan in SQL.
I'm looking for advice on how to avoid what feels like this exhaustive filtering process. Is there a way to optimize this type of query so it doesn’t involve checking all Person nodes for these two relationships?
I hope this clears up my concern. I appreciate any guidance you can offer!
MATCH (p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
WHERE EXISTS { (p)-[:LIVES_AT]->(:City {name: $city_name}) }
RETURN count(p)
Mainly for my own benefit when I read my own code later: I make the assumption that fewer persons work for a given company than live in a given city. I expect the part (p:Person)-[:WORKS_AT]->(:Company {name: $company_name}) to be more "selective", thus reducing "wasted traversing of relationships".
However, the query planner will use statistics about your graph to decide which leg of the path (:City)<-[:LIVES_AT]-(:Person)-[:WORKS_AT]->(:Company) to expand first, and that choice may change as your graph grows.
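The same query can also be written as a single two-leg pattern with both anchors in one MATCH; the planner is then free to start from whichever end its statistics say is cheaper (a sketch reusing the parameters above):

```cypher
// Equivalent formulation: one path with both anchor nodes.
// The planner picks the more selective end as the starting point,
// then filters the expanded persons against the other anchor.
MATCH (:City {name: $city_name})<-[:LIVES_AT]-(p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
RETURN count(p)
```

Running this under EXPLAIN on your own data would show which leg the planner currently chooses.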
Thanks for the input. One approach I’m considering is adding companyId and cityId as properties directly on the Person node and then creating a composite index on these properties. This would allow me to efficiently query Person nodes based on both relationships.
However, this approach would require manually updating these properties whenever the WORKS_AT or LIVES_AT relationships change. Alternatively, do you know if I could use APOC triggers to automate this process?
Do you think this is a viable solution for optimizing the query, or do you see any obvious downsides to it?
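For what it's worth, the composite-index half of that idea is straightforward; the trigger half is where the maintenance cost lives. A rough sketch, assuming APOC is installed (apoc.trigger.add and the $createdRelationships parameter are from the APOC trigger API; the companyId/cityId property names are hypothetical, as is the assumption that Company nodes carry a companyId):

```cypher
// Composite index over the denormalized, foreign-key-style properties.
CREATE INDEX person_company_city IF NOT EXISTS
FOR (p:Person) ON (p.companyId, p.cityId);

// APOC trigger (APOC 4.x style) that copies the company's id onto the
// Person whenever a WORKS_AT relationship is created. A LIVES_AT
// counterpart, plus handlers for deleted relationships, would also be
// needed to keep the properties consistent.
CALL apoc.trigger.add(
  'sync-company-id',
  'UNWIND $createdRelationships AS r
   WITH r WHERE type(r) = "WORKS_AT"
   SET startNode(r).companyId = endNode(r).companyId',
  {phase: 'after'}
);
```

The downside is exactly the trade-off discussed below: every relationship write now carries extra trigger work, and the denormalized properties break down if a person has more than one company or city.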
For the type of query where you anchor on two nodes and traverse two "legs", there is always a worry that it becomes a lot of work for supernodes, like "Google" and "Mountain View". But just as you describe, optimizing the model for read queries may have an impact on write queries (why is everything in life a trade-off?). EDIT: I also see a problem with your suggestion if persons work for more than one company or live in more than one place.
My recommendation (not how to solve it, but how to approach it) is to wait and see for a while, until you have more evidence (or more queries) for how your application will interact with the graph.
With some luck, the query "which persons live in x and work for y" will not be important for your application, unless your application is mainly analytical/reporting. And if that is the case, just throw it at the parallel runtime.
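If it does turn out to be a heavy analytical query, the parallel runtime can be selected per query with a CYPHER prefix (available in recent Neo4j 5 Enterprise versions; a sketch reusing the pattern from this thread):

```cypher
// Opt this one query into the parallel runtime; the rest of the
// workload keeps using the default runtime.
CYPHER runtime=parallel
MATCH (:City {name: $city_name})<-[:LIVES_AT]-(p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
RETURN count(p)
```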
Apologies for my wild guesses and assumptions about your application.